Spark on YARN: WordCount and TopK
2024-10-07 17:56:13
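Original article: http://blog.csdn.net/cklsoft/article/details/25568621
1. First, write the Scala source file in the Eclipse (Scala) development environment set up as described at http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/. The content is as follows: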
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Note: run_wc.sh in step 3 launches this class as sh.HdfsWordCount, so either
// declare a matching package here or adjust the --class argument.
object HdfsWordCount {
  def main(args: Array[String]) {
    // args(0): master, e.g. "yarn-standalone"
    val sc = new SparkContext(args(0), "myWordCount", System.getenv("SPARK_HOME"),
      SparkContext.jarOfClass(this.getClass))
    // Alternative jar list: List("lib/spark-assembly_2.10-0.9.0-incubating-hadoop1.0.4.jar")
    // args(1): input file, e.g. "hdfs://master:9101/user/root/spam.data"
    val logFile = sc.textFile(args(1))
    // val file = sc.textFile("D:\\test.txt")
    // Split each line on spaces, emit (word, 1), then sum the counts per word.
    val counts = logFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    // args(2): output directory, e.g. "hdfs://master:9101/user/root/out"
    counts.saveAsTextFile(args(2))
  }
}
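2. Use Eclipse's Export Jar File feature to compile the Scala source into class files and package them as sc.jar.
3. Run the run_wc.sh script (the --jar path and --class name must match the jar and package you actually built):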
#!/bin/bash
# Export SPARK_JAR so that the spark-class child process can see it.
export SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-1.0.0-SNAPSHOT-hadoop2.2.0.jar
./bin/spark-class org.apache.spark.deploy.yarn.Client \
--jar /root/spark/sh.jar \
--class sh.HdfsWordCount \
--args yarn-standalone \
--args hdfs://master:9101/user/root/hsd.txt \
--args hdfs://master:9101/user/root/outs \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1
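Before submitting to YARN, the WordCount transformation from step 1 can be checked locally. The following is only a minimal sketch on plain Scala collections, with no Spark dependency; `LocalWordCount` and `wordCount` are hypothetical names introduced here for illustration, and `groupBy` followed by a sum stands in for `reduceByKey`.

```scala
// Local stand-in for the Spark WordCount pipeline (illustrative only, not the Spark API).
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))  // same tokenization as the Spark job
      .map(word => (word, 1))            // emit (word, 1) pairs
      .groupBy(_._1)                     // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the ones per word

  def main(args: Array[String]): Unit =
    wordCount(Seq("a b a", "b c")).toSeq.sortBy(_._1).foreach(println)
}
```

Like the Spark job, this sketch splits on single spaces only, so repeated spaces produce empty-string tokens that are counted as words.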
Appendix:
TopK (select the k most frequent words) code:
package sc
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TopK {
  def main(args: Array[String]) {
    // Example arguments: yarn-standalone hdfs://master:9101/user/root/spam.data 5
    val sc = new SparkContext(args(0), "myWordCount", System.getenv("SPARK_HOME"),
      SparkContext.jarOfClass(this.getClass))
    // Alternative jar list: List("lib/spark-assembly_2.10-0.9.0-incubating-hadoop1.0.4.jar")
    // args(1): input file, e.g. "hdfs://master:9101/user/root/spam.data"
    val logFile = sc.textFile(args(1))
    val counts = logFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    // Swap to (count, word) so that tuples are compared by count first.
    val swapped = counts.map {
      case (key, val0) => (val0, key)
    }
    // top(k) already returns the k largest elements in descending order,
    // so no explicit sortByKey is needed before it.
    val topK = swapped.top(args(2).toInt)
    topK.foreach(println)
  }
}
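The selection step above can be reproduced on plain Scala collections as a rough sketch. `LocalTopK` and `topK` are hypothetical names used only for this illustration; sorting in descending tuple order and taking the first k elements is meant to mirror what `top(k)` returns for `(count, word)` pairs.

```scala
// Local stand-in for the TopK selection (illustrative only, not the Spark API).
object LocalTopK {
  def topK(counts: Map[String, Int], k: Int): Seq[(Int, String)] =
    counts.toSeq
      .map { case (word, count) => (count, word) } // swap so the count is compared first
      .sorted(Ordering[(Int, String)].reverse)     // descending by count, then by word
      .take(k)                                     // keep the k largest

  def main(args: Array[String]): Unit =
    topK(Map("a" -> 3, "b" -> 1, "c" -> 2), 2).foreach(println)
}
```

Ties on the count are broken by the word itself, because tuples are compared element by element.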
Appendix 2: join operation (for the problem statement, see http://dongxicheng.org/framework-on-yarn/spark-scala-writing-application/):
package sc
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SparkJoinTest {
  def main(args: Array[String]) {
    // args(0): master, e.g. "yarn-standalone"
    val sc = new SparkContext(args(0), "SparkJoinTest", System.getenv("SPARK_HOME"),
      SparkContext.jarOfClass(this.getClass))
    // Alternative jar list: List("lib/spark-assembly_2.10-0.9.0-incubating-hadoop1.0.4.jar")
    // args(1): ratings file, fields separated by "::"
    val txtFile = sc.textFile(args(1))
    // Parse each rating line into (movieId, rating).
    val rating = txtFile.map(line => {
      val fields = line.split("::")
      (fields(1).toInt, fields(2).toDouble) // the value of a block in braces is its last expression
    })
    // Average rating per movie: (movieId, avg).
    val movieScores = rating.groupByKey().map(
      data => {
        val avg = data._2.sum / data._2.size
        // if (avg>4.0)
        (data._1, avg)
      }
    )
    // args(2): movies file, fields separated by "::"
    val moviesFile = sc.textFile(args(2))
    val moviesKey = moviesFile.map(line => {
      val fields = line.split("::")
      (fields(0).toInt, fields(1))
    }).keyBy(tuple => tuple._1) // set the key: (movieId, (movieId, title))
    // join: (<k,v>, <k,w>) => <k,<v,w>>
    val res = movieScores.keyBy(tuple => tuple._1).join(moviesKey)
      .filter(f => f._2._1._2 > 4.0)              // keep movies whose average rating exceeds 4.0
      .map(f => (f._1, f._2._1._2, f._2._2._2))   // (movieId, avg, title)
    // args(3): output directory
    res.saveAsTextFile(args(3))
  }
}
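To make the join easier to follow, here is a small local sketch on plain Scala collections. It assumes, based only on the field indices used above, that rating lines look like `userId::movieId::rating::...` and movie lines like `movieId::title::...`; `LocalJoin`, `highRated`, and the sample data are hypothetical and serve only as an illustration.

```scala
// Local stand-in for the average-rating join (illustrative only, not the Spark API).
object LocalJoin {
  def highRated(ratingLines: Seq[String], movieLines: Seq[String],
                threshold: Double): Seq[(Int, Double, String)] = {
    // (movieId, rating) pairs grouped by movieId and averaged
    val avgByMovie: Map[Int, Double] = ratingLines
      .map { line => val f = line.split("::"); (f(1).toInt, f(2).toDouble) }
      .groupBy(_._1)
      .map { case (id, rs) => (id, rs.map(_._2).sum / rs.size) }
    // movieId -> title lookup table
    val titleById: Map[Int, String] = movieLines
      .map { line => val f = line.split("::"); (f(0).toInt, f(1)) }
      .toMap
    // Inner join on movieId, keeping only averages above the threshold.
    avgByMovie.toSeq
      .collect { case (id, avg) if avg > threshold && titleById.contains(id) =>
        (id, avg, titleById(id)) }
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val ratings = Seq("1::10::5::0", "2::10::4::0", "1::20::3::0") // made-up sample data
    val movies = Seq("10::Toy Story::Animation", "20::Heat::Action")
    highRated(ratings, movies, 4.0).foreach(println)
  }
}
```

A map lookup replaces the distributed `join` here, which is reasonable only because the sample data fits in memory.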