1. What is sbt

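I am a beginner with sbt myself. I picked up Scala in order to work with Spark, and the build tool recommended while learning Scala was sbt (sbt itself is written in Scala, after all). At first it looked to me like just another Maven (not that I have used Maven much either). After building two projects with it, I found it quite powerful, though the learning curve is a bit steep.

Then again, what doesn't have a learning curve these days? I digress. The getting-started guide for version 0.13 is here: http://www.scala-sbt.org/0.13/tutorial/zh-cn/index.html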

2. assembly is a packaging plugin for sbt

Below is an example from the getting-started guide:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Cache the lines because they are scanned twice below
    val logData = sc.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
// simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"

# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[] \
  target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23

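So far so good: everything works, because the Spark libraries you depend on are already present on the Spark master and workers. But if you depend on third-party libraries such as the MySQL JDBC driver, sbt's package command alone will not bundle them into the jar.

The job then fails when it runs on Spark, and if you have several worker machines, you have to install the same runtime environment (the dependency jars) on every one of them.

This is where the sbt assembly plugin comes in. Its job is to bundle all the dependency jars into a single fat jar.

It is not a cure-all, though. Things get awkward when several jars contain files with the same name.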
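
Before the assembly task is available, the plugin has to be declared in the project's plugin definitions. A minimal sketch, assuming the file is project/plugins.sbt and the plugin version is 0.14.1 (both are assumptions for illustration; check the sbt-assembly README for the version that matches your sbt):

```scala
// project/plugins.sbt (file name and version are assumptions, not from the original post)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
```

With the plugin declared, running sbt assembly instead of sbt package should build the fat jar under the target/scala-<version>/ directory.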

3. Resolving sbt-assembly deduplicate and exclude errors

Let's look at an example error first:

[error]  error was encountered during merge
[trace] Stack trace suppressed: run last *:assembly for the full output.
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-handler/jars/netty-handler-4.0..Final.jar:META-INF/io.netty.versions.properties
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-buffer/jars/netty-buffer-4.0..Final.jar:META-INF/io.netty.versions.properties
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-common/jars/netty-common-4.0..Final.jar:META-INF/io.netty.versions.properties
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-transport/jars/netty-transport-4.0..Final.jar:META-INF/io.netty.versions.properties
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-codec/jars/netty-codec-4.0..Final.jar:META-INF/io.netty.versions.properties
[error] Total time: s, completed -- ::

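Roughly speaking, several jars contain a file at the same path but with different contents, and assembly cannot decide which one to keep. What now?

You have to decide by hand. assembly provides rules for excluding files from the package and for merging conflicting ones, and you write them in the build.sbt file.

Reference: https://github.com/sbt/sbt-assembly#excluding-jars-and-files

In our case the build file looks like this (note: sbt was at 0.13, the latest version at the time):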
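
As a rough sketch of the exclusion side covered by that README section, a whole jar can be kept out of the fat jar with the assemblyExcludedJars key. The jar name below is a hypothetical placeholder, and the keys are written in the sbt 0.13 style used in this post:

```scala
// build.sbt (sketch; "unwanted-lib-1.0.jar" is a hypothetical jar name)
assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  // Select the classpath entries to exclude: those whose file name matches the unwanted jar
  cp filter { _.data.getName == "unwanted-lib-1.0.jar" }
}
```

This removes entire jars. When only individual files inside the jars collide, a merge strategy is usually the better fit.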

qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $cat build.sbt

name := "CassandraTest"

version := "1.0"

scalaVersion := "2.10.4"

// Leave the Spark dependency out of the fat jar: "provided" means the runtime environment already supplies it
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"

// Dependency on the spark-cassandra-connector library
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2"

// For files ending in .properties, use MergeStrategy.first: keep the first file encountered
assemblyMergeStrategy in assembly := {
  case PathList(ps @ _*) if ps.last endsWith ".properties" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

That fixes it. For other conflicts, adjust the merge strategy accordingly.

> assembly
[info] Including from cache: slf4j-api-1.7..jar
[info] Including from cache: metrics-core-3.0..jar
[info] Including from cache: netty-codec-4.0..Final.jar
[info] Including from cache: netty-handler-4.0..Final.jar
[info] Including from cache: netty-common-4.0..Final.jar
[info] Including from cache: joda-time-2.3.jar
[info] Including from cache: netty-buffer-4.0..Final.jar
[info] Including from cache: commons-lang3-3.3..jar
[info] Including from cache: jsr166e-1.1..jar
[info] Including from cache: cassandra-clientutil-2.1..jar
[info] Including from cache: joda-convert-1.2.jar
[info] Including from cache: netty-transport-4.0..Final.jar
[info] Including from cache: guava-16.0..jar
[info] Including from cache: spark-cassandra-connector_2.-1.5.-M2.jar
[info] Including from cache: cassandra-driver-core-2.2.-rc3.jar
[info] Including from cache: scala-reflect-2.10..jar
[info] Including from cache: scala-library-2.10..jar
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[warn] Merging 'META-INF/INDEX.LIST' with strategy 'discard'
[warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[warn] Merging 'META-INF/io.netty.versions.properties' with strategy 'first'
[warn] Merging 'META-INF/maven/com.codahale.metrics/metrics-core/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.datastax.cassandra/cassandra-driver-core/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.google.guava/guava/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.twitter/jsr166e/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-buffer/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-codec/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-common/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-handler/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-transport/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/joda-time/joda-time/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.apache.commons/commons-lang3/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.joda/joda-convert/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.slf4j/slf4j-api/pom.xml' with strategy 'discard'
[warn] Strategy 'discard' was applied to 15 files
[warn] Strategy 'first' was applied to a file
[info] SHA-1: d2cb403e090e6a3ae36b08c860b258c79120fc90
[info] Packaging /Users/qpzhang/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 19 s, completed 2015-11-26 10:12:22
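
For conflicts other than .properties files, the same assemblyMergeStrategy block can carry more cases. The sketch below is illustrative only: the paths javax/servlet, application.conf and unwanted.txt are hypothetical examples rather than conflicts seen in this project, and which strategy is safe depends on what the colliding files actually contain.

```scala
// build.sbt (sketch; the matched paths are hypothetical examples)
assemblyMergeStrategy in assembly := {
  // Duplicate classes under one package: keep the first copy found
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
  // Config files that should be combined: concatenate all copies
  case "application.conf" => MergeStrategy.concat
  // A file that is not needed at runtime: drop it
  case "unwanted.txt" => MergeStrategy.discard
  // The rule from the build file above
  case PathList(ps @ _*) if ps.last endsWith ".properties" => MergeStrategy.first
  // Everything else: fall back to the plugin's default strategy
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
```

Cases are tried from top to bottom, so the more specific patterns go first and the fallback to the default strategy stays last.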

4. Execution result

qpzhang@qpzhangdeMac-mini:~/project/spark-1.5.-bin-hadoop2. $./bin/spark-submit --class "CassandraTestApp" --master local[] ~/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar
//...........................
// :: INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID , localhost, NODE_LOCAL, bytes)
// :: INFO Executor: Running task 0.0 in stage 0.0 (TID )
// :: INFO Executor: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar with timestamp 1448509221160
// :: INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
// :: INFO Utils: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar to /private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf-8489-4540-976e-e98eedf50412/userFiles-63085bda-aa04-4906-9621-c1cedd98c163/fetchFileTemp7487594.tmp
// :: INFO Executor: Adding file:/private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf---976e-e98eedf50412/userFiles-63085bda-aa04---c1cedd98c163/CassandraTest-assembly-1.0.jar to class loader
// :: INFO Cluster: New Cassandra host localhost/127.0.0.1: added
// :: INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
// :: INFO Executor: Finished task 0.0 in stage 0.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID ) in ms on localhost (/)
// :: INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: ResultStage (collect at CassandraTest.scala:) finished in 2.481 s
// :: INFO DAGScheduler: Job finished: collect at CassandraTest.scala:, took 2.940601 s
Existing Data: CassandraRow{key: 1, value: first row}
Existing Data: CassandraRow{key: 2, value: second row}
Existing Data: CassandraRow{key: 3, value: third row}
//....................
// :: INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: ResultStage (collect at CassandraTest.scala:) finished in 0.032 s
// :: INFO DAGScheduler: Job finished: collect at CassandraTest.scala:, took 0.046502 s
New Data: (4,fourth row)
New Data: (5,fifth row)
Work completed, stopping the Spark context.
