I'm writing code that will eventually run on a Hadoop cluster, but for now I'm testing it locally with local files. The code works fine in Eclipse, but when I build a fat JAR with SBT (including the Spark libraries, etc.), the program runs until it reaches this call in my code:
textFile(path)
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}
import org.joda.time.format.DateTimeFormat
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

object TestCRA2 {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("Test")
    .set("spark.driver.memory", "4g")
    .set("spark.executor.memory", "4g")
  val context = new SparkContext(conf)

  val rootLogger = Logger.getRootLogger()
  rootLogger.setLevel(Level.ERROR)

  // Read the file and split each line on ";"
  def TimeParse1(path: String): RDD[Array[String]] = {
    val data = context.textFile(path).map(_.split(";"))
    data
  }

  def main(args: Array[String]) {
    val data = TimeParse1("file:///home/quentin/Downloads/CRA")
  }
}
I can't put my files into the JAR because they are on the Hadoop cluster, and everything works when I run it from Eclipse.
Here is my error:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:341)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1034)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1029)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1029)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
at main.scala.TestCRA2$.TimeParse1(TestCRA.scala:37)
at main.scala.TestCRA2$.main(TestCRA.scala:84)
at main.scala.TestCRA2.main(TestCRA.scala)
Here is my build.sbt:

name := "BloomFilters"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "joda-time" % "joda-time" % "2.9.3"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
If I don't include the assemblyMergeStrategy block above, I get a lot of merge errors.
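For reference, the assembly task and the assemblyMergeStrategy key come from the sbt-assembly plugin, enabled in project/plugins.sbt roughly like this (the 0.14.5 version below is just an example, not necessarily the one in use here):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")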
Actually I needed to change the assemblyMergeStrategy in my build.sbt like this (see the answer below).
Thanks @lyomi
Answer 0 (score: 1)
Your sbt assembly probably discarded some required files. Specifically, Hadoop's FileSystem class relies on a service-discovery mechanism that looks up every META-INF/services/org.apache.hadoop.fs.FileSystem file on the classpath.

This works in Eclipse because each JAR carries its own copy of that file, but in the uber-jar one copy can overwrite the others, so the file: scheme ends up unrecognized.
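As a quick sanity check, a minimal sketch along these lines (the ListFileSystems object name is just illustrative, and it assumes the Hadoop client JARs are on the classpath) lists which FileSystem implementations the service loader can actually see inside the assembled JAR:

import java.util.ServiceLoader
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

object ListFileSystems {
  def main(args: Array[String]): Unit = {
    // One line per implementation registered through
    // META-INF/services/org.apache.hadoop.fs.FileSystem on the classpath.
    ServiceLoader.load(classOf[FileSystem]).asScala
      .foreach(fs => println(fs.getClass.getName))
  }
}

If org.apache.hadoop.fs.LocalFileSystem is missing from the output, the registration for the file: scheme was lost during the merge.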
In your SBT settings, add the following to concatenate the service-discovery files instead of discarding some of them:
val defaultMergeStrategy: String => MergeStrategy = {
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      // ... possibly other settings ...
      case "services" :: xs =>
        // Merge service-registration files line by line instead of keeping only one copy
        MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.deduplicate
    }
  case _ => MergeStrategy.deduplicate
}
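Applied to the build.sbt from the question, the merge strategy could look roughly like this (a sketch only, using the assemblyMergeStrategy key already present in that build.sbt; keep any other cases your build needs):

assemblyMergeStrategy in assembly := {
  // Merge the service-registration files line by line so that every
  // FileSystem implementation (file:, hdfs:, ...) stays registered.
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  // Other META-INF entries (manifests, signature files) can still be dropped.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}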
See the README.md of sbt-assembly for more details.