sc.TextFile("")在Eclipse中工作但不在JAR中工作

Date: 2017-12-14 09:15:09

Tags: eclipse scala hadoop apache-spark rdd

I'm writing code that will eventually run on a Hadoop cluster, but before that I'm testing locally with local files. The code works fine in Eclipse, but when I build a fat JAR with SBT (with the Spark libraries etc. inside), the program runs fine until it reaches this line of my code:

textFile(path)

Here is my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}
import org.joda.time.format.DateTimeFormat
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer




object TestCRA2 {

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("Test")
      .set("spark.driver.memory", "4g")
      .set("spark.executor.memory", "4g")
    val context = new SparkContext(conf)//.master("local")
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)

    def TimeParse1(path: String): RDD[Array[String]] = {
        // Read the text file and split each line on ";".
        val data = context.textFile(path).map(_.split(";"))
        data
    }

    def main(args: Array[String]) {

        val data = TimeParse1("file:///home/quentin/Downloads/CRA") 
    }
}

I can't put my files inside the JAR, because they will live on the Hadoop cluster, and it already works in Eclipse.

Here is my error:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:341)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1034)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1029)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
    at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1029)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
    at org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
    at main.scala.TestCRA2$.TimeParse1(TestCRA.scala:37)
    at main.scala.TestCRA2$.main(TestCRA.scala:84)
    at main.scala.TestCRA2.main(TestCRA.scala)

And here is my build.sbt:

name := "BloomFilters"

version := "1.0"

scalaVersion := "2.11.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"

libraryDependencies += "joda-time" % "joda-time" % "2.9.3"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

If I don't use this assemblyMergeStrategy, I get lots of merge errors.

Actually, I needed to change the assemblyMergeStrategy in my build.sbt, as shown in the answer below.

Thanks @lyomi

1 Answer:

Answer 0 (score: 1)

Your sbt assembly is probably discarding some required files. Specifically, Hadoop's FileSystem class relies on a service-discovery mechanism that looks up every META-INF/services/org.apache.hadoop.fs.FileSystem file on the classpath.
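
As an illustration of that lookup, here is a minimal sketch (the object name ListFileSystems is purely illustrative, and it assumes the Hadoop classes are on the classpath) that uses the same java.util.ServiceLoader mechanism to print which FileSystem implementations are actually discoverable:

import java.util.ServiceLoader
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

object ListFileSystems {
  def main(args: Array[String]): Unit = {
    // ServiceLoader reads every META-INF/services/org.apache.hadoop.fs.FileSystem
    // entry it can see and instantiates the implementations listed there.
    val discovered = ServiceLoader.load(classOf[FileSystem]).asScala
    // With a correctly assembled JAR this should include
    // org.apache.hadoop.fs.LocalFileSystem (the "file" scheme); if the services
    // file was discarded, the list will be empty or incomplete.
    discovered.foreach(fs => println(fs.getClass.getName))
  }
}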

This is fine in Eclipse, because each JAR keeps its own copy of that file, but in the uber-jar one copy can overwrite the others, so the file: scheme may end up unrecognized.
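
One way to see that difference is a small diagnostic like the hypothetical CheckFileSystemServices below, which lists every copy of the service file the class loader can see: run from Eclipse it should print one URL per Hadoop/Spark JAR, while run from the uber-jar it prints a single (possibly incomplete) merged copy:

object CheckFileSystemServices {
  def main(args: Array[String]): Unit = {
    val urls = getClass.getClassLoader
      .getResources("META-INF/services/org.apache.hadoop.fs.FileSystem")
    while (urls.hasMoreElements) {
      val url = urls.nextElement()
      println(url)
      // Print the implementations registered in this particular copy.
      println(scala.io.Source.fromURL(url).mkString)
    }
  }
}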

In your SBT settings, add the following so that the service-discovery files are concatenated instead of some of them being discarded:

val defaultMergeStrategy: String => MergeStrategy = {
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      // ... possibly other settings ...
      case "services" :: xs =>
        MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.deduplicate
    }
  case _ => MergeStrategy.deduplicate
}
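
Concretely, applied to the build.sbt from the question, the merge strategy could become something like the following sketch (it reuses the assemblyMergeStrategy in assembly syntax the question already has; only the META-INF handling changes, so the services entries are concatenated rather than discarded):

assemblyMergeStrategy in assembly := {
  // Merge the distinct lines of every services file instead of dropping it.
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  // Everything else under META-INF can still be discarded.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}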

See the README.md of sbt-assembly for more details.