Question

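I am using Apache Spark on a Windows machine. I am fairly new to it, and I am working locally before uploading my code to the cluster.

I wrote a very simple Scala program, and everything works fine: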

println("creating Dataframe from json")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rawData = sqlContext.read.json("test_data.txt")
println("this is the test data table")
rawData.show()
println("finished running")

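The program executes correctly. I now want to add some processing that calls a few simple Java functions I have pre-packaged in a JAR file. I am running the Scala shell. As described on the getting started page, I start the shell with: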

c:\Users\eshalev\Desktop\spark-1.4.1-bin-hadoop2.6\bin\spark-shell --master local[4] --jars myjar-1.0-SNAPSHOT.jar

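Important fact: I do not have Hadoop installed on my local machine. Since I am only parsing a text file, this should not matter, and it did not matter until I used --jars.

I now run the same Scala program again. There are no references to the JAR file yet... This time I get: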

...some SPARK debug code here and then...
    15/09/08 14:27:37 INFO Executor: Fetching http://10.61.97.179:62752/jars/myjar-1.0-SNAPSHOT.jar with timestamp 1441715239626
    15/09/08 14:27:37 INFO Utils: Fetching http://10.61.97.179:62752/jars/myjar-1.0-SNAPSHOT.jar-1.0 to C:\Users\eshalev\AppData\Local\Temp\spark-dd9eb37f-4033-4c37-bdbf-5df309b5eace\userFiles-ebe63c02-8161-4162-9dc0-74e3df6f7356\fetchFileTemp2982091960655942774.tmp
    15/09/08 14:27:37 INFO Executor: Fetching http://10.61.97.179:62752/jars/myjar-1.0-SNAPSHOT.jar with timestamp 1441715239626
    15/09/08 14:27:37 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
    java.lang.NullPointerException
            at java.lang.ProcessBuilder.start(Unknown Source)
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
            at org.apache.hadoop.util.Shell.run(Shell.java:455)
            at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
            at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
            at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
            at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
... plenty more Spark debug messages here, and then ...
this is the test data table
<console>:20: error: not found: value rawData
              rawData.show()
              ^
finished running

I double-checked http://10.61.97.179:62752/jars/myjar-1.0-SNAPSHOT.jar-1.0-SNAPSHOT.jar, and I can download it. Then again, nothing in the code references the JAR yet. If I start the shell without --jars, everything works fine.

Answer 1

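I tried this on another cluster, which runs Spark 1.3.1 and has Hadoop installed. It worked flawlessly.

The number of times Hadoop is mentioned in the stack trace from my single-node setup leads me to believe that using the --jars flag requires an actual Hadoop installation.

The other possibility is that something is wrong with my Spark 1.4 setup, which had been working fine until then.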
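
One way to narrow down the Hadoop hypothesis is to check, from inside spark-shell, whether Hadoop's Windows helper binary can be found at all. The stack trace fails inside FileUtil.chmod, which on Windows shells out to winutils.exe. The sketch below assumes that Hadoop resolves its home directory from the hadoop.home.dir system property, then the HADOOP_HOME environment variable, and expects winutils.exe under its bin folder. That matches my understanding of Hadoop 2.6, but it is not confirmed by the logs above. The helper names hadoopHome, winutilsFile and describe are made up for illustration.

```scala
import java.io.File

// Assumption: Hadoop looks at the hadoop.home.dir system property first,
// then falls back to the HADOOP_HOME environment variable.
def hadoopHome(props: Map[String, String], env: Map[String, String]): Option[String] =
  props.get("hadoop.home.dir").orElse(env.get("HADOOP_HOME")).filter(_.nonEmpty)

// Assumption: winutils.exe is expected at <hadoop home>\bin\winutils.exe.
def winutilsFile(home: String): File =
  new File(new File(home, "bin"), "winutils.exe")

// Turn the lookup result into a human-readable diagnosis.
def describe(home: Option[String]): String = home match {
  case None =>
    "Hadoop home is not set (neither hadoop.home.dir nor HADOOP_HOME)"
  case Some(h) if !winutilsFile(h).isFile =>
    "Hadoop home is " + h + ", but " + winutilsFile(h).getPath + " is missing"
  case Some(h) =>
    "Found " + winutilsFile(h).getPath
}

// Inspect the environment of the current JVM (for example, the spark-shell session).
println(describe(hadoopHome(sys.props.toMap, sys.env)))
```

If this reports that the Hadoop home is not set or that winutils.exe is missing, that would be consistent with the NullPointerException in ProcessBuilder.start, and it would suggest that what --jars needs locally is the helper binary rather than a full Hadoop installation.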
