I have the following simple wordcount Python script.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)
from operator import add
f=sc.textFile("C:/Spark/spark-1.2.0/README.md")
wc=f.flatMap(lambda x: x.split(" ")).map(lambda x: (x,1)).reduceByKey(add)
print wc
wc.saveAsTextFile("wc_out.txt")
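(Side note: print wc only prints the RDD object itself, not the counts. To peek at the actual results I would use something like the following small sketch on the same wc RDD:)

for word, count in wc.take(10):   # take(10) brings just a few (word, count) pairs back to the driver
    print word, count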
I launch this script with the following command line:
spark-submit "C:/Users/Alexis/Desktop/SparkTest.py"
I get the following error:
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true
15/04/20 18:58:01 WARN Utils: Your hostname, AE-LenovoUltra resolves to a loopback address: 127.0.1.2; using 192.168.1.63 instead (on interface net0)
15/04/20 18:58:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/04/20 18:58:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/20 18:58:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:867)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:411)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:969)
        at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:280)
        at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:280)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:280)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
  File "C:/Users/Alexis/Desktop/SparkTest.py", line 3, in <module>
    sc = SparkContext(conf = conf)
  File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 105, in __init__
    conf, jsc)
  File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 153, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "C:\Spark\spark-1.2.0\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 701, in __call__
  File "C:\Spark\spark-1.2.0\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
        at org.apache.hadoop.util.Shell.run(Shell.java:379)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:411)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:969)
        at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:280)
        at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:280)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:280)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
To a Spark beginner like me, it seems that this is the problem: "ERROR Shell: Failed to locate the winutils binary in the hadoop binary path". However, the Spark documentation clearly states that Spark does not require a Hadoop installation to run in standalone mode.
What am I doing wrong?
Answer 0 (score: 5)
The good news is that you are not doing anything wrong, and your code will run once the error is worked around.
Despite the statement that Spark will run on Windows without Hadoop, it still looks for some Hadoop components. The bug has a JIRA ticket (SPARK-2356), and a patch is available. As of Spark 1.3.1, however, the patch has not yet been committed to the main branch.
Fortunately, it is fairly easy to work around.
1. Create a bin directory for winutils under your Spark installation directory. In my case, Spark is installed in D:\Languages\Spark, so I created the following path: D:\Languages\Spark\winutils\bin
2. Download winutils.exe from Hortonworks and place it in the bin directory created in step 1. Download link for Win64: http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
3. Create a HADOOP_HOME environment variable that points to the winutils directory (not the bin subdirectory). You can do this in a couple of ways:
a. Set a permanent environment variable via Control Panel -> System -> Advanced System Settings -> Advanced Tab -> Environment variables. You can create either a user variable or a system variable with the following values:
Variable Name=HADOOP_HOME
Variable Value=D:\Languages\Spark\winutils\
b. Set a temporary environment variable in the command shell, before executing the script:
set HADOOP_HOME=d:\Languages\Spark\winutils
4. Run your code. It should now run without errors.
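As a quick sanity check, you can also verify from Python that the variable is visible and that winutils.exe is where Hadoop expects it, before the SparkContext is created. A minimal sketch (plain Python 2 to match the script above; only the standard library, no Spark APIs):

import os
import os.path

hadoop_home = os.environ.get("HADOOP_HOME")        # should be the winutils directory, not its bin subdirectory
winutils = os.path.join(hadoop_home or "", "bin", "winutils.exe")
print "HADOOP_HOME =", hadoop_home
print "winutils.exe found:", hadoop_home is not None and os.path.isfile(winutils)

If both lines look right, the winutils error above should be gone the next time you run spark-submit.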