I am working through this tutorial: http://spark.apache.org/docs/latest/quick-start.html, to no avail.

I tried the following:
textFile=sc.textFile("README.md")
textFile.count()
Instead of the desired result, 126, this is the output I get:
> textFile=sc.textFile("README.md")
15/11/18 13:19:49 INFO MemoryStore: ensureFreeSpace(182712) called with curMem=254076, maxMem=556038881
15/11/18 13:19:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 178.4 KB, free 529.9 MB)
15/11/18 13:19:49 INFO MemoryStore: ensureFreeSpace(17179) called with curMem=436788, maxMem=556038881
15/11/18 13:19:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 16.8 KB, free 529.8 MB)
15/11/18 13:19:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:61916 (size: 16.8 KB, free: 530.2 MB)
15/11/18 13:19:49 INFO SparkContext: Created broadcast 2 from textFile at null:-2
> textFile.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 1006, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 997, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 871, in fold
    vals = self.mapPartitions(func).collect()
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 773, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\sql\utils.py", line 36, in deco
    return f(*a, **kw)
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/Administrator/Downloads/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/bin/README.md
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)
Answer (score: 2)
As @santon says, your input path does not exist; indeed, the file README.md lives in the Spark home directory, not under $SPARK_HOME/bin. Here is the situation on Ubuntu:

~$ echo $SPARK_HOME
/usr/local/bin/spark-1.5.1-bin-hadoop2.6
~$ cd $SPARK_HOME
/usr/local/bin/spark-1.5.1-bin-hadoop2.6$ ls
bin          conf  ec2       lib      NOTICE  R          RELEASE
CHANGES.txt  data  examples  LICENSE  python  README.md  sbin

So, since README.md is not in your working directory, you should either give its full path or make sure the file exists in your current working directory, which is the directory from which you started pyspark:

/usr/local/bin/spark-1.5.1-bin-hadoop2.6$ ./bin/pyspark
[...]
>>> import os
>>> os.getcwd()
'/usr/local/bin/spark-1.5.1-bin-hadoop2.6'
>>> os.listdir(os.getcwd())
['lib', 'LICENSE', 'python', 'NOTICE', 'examples', 'ec2', 'README.md', 'conf', 'CHANGES.txt', 'R', 'data', 'RELEASE', 'bin', 'sbin']

Now your code will work, since README.md is in your working directory:

>>> textFile=sc.textFile("README.md")
[...]
>>> textFile.count()
[...]
98
By the way, the correct answer is 98 (just cross-checked); I'm not sure why the tutorial asks for 126.
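For a quick cross-check outside of the Spark API, a plain Python line count over the same file should give the same number. This is just a sketch, run from the Spark home directory; the exact count depends on the README.md that ships with your Spark version:

>>> sum(1 for _ in open("README.md"))   # counts newline-delimited lines, like textFile.count()
98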
To sum up: make sure the file you are looking for exists in your current working directory. If it does, you can use your code unmodified; if it does not, you should either give the full path to the file or change your working directory with the appropriate Python commands.
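On the Windows setup from the question, a minimal sketch of the full-path option could look like this. The path below is reconstructed from the traceback above (the directory that the error points at, minus the trailing bin), so adjust it to wherever your Spark distribution actually lives:

>>> path = "C:/Users/Administrator/Downloads/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/README.md"
>>> textFile = sc.textFile(path)   # forward slashes avoid backslash-escaping issues in the string literal
>>> textFile.count()

Alternatively, cd into that same directory (the one that actually contains README.md) before launching bin\pyspark, and the tutorial code runs unmodified.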