What does the following error mean in PySpark?

Asked: 2015-11-18 20:36:41

Tags: apache-spark pyspark

I am following this tutorial: http://spark.apache.org/docs/latest/quick-start.html, to no avail.

I tried the following:

textFile=sc.textFile("README.md")
textFile.count()

Below is the output I get instead of the expected result, 126.

> textFile=sc.textFile("README.md")
15/11/18 13:19:49 INFO MemoryStore: ensureFreeSpace(182712) called with curMem=254076, maxMem=556038881
15/11/18 13:19:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 178.4 KB, free 529.9 MB)
15/11/18 13:19:49 INFO MemoryStore: ensureFreeSpace(17179) called with curMem=436788, maxMem=556038881
15/11/18 13:19:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 16.8 KB, free 529.8 MB)
15/11/18 13:19:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:61916 (size: 16.8 KB, free: 530.2 MB)
15/11/18 13:19:49 INFO SparkContext: Created broadcast 2 from textFile at null:-2

> textFile.count()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 1006, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 997, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 871, in fold
    vals = self.mapPartitions(func).collect()
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 773, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\sql\utils.py", line 36, in deco
    return f(*a, **kw)
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/Administrator/Downloads/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/bin/README.md
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)

1 Answer:

Answer 0 (score: 2):

As @santon said, your input path does not exist; indeed, the file README.md lives in the Spark home directory, not under $SPARK_HOME/bin. Here is the situation in Ubuntu:

~$ echo $SPARK_HOME
/usr/local/bin/spark-1.5.1-bin-hadoop2.6
~$ cd $SPARK_HOME
/usr/local/bin/spark-1.5.1-bin-hadoop2.6$ ls
bin  conf  ec2  lib  NOTICE  R  RELEASE
CHANGES.txt  data  examples  LICENSE  python  README.md  sbin

So, since README.md is not in your working directory, you should either provide its full path, or make sure the file is present in the current working directory, i.e. the directory from which you started pyspark:

/usr/local/bin/spark-1.5.1-bin-hadoop2.6$ ./bin/pyspark
[...]
>>> import os
>>> os.getcwd()
'/usr/local/bin/spark-1.5.1-bin-hadoop2.6'
>>> os.listdir(os.getcwd())
['lib', 'LICENSE', 'python', 'NOTICE', 'examples', 'ec2', 'README.md', 'conf', 'CHANGES.txt', 'R', 'data', 'RELEASE', 'bin', 'sbin']

Now your code will work, since README.md is in your working directory:

>>> textFile=sc.textFile("README.md")
[...]
>>> textFile.count()
[...]
98
BTW, the correct answer is 98 (cross-checked) - not sure why the tutorial asks for 126.

To sum up: check (e.g. with the commands shown above) that the file you are looking for exists in your current working directory; if it does, you can use your code unmodified; if not, you should either provide the full file path or change the working directory with the appropriate Python commands.
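
For the Windows setup in the question, a minimal sketch of the full-path option could look like the following; the Spark home path is copied from the traceback above and is only an assumption about where your distribution actually lives (note that README.md sits one level above bin):

>>> # assumed Spark home, taken from the traceback above; adjust if yours differs
>>> spark_home = "C:/Users/Administrator/Downloads/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4"
>>> # an absolute path makes the current working directory irrelevant
>>> textFile = sc.textFile(spark_home + "/README.md")
>>> textFile.count()

Alternatively, copy README.md into the directory reported by os.getcwd() and keep the relative path from the tutorial unchanged.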