No input paths specified in job

Date: 2017-08-01 19:36:03

Tags: pyspark

Please help me solve this problem, thanks:

I have a cluster at work with Spark installed on three nodes. I want to do some work with Spark, and on one of the nodes my data file sits in my own directory, but when I create a DataFrame there seems to be some problem with its location. I have already chmod'ed the .dat file to 777. Below are the commands I am running:

>>> df = sqlContext.read.text("/home/rx52019/data/airports-extended.dat")
17/08/01 15:29:10 INFO text.TextRelation: Listing hdfs://dev-icg/home/rx52019/data/airports-extended.dat on driver
>>> df.printSchema
<bound method DataFrame.printSchema of DataFrame[value: string]>
>>>

If I run df.show(10), I get:

>>> df.show(10)
17/08/01 15:31:34 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 286.5 KB, free 1819.7 KB)
17/08/01 15:31:34 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 24.0 KB, free 1843.7 KB)
17/08/01 15:31:34 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 10.49.31.80:44407 (size: 24.0 KB, free: 529.9 MB)
17/08/01 15:31:34 INFO spark.SparkContext: Created broadcast 8 from showString at NativeMethodAccessorImpl.java:-2
17/08/01 15:31:34 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 272.7 KB, free 2.1 MB)
17/08/01 15:31:34 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 24.0 KB, free 2.1 MB)
17/08/01 15:31:34 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on 10.49.31.80:44407 (size: 24.0 KB, free: 529.8 MB)
17/08/01 15:31:34 INFO spark.SparkContext: Created broadcast 9 from showString at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/pyspark/sql/dataframe.py", line 257, in show
    print(self._jdf.showString(n, truncate))
  File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o196.showString.
: java.io.IOException: No input paths specified in job
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:202)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)

1 Answer:

Answer 0 (score: 0):

I sorted it out by uploading the file to HDFS instead of keeping it as a Linux file:

hadoop fs -put /home/rx52019/data/airports-extended.dat hdfs://dev/user/spark/
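
To confirm the upload landed where Spark will look for it, hadoop fs -ls hdfs://dev/user/spark/ should list the file (the hdfs://dev URI is the NameNode address used in the command above and will differ per cluster).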

Then I changed the load code to:

df = sqlContext.read.text("hdfs://dev/user/spark/airports-extended.dat")

Now df.show(10) works as expected.
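
For reference, the reason the local path failed: Spark resolves a bare path such as /home/rx52019/data/... against the cluster's default filesystem (fs.defaultFS), which here is HDFS; that is why the log shows "Listing hdfs://dev-icg/home/rx52019/data/airports-extended.dat". A local file can also be read by giving the scheme explicitly, but only if the file exists at that path on every worker node, which is why uploading to HDFS is the safer fix. A minimal sketch, assuming the same local path exists on all nodes:

# explicit file:// scheme forces the local filesystem instead of fs.defaultFS
df = sqlContext.read.text("file:///home/rx52019/data/airports-extended.dat")
df.show(10)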