Question

I previously ran this code:

df = sc.wholeTextFiles('./dbs-*.json,./uob-*.json').flatMap(lambda x: flattenTransactionFile(json.loads(x[1]))).toDF()

But now I get:

Py4JJavaError: An error occurred while calling o24.partitions.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/jiewmeng/dbs-*.json matches 0 files
Input Pattern hdfs://localhost:9000/user/jiewmeng/uob-*.json matches 0 files
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:330)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:272)
    at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)

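It looks like Spark is trying to use Hadoop (HDFS)? How can I use local files instead? And why did this suddenly fail, given that I previously managed to use ./dbs-*.json?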

Answer 1

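By default, a file's location is resolved relative to your user directory in HDFS. To refer to the local file system, you need to use a file:// URI.

For example, in the Cloudera VM, if I say

sc.textFile('myfile')

it will assume the HDFS path

/user/cloudera/myfile

whereas to refer to a file in my local home directory I would say

sc.textFile('file:///home/cloudera/myfile')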
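
Applied to the globs in the question, this could look roughly like the sketch below. It assumes a POSIX-style local file system and that the JSON files sit in the directory the driver is started from; to_local_uri is a hypothetical helper name, not part of Spark. The error message shows scheme-less paths being resolved against hdfs://localhost:9000, so the idea is simply to make the file:// scheme explicit.

```python
import os

def to_local_uri(pattern):
    """Turn a relative local path or glob into an explicit file:// URI.

    Hypothetical helper (not a Spark API); assumes POSIX-style absolute paths.
    """
    return "file://" + os.path.abspath(pattern)

# The two globs from the question, now pinned to the local file system.
patterns = ["./dbs-*.json", "./uob-*.json"]
local_paths = ",".join(to_local_uri(p) for p in patterns)

# With an existing SparkContext `sc`, as in the question:
# df = sc.wholeTextFiles(local_paths) \
#        .flatMap(lambda x: flattenTransactionFile(json.loads(x[1]))) \
#        .toDF()
```

On a cluster, a file:// path has to exist on every worker node, so this mainly fits local mode.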
