Question

My PySpark script saves the DataFrame it creates to a directory:

df.write.save(full_path, format=file_format, mode=options['mode'])

If I read the files back within the same run, everything works:

return sqlContext.read.format(file_format).load(full_path)

However, when I try to read the files from this directory in a separate run of the script, I get an error:

java.io.FileNotFoundException: File does not exist: /hadoop/log_files/some_data.json/part-00000-26c649cb-0c0f-421f-b04a-9d6a81bb6767.json

As I understand it, Spark's own hint points to a workaround:

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

However, I would like to know why this fails, and what the proper way to handle this kind of problem is.

Answer 1

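You are managing two objects that refer to the same files, so any cache tied to one of them will cause problems, because both point at the same path. A simple solution is described here: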

https://stackoverflow.com/a/60328199/5647992
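
For illustration, here is a minimal sketch of the cache invalidation that Spark's message suggests, done by path rather than by table name. It is not necessarily what the linked answer does. The helper name `load_fresh` is hypothetical, and it assumes you have a `SparkSession` available (called `spark` here) instead of the `sqlContext` used in the question.

```python
def load_fresh(spark, full_path, file_format):
    """Re-read full_path after dropping Spark's cached state for that path.

    `spark` is assumed to be a SparkSession; `load_fresh` is a hypothetical
    helper name, not part of PySpark.
    """
    # Invalidate cached data and file metadata for anything under this path.
    spark.catalog.refreshByPath(full_path)
    # Recreate the DataFrame instead of reusing one built before the write.
    return spark.read.format(file_format).load(full_path)
```

With the names from the question, the call would look something like `df = load_fresh(sqlContext.sparkSession, full_path, file_format)`. If the error persists after that, the part file is probably really gone (for example, another job overwrote the directory), and refreshing the cache will not bring it back.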
