I am running a Spark job in cluster mode. My driver downloads a file and then registers it for all executors with `SparkSession.sparkContext.addFile("file:///" + file.toString)`. Note: `file` here is a `java.io.File` object. I then call `sc.textFile("file:///" + SparkFiles.get(fileName))`. Note: `fileName` is actually `file.getName`, i.e. the name of that `java.io.File`. I get a FileNotFoundException, even though the file is less than 500 KB. I went through the YARN logs and found this:
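For context, the relevant driver code looks roughly like the sketch below (`downloadFile()` is just a stand-in for my actual download step, and `spark` is the SparkSession):

```scala
import java.io.File
import org.apache.spark.SparkFiles

val sc = spark.sparkContext

// The driver downloads the file to its local disk first
// (downloadFile() is a placeholder for my real download logic).
val file: File = downloadFile()

// Register the file so that every executor fetches its own copy.
sc.addFile("file:///" + file.toString)

// Later, try to read it back; fileName is simply file.getName.
val fileName = file.getName
val rdd = sc.textFile("file:///" + SparkFiles.get(fileName))
rdd.count()   // <-- this is where the FileNotFoundException is thrown
```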
10:35:17 INFO executor.Executor: Fetching spark://foo.bar.ca:45133/files/QQ4hyC.csv with timestamp 1562250908486
19/07/04 10:35:17 INFO client.TransportClientFactory: Successfully created connection to foo.bar.ca/102.63.12.200:45000 after 1 ms (0 ms spent in bootstraps)
19/07/04 10:35:17 INFO util.Utils: Fetching spark://foo.bar.ca:45133/files/QQ4hyC.csv to /data15/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-d0ee1f52-0b3d-4f28-8c40-31283fbc6c00/fetchFileTemp6119089786950130363.tmp
19/07/04 10:35:17 INFO util.Utils: Copying /data15/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-d0ee1f52-0b3d-4f28-8c40-31283fbc6c00/-20762980841562250908486_cache to /data19/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/container_e94_1559671533076_152651_01_000002/./QQ4hyC.csv
19/07/04 10:35:17 INFO executor.Executor: Fetching spark://foo.bar.ca:45133/files/20mo2V.csv with timestamp 1562250908498
19/07/04 10:35:17 INFO util.Utils: Fetching spark://foo.bar.ca:45133/files/20mo2V.csv to /data15/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-d0ee1f52-0b3d-4f28-8c40-31283fbc6c00/fetchFileTemp4236523310531688097.tmp
19/07/04 10:35:17 INFO util.Utils: Copying /data15/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-d0ee1f52-0b3d-4f28-8c40-31283fbc6c00/-3045146541562250908498_cache to /data19/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/container_e94_1559671533076_152651_01_000002/./20mo2V.csv
19/07/04 10:35:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
19/07/04 10:35:17 INFO client.TransportClientFactory: Successfully created connection to foo.bar.ca/102.63.12.200:46650 after 5 ms (0 ms spent in bootstraps)
19/07/04 10:35:17 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.2 KB, free 912.3 MB)
19/07/04 10:35:17 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 148 ms
19/07/04 10:35:18 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.6 KB, free 912.3 MB)
19/07/04 10:35:18 INFO rdd.HadoopRDD: Input split: file:/data20/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-28989d25-5d6e-4e49-8513-699d01ac0976/userFiles-dd2b37d9-0029-486e-9c55-84703887b1ca/QQ4hyC.csv:0+48343
19/07/04 10:35:18 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
19/07/04 10:35:18 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 31.5 KB, free 912.3 MB)
19/07/04 10:35:18 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 27 ms
19/07/04 10:35:18 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 512.6 KB, free 911.8 MB)
19/07/04 10:35:19 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.FileNotFoundException: File file:/data20/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-28989d25-5d6e-4e49-8513-699d01ac0976/userFiles-dd2b37d9-0029-486e-9c55-84703887b1ca/QQ4hyC.csv does not exist
If you look at the `INFO util.Utils: Copying` lines, the file really was copied from /data15/... to /data19/..., yet the exception is thrown for /data20/.../userFiles-.../QQ4hyC.csv, a different location, where it apparently does not exist.
From the official documentation, SparkContext.addFile() adds a file to every worker node, and SparkFiles.get() retrieves that file from the worker node it was copied to. Is this a bug?
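For reference, this is roughly how I read the documented usage of addFile/SparkFiles.get (only a sketch of my understanding, not something I have verified in cluster mode):

```scala
import org.apache.spark.SparkFiles
import scala.io.Source

// Driver side: register the file with the job.
sc.addFile("file:///" + file.toString)
val fileName = file.getName

// Executor side: SparkFiles.get is evaluated inside the task, so it should
// resolve to that executor's own local copy of the file.
val lineCounts = sc.parallelize(Seq(1)).map { _ =>
  val src = Source.fromFile(SparkFiles.get(fileName))
  try src.getLines().size finally src.close()
}
lineCounts.collect()
```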