I have an HDFS directory of 13.2 GB containing 4 files. I am trying to read all of the files in Spark using the wholeTextFiles method, but I am running into some problems.
Here is my code:
val path = "/tmp/cnt/warehouse/"
// Pass the path variable, not the string literal "path"
val whole = sc.wholeTextFiles(path, 32)
// Each record is (fileName, fileContent); split the content into lines
val data = whole.map(r => (r._1, r._2.split("\r\n")))
// The original referenced an undefined `file`; presumably this should
// flatten the split lines from `data` (r._2), not the file name (r._1)
val x = data.flatMap(r => r._2)
x.take(1000).foreach(println)
Below is my spark-submit command:
spark2-submit \
--class SparkTest \
--master yarn \
--deploy-mode cluster \
--num-executors 32 \
--executor-memory 15G \
--driver-memory 25G \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.port.maxRetries=100 \
--conf spark.kryoserializer.buffer.max=1g \
--conf spark.yarn.queue=xyz \
SparkTest-1.0-SNAPSHOT.jar
Below is the error:
Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 113, , executor 37): ExecutorLostFailure (executor 37 exited caused by one of the running tasks) Reason: Container from a bad node: container_e599_1560551438641_35180_01_000057 on host: . Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e599_1560551438641_35180_01_000057
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.__launchContainer__(LinuxContainerExecutor.java:399)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 52
.
Driver stacktrace:
Answer 0 (score: 0)
- Even though I pass a minimum of 32 partitions, it is stored in only 4 partitions.
You can refer to the link below:
Spark Creates Less Partitions Then minPartitions Argument on WholeTextFiles
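For background: wholeTextFiles reads each file as a single, non-splittable record, so the minPartitions argument is only a hint and you get at most one partition per file (4 here). If you need more parallelism downstream, one workaround is to repartition after the read. A minimal sketch, assuming you want (fileName, line) pairs:

val whole = sc.wholeTextFiles(path)  // at most one partition per file here
val lines = whole
  .flatMap { case (file, content) => content.split("\r\n").map(line => (file, line)) }
  .repartition(32)  // redistribute the exploded lines across 32 partitions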
- Is my spark-submit correct?
The syntax is correct, but the values you are passing are far more than needed. I mean, you are giving the executors 32 * 15 = 480 GB, plus 25 GB for the driver, to process 13 GB of data?
Giving the job more executors and more memory does not make it run more efficiently; it sometimes just adds overhead, and the job can also fail because the requested resources are not available.
The error also points to a resource problem: exit status 52 is the code a Spark executor returns when it dies with an OutOfMemoryError.
For processing just 13 GB of data, you should use a configuration like the one below (not exactly this; you have to calculate it for your cluster):
Executors: 6
Cores per executor: 5
Executor memory: 5 GB
Driver memory: 2 GB
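Applied to the spark2-submit command from the question, those numbers would look roughly like this (a sketch only, keeping the original class, queue, and jar):

spark2-submit \
--class SparkTest \
--master yarn \
--deploy-mode cluster \
--num-executors 6 \
--executor-cores 5 \
--executor-memory 5G \
--driver-memory 2G \
--conf spark.yarn.queue=xyz \
SparkTest-1.0-SNAPSHOT.jar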
For more details and the calculation, you can refer to the link below:
How to tune spark executor number, cores and executor memory?
Note: The driver does not need more memory than the executors, so in most cases driver memory should be less than or equal to executor memory.