Spark SQL very slow - fails after a few hours - Executors Lost

Date: 2016-04-11 04:24:53

Tags: apache-spark pyspark apache-spark-sql

I am trying Spark SQL on a dataset of ~16 TB with a huge number of files (~50K). Each file is roughly 400-500 MB.

I am issuing a fairly simple Hive query on the dataset with only filters (no groupBy and no joins), and the job is very slow. It runs for 7-8 hours and processes only about 80-100 GB on a 12-node cluster.

I have tried different values of spark.sql.shuffle.partitions from 20 to 4000, but have not seen much difference. A minimal sketch of the kind of job I am running is shown below.
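For illustration only, here is a minimal PySpark sketch of this kind of job, using the Spark 1.x API current at the time of this post. The table name, filter condition, and output path are hypothetical, since the actual query is not shown.

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("filter-only-query")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

# One of the shuffle-partition values tried above.
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

# Filter-only Hive query: no groupBy, no joins (hypothetical table and filter).
result = sqlContext.sql("SELECT * FROM events WHERE event_date = '2016-04-01'")

# Write the filtered result out (hypothetical output path).
result.write.parquet("hdfs:///tmp/filtered_output")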

From the logs I am getting the YARN error appended at the end [1]. My Spark configuration is given below [2].

Is there any other tuning I should look into? Any tips would be greatly appreciated.

Thanks

2. Spark config - 
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2
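
For reference, the same settings expressed programmatically in PySpark; this is only an equivalent sketch, assuming the job is started as a plain Python script in yarn-client mode. Driver memory is omitted because it must be supplied at launch time (the driver JVM is already running when SparkConf is read).

from pyspark import SparkConf, SparkContext

# Mirrors the spark-submit flags above.
conf = (SparkConf()
        .setMaster("yarn-client")
        .set("spark.executor.memory", "10g")
        .set("spark.executor.cores", "5")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.initialExecutors", "2")
        .set("spark.dynamicAllocation.minExecutors", "2"))

sc = SparkContext(conf=conf)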


1. Yarn Error:

16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1

I have explored the container logs but did not get much information from them.

I have seen the following errors in the logs of a few containers, but am not sure of the cause.

1. java.lang.NullPointerException  at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
2. java.lang.ClassCastException: Cannot cast org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed to org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$

0 answers:

There are no answers