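We are currently trying to run a Spark job on a Dataproc cluster using PySpark 2.2.0, but the Spark job stops after a seemingly random amount of time with the following error message: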
17/07/25 00:52:48 ERROR org.apache.spark.api.python.PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:702)
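Sometimes the error appears after only a few minutes, and sometimes it takes 3 hours. In my experience, the Spark job typically runs for about 30 minutes to 1 hour before hitting the error.
Once the Spark job hits the error, it simply stalls. No matter how long I wait, it outputs nothing. On the YARN ResourceManager the application status is still labeled "RUNNING", and I have to press Ctrl+C to terminate the program. At that point the application is labeled "FINISHED".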
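I run the Spark job from the master node's console with this command:
/path/to/spark/bin/spark-submit --jars /path/to/jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.0.jar spark_job.py
The JAR file is necessary because the Spark job streams messages from Kafka (running on the same cluster as the Spark job) and pushes some messages back to the same Kafka under a different topic.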
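I have already looked at some other answers on this site (mainly this one and this one), and they were somewhat helpful, but we have not been able to find where in the logs the cause of the executors dying might be stated. So far I have monitored the nodes through the YARN ResourceManager while the job runs and gone through the logs in the /var/logs/hadoop-yarn directory on every node. The only "clue" I could find in the logs is org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM, which is the only line written to the dead executor's logs.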
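A minimal sketch of how the log search described above could be automated, assuming the executor logs are plain-text files under the directory named in the question. The helper names `lines_before_marker` and `scan_log_dir` and the context size are hypothetical. It only shows the lines written just before each RECEIVED SIGNAL TERM entry, which is where a cause would appear if one was logged at all.

```python
import os
from collections import deque

MARKER = "RECEIVED SIGNAL TERM"


def lines_before_marker(lines, marker=MARKER, context=5):
    """Return (line_number, preceding_lines, line) for every line containing marker."""
    recent = deque(maxlen=context)  # rolling window of the previous lines
    hits = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if marker in line:
            hits.append((number, list(recent), line))
        recent.append(line)
    return hits


def scan_log_dir(log_dir, marker=MARKER, context=5):
    """Walk log_dir and print every marker hit together with its preceding lines."""
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    hits = lines_before_marker(handle, marker, context)
            except OSError:
                continue  # unreadable file (permissions, rotated log): skip it
            for number, before, line in hits:
                print(f"{path}:{number}")
                for prev in before:
                    print("    " + prev)
                print("  > " + line)


if __name__ == "__main__":
    scan_log_dir("/var/logs/hadoop-yarn")  # directory named in the question
```

If the context lines are empty as well, the signal was most likely sent from outside the executor JVM, so the NodeManager and ResourceManager logs for the same timestamp are the next place to look.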
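As a last-ditch effort, we tried increasing the cluster's memory in the hope that the problem would go away, but it did not. Originally the cluster ran with 1 master and 2 workers, each with 4 vCPUs and 15 GB of memory. We created a new Dataproc cluster, this time with 1 master and 3 workers, each worker having 8 vCPUs and 52 GB of memory (the master has the same specs as before).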
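What we would like to know is:
1. Where and how can I see the exception that is causing the executors to be terminated?
2. Is this a problem with how Spark is configured?
3. The Dataproc image version is "preview". Could that be the cause of the error?
And ultimately:
4. How do we resolve this issue? What other steps can we take?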
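This Spark job needs to stream from Kafka continuously and indefinitely, so we want the error fixed rather than merely postponed.
Here are some screenshots from the YARN ResourceManager to demonstrate what we are seeing:
The screenshots are from the run in which the Spark job stopped on the error.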
This is the Spark configuration file located at /path/to/spark/conf/spark-defaults.conf (nothing was changed from Dataproc's default settings):
spark.master yarn
spark.submit.deployMode client
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.eventLog.enabled true
spark.eventLog.dir hdfs://highmem-m/user/spark/eventlog
# Dynamic allocation on YARN
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.executor.instances 10000
spark.dynamicAllocation.maxExecutors 10000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0
spark.yarn.historyServer.address highmem-m:18080
spark.history.fs.logDirectory hdfs://highmem-m/user/spark/eventlog
spark.executor.cores 2
spark.executor.memory 4655m
spark.yarn.executor.memoryOverhead 465
# Overkill
spark.yarn.am.memory 4655m
spark.yarn.am.memoryOverhead 465
spark.driver.memory 3768m
spark.driver.maxResultSize 1884m
spark.rpc.message.maxSize 512
# Add ALPN for Bigtable
spark.driver.extraJavaOptions
spark.executor.extraJavaOptions
# Disable Parquet metadata caching as its URI re-encoding logic does
# not work for GCS URIs (b/28306549). The net effect of this is that
# Parquet metadata will be read both driver side and executor side.
spark.sql.parquet.cacheMetadata=false
# User-supplied properties.
#Mon Jul 24 23:12:12 UTC 2017
spark.executor.cores=4
spark.executor.memory=18619m
spark.driver.memory=3840m
spark.driver.maxResultSize=1920m
spark.yarn.am.memory=640m
spark.executorEnv.PYTHONHASHSEED=0
I am not quite sure where the "User-supplied properties" section came from.
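A minimal sketch for working out which values are actually in effect when a key appears twice in spark-defaults.conf. It assumes the file follows java.util.Properties conventions, where the key ends at the first whitespace, `=` or `:` and a later duplicate replaces an earlier one. The function name `effective_properties` is hypothetical, and the sample is a subset of the file above.

```python
import re

# Key, optional '=' or ':' separator, then the value (possibly empty).
LINE = re.compile(r"([^\s=:]+)\s*[=:]?\s*(.*)")


def effective_properties(conf_text):
    """Parse properties-style text; a later duplicate key replaces an earlier one."""
    props = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        match = LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2).strip()
    return props


SAMPLE = """\
spark.executor.cores 2
spark.executor.memory 4655m
spark.yarn.executor.memoryOverhead 465
# User-supplied properties.
spark.executor.cores=4
spark.executor.memory=18619m
"""

if __name__ == "__main__":
    props = effective_properties(SAMPLE)
    heap_mb = int(props["spark.executor.memory"].rstrip("m"))
    overhead_mb = int(props["spark.yarn.executor.memoryOverhead"])
    print(props["spark.executor.cores"], props["spark.executor.memory"])
    # Heap plus overhead is what each executor container requests,
    # before any rounding that YARN applies.
    print(heap_mb + overhead_mb, "MB per executor container")
```

Under that assumption the executors here run with 4 cores and an 18619m heap, and each container requests 18619 + 465 = 19084 MB, not the smaller defaults listed earlier in the file.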
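Edit:
Some additional information about the cluster:
I use the zookeeper, kafka, and jupyter initialization action scripts found at https://github.com/GoogleCloudPlatform/dataproc-initialization-actions, applied in the order zookeeper -> kafka -> jupyter (unfortunately I do not have enough reputation to post more than 2 links at the moment).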
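Edit 2:
Following @Dennis's insightful questions, we ran the Spark job while paying particular attention to the executors with higher on-heap storage memory usage. I noticed that the executor on worker #0 consistently has higher storage memory usage than the other executors. The stdout file of worker #0's executor is always empty. These three lines are repeated many times in stderr: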
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
They seem to repeat every 1~3 seconds.
The stdout and stderr of the other executors on the other worker nodes are all empty.
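A minimal sketch that measures the repeat interval directly from the stderr lines instead of estimating it by eye. It assumes each line starts with a `yy/MM/dd HH:mm:ss` timestamp as in the excerpt above. The function name `verify_intervals` is hypothetical, and the sample keeps only the "Verifying properties" lines from that excerpt.

```python
from datetime import datetime


def verify_intervals(log_lines, needle="Verifying properties"):
    """Return the gaps in seconds between consecutive log lines containing needle."""
    stamps = []
    for line in log_lines:
        line = line.strip()
        if needle in line:
            # The first 17 characters look like "17/07/27 16:32:01".
            stamps.append(datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


SAMPLE = [
    "17/07/27 16:32:%02d INFO kafka.utils.VerifiableProperties: Verifying properties" % s
    for s in (1, 4, 7, 9, 10, 13, 14, 15, 18)
]

if __name__ == "__main__":
    print(verify_intervals(SAMPLE))  # [3, 3, 2, 1, 3, 1, 1, 3]
```

The gaps match the 1~3 second estimate, which is the kind of cadence one would expect if a new Kafka consumer were being configured for every micro-batch, although the excerpt alone does not confirm that.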
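Edit 3:
As mentioned in @Dennis's comments, we had kept the Kafka topic that the Spark job consumes from at a replication factor of 1. I also found that I had forgotten to add worker #2 to zookeeper.connect in the Kafka config file, and had forgotten to give the consumer that streams messages from Kafka in Spark a group ID. I fixed both (and recreated the topic with a replication factor of 3) and observed that the workload now concentrates mostly on worker #1. Following @Dennis's suggestion, I ran sudo jps after SSH-ing into worker #1 and got the following output: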
[Removed this section to save character space; it was only the error messages from a failed call to jmap so it didn't hold any useful information]
Edit 4:
I am now seeing this in the stdout file of worker #1's executor:
2017-07-27 22:16:24
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
===Truncated===
Heap
PSYoungGen total 814592K, used 470009K [0x000000063c180000, 0x000000069e600000, 0x00000007c0000000)
eden space 799744K, 56% used [0x000000063c180000,0x0000000657e53598,0x000000066ce80000)
from space 14848K, 97% used [0x000000069d780000,0x000000069e5ab1b8,0x000000069e600000)
to space 51200K, 0% used [0x0000000698200000,0x0000000698200000,0x000000069b400000)
ParOldGen total 574464K, used 180616K [0x0000000334400000, 0x0000000357500000, 0x000000063c180000)
object space 574464K, 31% used [0x0000000334400000,0x000000033f462240,0x0000000357500000)
Metaspace used 49078K, capacity 49874K, committed 50048K, reserved 1093632K
class space used 6054K, capacity 6263K, committed 6272K, reserved 1048576K
and
2017-07-27 22:06:44
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
===Truncated===
Heap
PSYoungGen total 608768K, used 547401K [0x000000063c180000, 0x000000066a280000, 0x00000007c0000000)
eden space 601088K, 89% used [0x000000063c180000,0x000000065d09c498,0x0000000660c80000)
from space 7680K, 99% used [0x0000000669b00000,0x000000066a2762c8,0x000000066a280000)
to space 36864K, 0% used [0x0000000665a80000,0x0000000665a80000,0x0000000667e80000)
ParOldGen total 535552K, used 199304K [0x0000000334400000, 0x0000000354f00000, 0x000000063c180000)
object space 535552K, 37% used [0x0000000334400000,0x00000003406a2340,0x0000000354f00000)
Metaspace used 48810K, capacity 49554K, committed 49792K, reserved 1093632K
class space used 6054K, capacity 6263K, committed 6272K, reserved 1048576K
When the error happened, an executor on worker #2 received SIGNAL TERM and was labeled as dead. At that time it was the only dead executor.
Edit 5:
Again following @Dennis's suggestion (thank you, @Dennis!), this time I ran sudo -u yarn jmap -histo <pid>. These are the top 10 most memory-consuming classes in CoarseGrainedExecutorBackend after about 10 minutes:
num #instances #bytes class name
----------------------------------------------
1: 244824 358007944 [B
2: 194242 221184584 [I
3: 2062554 163729952 [C
4: 746240 35435976 [Ljava.lang.Object;
5: 738 24194592 [Lorg.apache.spark.unsafe.memory.MemoryBlock;
6: 975513 23412312 java.lang.String
7: 129645 13483080 java.io.ObjectStreamClass
8: 451343 10832232 java.lang.StringBuilder
9: 38880 10572504 [Z
10: 120807 8698104 java.lang.reflect.Field
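A minimal sketch for making a jmap -histo table easier to read: it decodes the JVM array descriptors (`[B` is byte[], `[I` is int[], `[C` is char[], `[Z` is boolean[], `[Lsome.Class;` is an array of that class) and converts bytes to MiB. The function name `parse_histo` is hypothetical, and the sample is the first four rows of the table above.

```python
import re

# JVM descriptors for primitive arrays as printed by jmap -histo.
ARRAY_NAMES = {"[B": "byte[]", "[I": "int[]", "[C": "char[]", "[Z": "boolean[]"}

# Rank, instance count, byte count, class name.
ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)\s*$")


def parse_histo(text):
    """Return (readable_class_name, instances, bytes) for each histogram row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator, or blank line
        name = match.group(4)
        if name.startswith("[L") and name.endswith(";"):
            name = name[2:-1] + "[]"  # object array, e.g. [Ljava.lang.Object;
        else:
            name = ARRAY_NAMES.get(name, name)
        rows.append((name, int(match.group(2)), int(match.group(3))))
    return rows


SAMPLE = """\
 num #instances #bytes class name
----------------------------------------------
 1: 244824 358007944 [B
 2: 194242 221184584 [I
 3: 2062554 163729952 [C
 4: 746240 35435976 [Ljava.lang.Object;
"""

if __name__ == "__main__":
    for name, instances, size in parse_histo(SAMPLE):
        print(f"{name:<22} {instances:>9} {size / 2**20:8.1f} MiB")
```

Read this way, the largest entry is roughly 341 MiB of byte arrays, which is small next to an 18619m executor heap.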
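In addition, I ran into a new type of error that caused an executor to die. It produced some failed tasks highlighted in the Spark UI, and I found this in the executor's stderr: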
17/07/28 00:44:03 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 6821.0 (TID 2585)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/07/28 00:44:03 ERROR org.apache.spark.executor.Executor: Exception in task 0.1 in stage 6821.0 (TID 2586)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/07/28 00:44:03 ERROR org.apache.spark.util.Utils: Uncaught exception in thread stdout writer for /opt/conda/bin/python
java.lang.AssertionError: assertion failed: Block rdd_5480_0 is not locked for reading
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
17/07/28 00:44:03 ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /opt/conda/bin/python,5,main]
java.lang.AssertionError: assertion failed: Block rdd_5480_0 is not locked for reading
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Edit 6:
This time I took the jmap histogram after running for 40 minutes:
num #instances #bytes class name
----------------------------------------------
1: 23667 391136256 [B
2: 25937 15932728 [I
3: 159174 12750016 [C
4: 334 10949856 [Lorg.apache.spark.unsafe.memory.MemoryBlock;
5: 78437 5473992 [Ljava.lang.Object;
6: 125322 3007728 java.lang.String
7: 40931 2947032 java.lang.reflect.Field
8: 63431 2029792 com.esotericsoftware.kryo.Registration
9: 20897 1337408 com.esotericsoftware.kryo.serializers.UnsafeCacheFields$UnsafeObjectField
10: 20323 975504 java.util.HashMap
And these are the results of ps ux:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
yarn 601 0.8 0.9 3008024 528812 ? Sl 16:12 1:17 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dproc_nodema
yarn 6086 6.3 0.0 96764 24340 ? R 18:37 0:02 /opt/conda/bin/python -m pyspark.daemon
yarn 8036 8.2 0.0 96296 24136 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8173 9.4 0.0 97108 24444 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8240 9.0 0.0 96984 24576 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8329 7.6 0.0 96948 24720 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8420 8.5 0.0 96240 23788 ? R 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8487 6.0 0.0 96864 24308 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8554 0.0 0.0 96292 23724 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8564 0.0 0.0 19100 2448 pts/0 R+ 18:37 0:00 ps ux
yarn 31705 0.0 0.0 13260 2756 ? S 17:56 0:00 bash /hadoop/yarn/nm-local-dir/usercache/<user_name>/app
yarn 31707 0.0 0.0 13272 2876 ? Ss 17:56 0:00 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java
yarn 31713 0.4 0.7 2419520 399072 ? Sl 17:56 0:11 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx6
yarn 31771 0.0 0.0 13260 2740 ? S 17:56 0:00 bash /hadoop/yarn/nm-local-dir/usercache/<user_name>/app
yarn 31774 0.0 0.0 13284 2800 ? Ss 17:56 0:00 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java
yarn 31780 11.1 1.4 21759016 752132 ? Sl 17:56 4:31 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx1
yarn 31883 0.1 0.0 96292 27308 ? S 17:56 0:02 /opt/conda/bin/python -m pyspark.daemon
In this case, the pid of CoarseGrainedExecutorBackend is 31780.
Edit 7:
Increasing heartbeatInterval in the Spark settings did not change anything, which makes sense in hindsight.
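I created a short bash script that reads from Kafka with the console consumer for 5 seconds and writes the messages to a text file. The text file is then uploaded to Hadoop, where Spark streams from it. We used this setup to test whether the timeout was related to Kafka.
So we moved on with the assumption that Kafka has nothing to do with the timeout.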
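A minimal Python sketch of the capture step of such a script, assuming the consumer is an ordinary command that writes messages to stdout. The function name `capture_for`, the output file name, and the consumer path in the comment are hypothetical; the actual console consumer arguments are left to whatever invocation is already in use.

```python
import subprocess


def capture_for(command, seconds, out_path):
    """Run command with stdout redirected to out_path and stop it after `seconds`."""
    with open(out_path, "wb") as out:
        proc = subprocess.Popen(command, stdout=out)
        try:
            proc.wait(timeout=seconds)
        except subprocess.TimeoutExpired:
            proc.terminate()  # ask the consumer to exit
            proc.wait()       # reap it so the output file is complete
    return proc.returncode


# Hypothetical usage; fill in the consumer arguments you already use:
# capture_for(["/path/to/kafka/bin/kafka-console-consumer.sh", "<your arguments>"],
#             5, "messages.txt")
```

The resulting file can then be uploaded with a command such as `hdfs dfs -put` into the directory the streaming job watches.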
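We installed Stackdriver Monitoring to watch memory usage when the timeout occurs. Nothing in the metrics stood out; memory usage looked relatively stable throughout (hovering around 10~15% at most on the busiest nodes).
We guessed that something in the communication between the worker nodes might be causing the problem. Our data traffic is currently very low, so even a single worker can handle the whole workload with relative ease.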
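Running the Spark job on a single-node cluster while streaming from Kafka brokers on a different cluster seems to have stopped the SocketTimeout... except that the AssertionError documented above now occurs frequently.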
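Per @Dennis's suggestion, I created a new cluster (also single node), this time without the jupyter initialization script, which means Spark now runs on Python v2.7.9 (without Anaconda). On the first run, Spark hit an AssertionError within 15 seconds. The second run lasted a little over 2 hours and failed with a SocketTimeoutException. I am starting to suspect this is a problem in Spark's internals. The third run lasted about 40 minutes and then ran into an AssertionError.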
Answer 0 (score: 1)
A client of mine was seeing various production PySpark jobs (Spark version 2.2.1) fail intermittently on Google Cloud Dataproc with a stack trace very similar to yours:
ERROR org.apache.spark.api.python.PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:711)
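I found that disabling IPv6 on the Dataproc cluster VMs seemed to fix the issue. One way to do that is to add these lines to a Dataproc initialization script so that they run at cluster creation time: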
printf "\nnet.ipv6.conf.default.disable_ipv6 = 1\nnet.ipv6.conf.all.disable_ipv6=1\n" >> /etc/sysctl.conf
sysctl -p
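To confirm on a node that the setting took effect, here is a minimal Python sketch that reads the kernel's view of net.ipv6.conf.all.disable_ipv6 from /proc. The function name `ipv6_disabled` is hypothetical, and the check only reports the current value; it does not change anything.

```python
def ipv6_disabled(path="/proc/sys/net/ipv6/conf/all/disable_ipv6"):
    """Return True if IPv6 is reported as disabled, False if enabled, None if unknown."""
    try:
        with open(path) as handle:
            return handle.read().strip() == "1"
    except OSError:
        return None  # file missing or unreadable, e.g. not a Linux host


if __name__ == "__main__":
    print("IPv6 disabled:", ipv6_disabled())
```

Running it on the master and on each worker after cluster creation shows whether the initialization script was applied on every VM.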