Understanding why smaller executors fail while larger executors succeed

Asked: 2019-04-13 01:50:49

Tags: apache-spark pyspark

I have a job that parses roughly 1 TB of JSON-formatted data, split into files of about 20 MB each (the data actually arrives as roughly 1 GB per minute).

The job parses, filters and transforms this data and writes it back out to another path. However, whether it runs at all depends on the Spark configuration.
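
For context, a minimal sketch of the kind of parse/filter/transform/write job described here; the paths, the event_type filter and the to_date transform are illustrative placeholders, not the actual pipeline:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName('parse-filter-transform').getOrCreate()

    # Read the ~20 MB JSON files (paths below are placeholders).
    df = spark.read.json("s3://input/path/*.json")

    # Hypothetical filter and transform, then write back to another path.
    out = (df
           .filter(F.col("event_type") == "relevant")
           .withColumn("day", F.to_date(F.col("timestamp"))))
    out.write.mode("overwrite").json("s3://output/path")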

The cluster consists of 46 nodes, each with 96 cores and 768 GB of memory. The driver has the same specs.

I submit the job in standalone mode, and:

  1. Using 22g and 3 cores per executor, the job fails with GC overhead and OOM errors:
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco                                              
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value                              
py4j.protocol.Py4JJavaError19/04/13 01:35:32 WARN TransportChannelHandler: Exception in connection from /10.0.118.151:34014      
java.lang.OutOfMemoryError: GC overhead limit exceeded                                                                          
        at com.sun.security.sasl.digest.DigestMD5Base$DigestIntegrity.getHMAC(DigestMD5Base.java:1060)                           
        at com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1470)                             
        at com.sun.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:213)                                           
        at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:150)                                        
        at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:126)                       
        at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:101)                        
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)                          
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)              
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)             
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)                
        at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)                       
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)              
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)              
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)                
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)                     
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)             
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)             
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)                              
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)                      
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)                                                              
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)                                 
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)                                                                                              
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)                                                                                     
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)                         
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)             
        at java.lang.Thread.run(Thread.java:748)                                                                                 
: An error occurred while calling o54.json.                                                                                                                                                                                              
: java.lang.OutOfMemoryError: GC overhead limit exceeded  
  2. Using 120g and 15 cores per executor, the job succeeds (see the per-core comparison below).
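
For reference, a back-of-the-envelope comparison of the two settings (plain Python arithmetic on the numbers above) shows the per-core heap is roughly the same in both cases:

    # Per-core heap for the two executor configurations above.
    failing = {"memory_gb": 22, "cores": 3}    # fails with GC overhead / OOM
    working = {"memory_gb": 120, "cores": 15}  # succeeds

    for name, cfg in (("failing", failing), ("working", working)):
        print(f"{name}: {cfg['memory_gb'] / cfg['cores']:.1f} GB per core")
    # failing: 7.3 GB per core
    # working: 8.0 GB per core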

Why does the job fail with the smaller memory/core settings?

Note: An explode operation may or may not also be involved. Edit: not related. I tested the code with just a simple spark.read.json().count().show() and it still hit GC overhead and OOM.

My current pet theory is that the large number of small files leads to high shuffle overhead. Is that what is going on, and is there any way around it (other than re-aggregating the files separately beforehand)?
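
One way to probe the small-files theory is to check how many read partitions Spark plans for the input directory (at roughly 20 MB per file, 1 TB is on the order of 50,000 files); a minimal sketch, reusing the input path from the repro code below:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('partition-check').getOrCreate()

    # See how the ~20 MB input files map onto read partitions / tasks.
    df = spark.read.json("s3://path/to/file")
    print(df.rdd.getNumPartitions())

    # spark.sql.files.maxPartitionBytes (default 128 MB) caps how many bytes Spark
    # packs into a single read partition, which together with the file count
    # determines the number of read tasks.
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))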

Code by request: Launcher

./bin/spark-submit --master spark://0.0.0.0:7077 \
--conf "spark.executor.memory=90g" \
--conf "spark.executor.cores=12" \
--conf 'spark.default.parallelism=7200' \
--conf 'spark.sql.shuffle.partitions=380' \
--conf "spark.network.timeout=900s" \
--conf "spark.driver.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
launcher.py

Code

    from pyspark.sql import SparkSession

    # Minimal repro: just reading and counting the JSON triggers the same GC/OOM.
    spark = SparkSession.builder \
        .appName('Rewrites by Frequency') \
        .getOrCreate()
    spark.read.json("s3://path/to/file").count()

0 Answers:

There are no answers yet.