'Premature end of Content-Length' in a Spark application using s3a

Asked: 2017-08-21 21:17:25

Tags: amazon-web-services apache-spark amazon-s3 spark-dataframe emr

I am writing a Spark-based application that processes a very large dataset stored on S3: about 15 TB uncompressed. The data is stored in many small LZO-compressed files, varying in size from 10 to 100 MB.

By default, the job generates about 130k tasks while reading the dataset and mapping it to the schema.

It then fails after roughly 70k tasks have completed and about 20 tasks have failed.

Exception:

WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
    org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body

It looks like the S3 connection is being closed prematurely.

I have tried nearly 40 different combinations of configuration settings.

To summarize the combinations:

  - 3 executors per node, with --executor-memory from 18 GB to 42 GB
  - --executor-cores from 3 to 5
  - spark.yarn.executor.memoryOverhead from 1.8 GB to 4.0 GB
  - both Kryo and the default Java serializer
  - spark.memory.storageFraction from 0.5 down to 0.35
  - the default, 130,000, and 200,000 partitions for the larger dataset
  - spark.sql.shuffle.partitions at the default, 200, and 2001

Most importantly: the fs.s3a.connection.maximum property from 100 to 2048.

[This seems to be the property most relevant to the exception.]

[In all cases the driver was set to memory = 51 GB and cores = 12, and caching used the MEMORY_AND_DISK_SER storage level.]
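
For concreteness, one of the tried combinations sketched as SparkSession builder settings (illustrative only; in the real job these values are passed as spark-submit flags, and the app name is a placeholder):

    import org.apache.spark.sql.SparkSession

    // Illustrative sketch of one tried combination; in practice these values
    // were supplied as spark-submit flags rather than hard-coded.
    val sparkSession = SparkSession.builder()
      .appName("s3a-lzo-aggregation")                        // placeholder name
      .config("spark.executor.memory", "42g")                // tried 18g to 42g
      .config("spark.executor.cores", "5")                   // tried 3 to 5
      .config("spark.yarn.executor.memoryOverhead", "4096")  // in MB; tried 1.8 to 4.0 GB
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // also tried default Java
      .config("spark.memory.storageFraction", "0.35")        // tried 0.5 and 0.35
      .config("spark.sql.shuffle.partitions", "2001")        // tried default, 200, 2001
      .getOrCreate()

    // fs.s3a.connection.maximum (100 to 2048) goes on the Hadoop
    // configuration instead, as shown in the code further below.
    sparkSession.sparkContext.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 2048)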

Nothing worked!

If I run the program on half of the larger dataset (7.5 TB), it completes successfully within 1.5 hours.

  1. What could I be doing wrong?
  2. How do I determine the optimal value for fs.s3a.connection.maximum?
  3. Is it possible that the S3 client is being GCed?
  4. Any help would be appreciated!

    Environment:

    AWS EMR 5.7.0, 60 x i2.2xlarge SPOT instances (16 vCPUs, 61 GB RAM, 2 x 800 GB SSDs), Spark 2.1.0

    YARN is used as the resource manager.

    Code:

    It is a fairly simple job that does something like this:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.count
    import org.apache.spark.storage.StorageLevel

    val sl = StorageLevel.MEMORY_AND_DISK_SER
    
    sparkSession.sparkContext.hadoopConfiguration.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
    sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sparkSession.sparkContext.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 1200)
    
    val dataset_1: DataFrame = sparkSession
        .read
        .format("csv")
        .option("delimiter", ",")
        .schema(<schema: StructType>)
        .csv("s3a://...")
        .select("ID")   //15 TB
    
    dataset_1.persist(sl)
    
    print(dataset_1.count())
    
    val tmp = dataset_1.groupBy("ID").agg(count("*").alias("count_id"))
    val tmp2 = tmp.groupBy("count_id").agg(count("*").alias("count_count_id"))
    tmp2.write.csv(...)
    
    dataset_1.unpersist()
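
    A variant I am considering (hypothetical, not tested yet): derive the total row count from the first aggregation instead of running a separate count() over the persisted 15 TB dataset, so dataset_1 would not need to be persisted at all:

    // Hypothetical rewrite: tmp is tiny compared to dataset_1, so persist it
    // instead and recover the total row count as the sum of the per-ID counts.
    import org.apache.spark.sql.functions.sum

    val tmp = dataset_1.groupBy("ID").agg(count("*").alias("count_id"))
    tmp.persist(sl)

    print(tmp.agg(sum("count_id")).first().getLong(0)) // same value as dataset_1.count()

    val tmp2 = tmp.groupBy("count_id").agg(count("*").alias("count_count_id"))
    tmp2.write.csv(...) // same elided output path as above

    tmp.unpersist()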
    

    Full stack trace:

    17/08/21 20:02:36 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
    17/08/21 20:06:18 WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
    org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 79627927; received: 19388396)
            at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
            at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
            at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
            at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
            at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
            at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
            at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
            at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
            at java.io.DataInputStream.read(DataInputStream.java:149)
            at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:73)
            at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:321)
            at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:261)
            at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
            at java.io.InputStream.read(InputStream.java:101)
            at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
            at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
            at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
            at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:186)
            at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
            at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
            at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
            at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
            at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
            at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
            at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
            at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:99)
            at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:91)
            at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:364)
            at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1021)
            at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
            at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
            at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
            at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
            at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    

    Edit: We have another service that consumes exactly the same logs, and it works just fine. But it uses the old "s3://" scheme and is based on Spark 1.6. I will try using "s3://" instead of "s3a://".
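
    What I plan to try, roughly (on EMR, the "s3://" scheme is served by EMRFS rather than the Hadoop S3A client, so it bypasses the s3a connection pool entirely):

    // Hypothetical scheme switch: the read is identical, only "s3a://" becomes "s3://".
    val dataset_1: DataFrame = sparkSession
        .read
        .format("csv")
        .option("delimiter", ",")
        .schema(<schema: StructType>)   // same schema placeholder as above
        .csv("s3://...")                // EMRFS instead of S3A
        .select("ID")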

0 Answers:

No answers yet.