Spark unable to recover from checkpoint — ProvisionedThroughputExceededException: Rate exceeded for shard shardId

Date: 2019-12-24 18:12:17

Tags: apache-spark amazon-kinesis checkpoint

I have a Spark application that consumes from a Kinesis stream with 6 shards. At most 2,000 records per second are produced into Kinesis; during off-peak hours the data comes in at only about 200 records per second. Each record is about 0.5 KB, so 6 shards should be plenty to handle the load.
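
To spell out the shard math (based on the published Kinesis per-shard limits: 1 MB/s or 1,000 records/s for writes, 2 MB/s and 5 GetRecords calls/s for reads):

    peak write volume: 2,000 records/s x 0.5 KB/record = 1 MB/s
    shards needed:     max(2,000 / 1,000 records/s, 1 MB/s / 1 MB/s) = 2
    provisioned:       6 shards, i.e. roughly 3x headroom at peak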

I am using EMR 5.23.0 with Spark 2.4.0 and spark-streaming-kinesis-asl 2.4.0. My cluster has 6 r5.4xlarge nodes, so there is more than enough memory.

Recently I have been trying to checkpoint the application to S3. I am testing during off-peak hours, so the incoming rate is very low, around 200 records per second. I ran the Spark application by creating a new context, and it wrote the checkpoint to S3; but after I killed the application and restarted it, it failed to recover from the checkpoint, my Spark UI showed all batches stuck, and I got the error pasted further below.

My batch interval is 2.5 minutes, and I create the stream the regular way:
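
(A minimal sketch of that setup; the stream name, region, checkpoint path, and KCL app name below are placeholders rather than my exact values.)

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

    val checkpointDir = "s3://my-bucket/spark-checkpoints"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kinesis-consumer")
      val ssc = new StreamingContext(conf, Seconds(150))     // 2.5-minute batches
      ssc.checkpoint(checkpointDir)

      val stream = KinesisInputDStream.builder
        .streamingContext(ssc)
        .streamName("my-stream-name")
        .regionName("us-east-1")                             // placeholder region
        .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
        .initialPosition(new KinesisInitialPositions.TrimHorizon())
        .checkpointAppName("my-kcl-app")                     // DynamoDB lease table name
        .checkpointInterval(Seconds(150))
        .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
        .build()

      stream.foreachRDD { rdd => /* processing elided */ }
      ssc
    }

    // Recover from the S3 checkpoint if one exists, otherwise build a fresh context
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

And here is the error after the restart: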

19/12/24 00:15:21 WARN TaskSetManager: Lost task 571.0 in stage 33.0 (TID 4452, ip-172-17-32-11.ec2.internal, executor 9): org.apache.spark.SparkException: Gave up after 3 retries while getting shard iterator from sequence number 49601654074184110438492229476281538439036626028298502210, last exception:
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getKinesisIterator(KinesisBackedBlockRDD.scala:246)
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:206)
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: Rate exceeded for shard shardId-000000000004 in stream my-stream-name under account my-account-number. (Service: AmazonKinesis; Status Code: 400; Error Code: ProvisionedThroughputExceededException; Request ID: e368b876-c315-d0f0-b513-e2af2bd14525)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2782)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2749)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2738)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetShardIterator(AmazonKinesisClient.java:1383)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.getShardIterator(AmazonKinesisClient.java:1355)
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$3.apply(KinesisBackedBlockRDD.scala:247)
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$3.apply(KinesisBackedBlockRDD.scala:247)
        at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269)
        ... 20 more

Someone has reported the same problem, but there seems to be no answer:

http://mail-archives.apache.org/mod_mbox/spark-issues/201807.mbox/%3CJIRA.13175528.1532948992000.116869.1532949000171@Atlassian.JIRA%3E ("Checkpointing records with Amazon KCL throws ProvisionedThroughputExceededException")

Since consuming from Kinesis with checkpointing is such a common thing for Spark applications to do, I wonder whether I am doing something wrong. Has anyone else run into the same situation, and how did you solve it?

I wonder whether this is because my batch interval of 2.5 minutes (150 seconds) is too long: 150 s × 200 records/s = 30,000 records per batch, and when checkpoint recovery tries to load those 30,000 records from Kinesis, it throws the error? Should I increase the shard count from 6 to 30?
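
Doing the arithmetic on the recovery load: the failing stage contains hundreds of tasks (the trace shows task 571.0, TID 4452), and the KinesisSequenceRangeIterator frames in the trace suggest that during recovery each task re-fetches its block's sequence range directly from Kinesis, since the received blocks are gone after the restart. Hundreds of near-simultaneous GetShardIterator/GetRecords calls spread across only 6 shards would far exceed the per-shard read limits (5 GetRecords calls/s and 2 MB/s), which matches the ProvisionedThroughputExceededException thrown from getShardIterator. The "Gave up after 3 retries" line also matches the default retry settings of the Kinesis-backed block RDD; if I understand the code correctly, these can be raised through Spark conf, something like the sketch below (the values are guesses on my part, not tested recommendations):

    import org.apache.spark.SparkConf

    // Assumed mitigation, untested: defaults are 3 attempts with a 100ms wait.
    // Raising them might let recovery tasks ride out the read throttling.
    val conf = new SparkConf()
      .set("spark.streaming.kinesis.retry.maxAttempts", "10")
      .set("spark.streaming.kinesis.retry.waitTime", "1000ms")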

Please help. I have to find an answer to this, and it is getting frustrating.

Thanks for your help.

[Image: Kinesis monitoring charts while the checkpoint-recovery error is happening; read throughput exceeded is visible]

[Image: Spark UI showing jobs stuck while the checkpoint-recovery error is happening]

0 Answers