Spark Streaming errors from Kafka causing data loss

Date: 2017-06-04 06:59:31

Tags: apache-spark pyspark apache-kafka spark-streaming

I have a Spark Streaming application written in Python that collects data from Kafka and stores it on the file system. When I run it, I see a lot of "holes" in the collected data. After analyzing the logs, I realized that 285,000 of the 302,000 jobs failed, all with the same exception:
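For reference, the consuming side of such an application typically looks like the minimal sketch below. This is a reconstruction, not the actual job: it assumes the spark-streaming-kafka-0-8 direct stream API (the code path that appears in the stack trace), and the broker address, topic name and output path are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaToFS")
    ssc = StreamingContext(sc, 10)  # 10-second batches (placeholder interval)

    # Direct stream: each Spark partition reads a Kafka offset range itself,
    # which is the KafkaRDD code path that raises OffsetOutOfRangeException
    # when the requested offsets have already been deleted by retention.
    kafkaParams = {"metadata.broker.list": "broker1:9092"}
    stream = KafkaUtils.createDirectStream(ssc, ["my_topic"], kafkaParams)

    # Records arrive as (key, value) pairs; keep the value and dump to files.
    stream.map(lambda kv: kv[1]).saveAsTextFiles("/data/collected/events")

    ssc.start()
    ssc.awaitTermination()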

Job aborted due to stage failure: Task 4 in stage 604348.0 failed 1 times, 
most recent failure: Lost task 4.0 in stage 604348.0 (TID 2097738, localhost): 
kafka.common.OffsetOutOfRangeException
    at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at java.lang.Class.newInstance(Class.java:442)
    at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

I know this exception is thrown when trying to access an offset that no longer exists in Kafka. My Kafka topic has a retention of 1 hour, and my theory is that some of my jobs got stuck for more than an hour, so that by the time they resumed, the data was no longer available in the Kafka queue. I could not reproduce the problem on a small scale, even with a very short retention, so I would like to know whether jobs really can get stuck and released the way I assume (and how I can avoid it), or whether I need to look in a completely different direction.
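If the jobs really are falling behind the 1-hour retention, one way to reduce the risk is to cap how much each batch fetches per partition and let Spark adapt the ingestion rate, so that a single slow batch cannot snowball into an offset range the broker has already deleted. A sketch of that configuration, again assuming the 0-8 direct stream integration; the rate limit value is arbitrary:

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    conf = (SparkConf()
            .setAppName("KafkaToFS")
            # Adapt the fetch rate to the observed processing speed.
            .set("spark.streaming.backpressure.enabled", "true")
            # Hard cap on records fetched per partition per second, so one
            # delayed batch cannot grow into an offset range older than the
            # topic's retention window.
            .set("spark.streaming.kafka.maxRatePerPartition", "1000"))

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)

    kafkaParams = {
        "metadata.broker.list": "broker1:9092",
        # On a fresh start (no saved offsets), begin from the oldest offset
        # still on the broker instead of requesting one that retention has
        # already removed; it does not rescue offsets stored in a checkpoint.
        "auto.offset.reset": "smallest",
    }
    stream = KafkaUtils.createDirectStream(ssc, ["my_topic"], kafkaParams)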

0 Answers:

No answers yet.