Losing messages in Spark Streaming when using checkpoints

Time: 2017-02-28 10:37:22

Tags: apache-spark pyspark apache-kafka spark-streaming

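I am reading from Kafka with Spark Streaming and checkpointing enabled, but some messages are lost when the reader is restarted.

Code that generates the test stream: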

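from kafka import KafkaProducer
import time

p = KafkaProducer(bootstrap_servers='kafka.dev:9092')
# Send one numbered JSON message every 2 seconds so gaps are easy to spot.
for i in range(1000):
    time.sleep(2)
    # Without a value_serializer, kafka-python expects the value as bytes.
    p.send('y_test', value=('{"test": ' + str(i) + '}').encode('utf-8'))
# Make sure buffered messages are delivered before the script exits.
p.flush()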

Code that reads from Kafka:

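from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def createContext():
    # Called only when no checkpoint exists yet in 'checkpoint_dir'.
    sc = SparkContext(appName='test_app')
    ssc = StreamingContext(sc, 60)  # 60-second batch interval
    # Receiver-based stream: connects through ZooKeeper, one consumer thread for 'y_test'.
    kafkaStream = KafkaUtils.createStream(ssc, zkQuorum='kafka.dev:2181',
            groupId='test_app', topics={'y_test': 1})
    # Write each batch to its own output directory with the prefix 'test_dir/'.
    kafkaStream.saveAsTextFiles('test_dir/')
    ssc.checkpoint('checkpoint_dir')
    return ssc

# Restore the context from the checkpoint if one exists, otherwise build a new one.
context = StreamingContext.getOrCreate('checkpoint_dir', createContext)
context.start()
context.awaitTermination()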

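How I check:

1) Start the reading code

2) Start the generating code

3) Restart the reading code

4) Read the output from HDFS and see that some messages are missing: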

{"test": 1}
{"test": 2}
{"test": 3}
{"test": 4}
{"test": 8}
{"test": 9}
{"test": 10}
{"test": 11}
{"test": 12}
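Spotting a gap like this by eye gets tedious once the output spans many batch directories. Below is a minimal sketch of a helper that lists the missing sequence numbers. It assumes each output line is a bare JSON object exactly as shown above; the function name find_missing and the hard-coded sample list are hypothetical and only for illustration, and lines read from the real output may need extra parsing first.

```python
import json

def find_missing(lines, key="test"):
    """Return the sorted ids absent between the smallest and largest id seen."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seen.add(json.loads(line)[key])
    if not seen:
        return []
    return [i for i in range(min(seen), max(seen) + 1) if i not in seen]

# Sample taken from the output shown in this post.
sample = [
    '{"test": 1}', '{"test": 2}', '{"test": 3}', '{"test": 4}',
    '{"test": 8}', '{"test": 9}', '{"test": 10}', '{"test": 11}',
    '{"test": 12}',
]
missing = find_missing(sample)
print(missing)  # [5, 6, 7]
```

For the sample above this reports 5, 6 and 7, the messages produced around the time the reader was restarted.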

Kafka 0.9.0.0, Spark 1.6.0

0 answers:

No answers