I am reading from Kafka with Spark Streaming and checkpointing, but some messages are lost.
Code that generates the test stream:
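from kafka import KafkaProducer
import time

p = KafkaProducer(bootstrap_servers='kafka.dev:9092')
for i in range(1000):
    time.sleep(2)  # one message every 2 seconds
    # kafka-python expects bytes unless a value_serializer is configured
    p.send('y_test', value=('{"test": ' + str(i) + '}').encode('utf-8'))
p.flush()  # send() is asynchronous; flush before the script exits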
Code that reads from Kafka:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def createContext():
    # Only called when no checkpoint exists in 'checkpoint_dir' yet.
    sc = SparkContext(appName='test_app')
    ssc = StreamingContext(sc, 60)  # 60-second batch interval
    # Receiver-based stream that consumes through ZooKeeper.
    kafkaStream = KafkaUtils.createStream(ssc, zkQuorum='kafka.dev:2181',
                                          groupId='test_app', topics={'y_test': 1})
    kafkaStream.saveAsTextFiles('test_dir/')
    ssc.checkpoint('checkpoint_dir')
    return ssc

# Recover the context from the checkpoint if one exists, otherwise build a new one.
context = StreamingContext.getOrCreate('checkpoint_dir', createContext)
context.start()
context.awaitTermination()
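How I checked:
1) Start the reading code
2) Start the generating code
3) Restart the reading code
4) Read the output from HDFS and see that data is missing: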
{"test": 1}
{"test": 2}
{"test": 3}
{"test": 4}
{"test": 8}
{"test": 9}
{"test": 10}
{"test": 11}
{"test": 12}
Kafka 0.9.0.0, Spark 1.6.0
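To make the check in step 4 repeatable, the gap can be found programmatically instead of by eye. The following is a minimal sketch, assuming each output line is a JSON object with an integer "test" field, exactly as in the listing above. The helper name find_missing and the sample list are hypothetical and only illustrate the check; they are not part of the original job.

```python
import json

def find_missing(lines):
    """Return the 'test' values absent between the smallest and largest value seen."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seen.add(json.loads(line)["test"])
    if not seen:
        return []
    return [i for i in range(min(seen), max(seen) + 1) if i not in seen]

# Sample copied from the output shown in step 4 (assumed format).
sample = [
    '{"test": 1}', '{"test": 2}', '{"test": 3}', '{"test": 4}',
    '{"test": 8}', '{"test": 9}', '{"test": 10}', '{"test": 11}',
    '{"test": 12}',
]
print(find_missing(sample))  # prints [5, 6, 7]
```

In practice the lines would come from the part files that saveAsTextFiles writes under the 'test_dir/' prefix. If the saved lines are not bare JSON, the parsing step would need adjusting.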