pyspark Kafka DirectStream: how to get all data from a topic

Asked: 2019-02-04 18:05:48

Tags: pyspark apache-kafka spark-streaming

I created a direct stream between pyspark and Kafka with the following code:

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)  # 2-second batch interval
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])  # each record is (key, value); keep the value
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
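A script like this is typically launched with spark-submit, passing the broker list and topic as the two positional arguments. With Spark 2.x, the 0.8 connector that provides KafkaUtils.createDirectStream can be pulled in via --packages; the script name and version below are placeholders to adjust for your setup:

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 direct_kafka_wordcount.py broker1:9092 my_topic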

The problem is that once the stream is started, I only receive data published to Kafka after that point. Is there any way to also read the older data that was already in the topic?

0 Answers:

No answers yet
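For reference, this is the documented default of the direct approach: with no stored offsets, it starts consuming from the latest offset of each Kafka partition. Setting auto.offset.reset to smallest in the Kafka parameters makes it start from the earliest available offset instead, so messages published before the job started are also read. A minimal sketch, assuming the same Kafka 0.8 direct connector used in the question (newer consumer configs call this value earliest):

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    # "smallest" = begin at the earliest offset of each partition when no
    # offsets have been stored; the default ("largest") yields only new data.
    kafkaParams = {"metadata.broker.list": brokers,
                   "auto.offset.reset": "smallest"}
    kvs = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams)
    kvs.map(lambda x: x[1]).pprint()
    ssc.start()
    ssc.awaitTermination()

Note this only takes effect when no offsets have been saved; if the streaming context is recovered from a checkpoint, the stream resumes from the checkpointed offsets instead.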