When I create a stream from a Kafka topic and print its contents:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext(appName="PythonStreamingKafkaWords")
ssc = StreamingContext(sc, 10)
lines = KafkaUtils.createDirectStream(ssc, ['sample_topic'], {"bootstrap.servers": 'localhost:9092'})
lines.pprint()
ssc.start()
ssc.awaitTermination()
I get an empty result:
-------------------------------------------
Time: 2019-12-07 13:11:50
-------------------------------------------
-------------------------------------------
Time: 2019-12-07 13:12:00
-------------------------------------------
-------------------------------------------
Time: 2019-12-07 13:12:10
-------------------------------------------
Meanwhile, it works from the console:
kafka-console-consumer --topic sample_topic --from-beginning --bootstrap-server localhost:9092
and correctly gives me all the text lines in the Kafka topic:
ham Ok lor... Sony ericsson salesman... I ask shuhui then she say quite gd 2 use so i considering...
ham Ard 6 like dat lor.
ham Why don't you wait 'til at least wednesday to see if you get your .
ham Huh y lei...
spam REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode
spam This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
ham Will ü b going to esplanade fr home?
. . .
What is the correct way to stream data from a Kafka topic into a Spark Streaming application?
Answer 0 (score: 1)
Given your code, you cannot print a streaming RDD directly; you should print it via foreachRDD instead. DStream.foreachRDD is an "output operator" in Spark Streaming. It gives you access to the underlying RDDs of the DStream so you can run actions that actually do something with the data.
What's the meaning of DStream.foreachRDD function?
Note: you can still achieve this with Structured Streaming. ref: Pyspark Structured streaming processing
Sample working code: this code reads messages from a Kafka topic and prints them. You can modify it to suit your needs.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def handler(message):
    # collect the records of this micro-batch on the driver and print the values
    records = message.collect()
    for record in records:
        print(record[1])

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 10)
    kvs = KafkaUtils.createDirectStream(ssc, ['topic_name'],
                                        {"metadata.broker.list": 'localhost:9092'})
    # register the handler as an output operation on the stream
    kvs.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
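To run this sample as a standalone script, the Kafka streaming package has to be on the classpath, either through spark-submit --packages or, as in the question, through PYSPARK_SUBMIT_ARGS set before the SparkContext is created. A minimal sketch of the latter, assuming Spark 2.4.x built against Scala 2.11 (adjust the artifact version to your installation):

import os

# must run before the SparkContext is created in main()
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.3 pyspark-shell'
)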
Answer 2 (score: 0)
The reason you do not see any data in the streaming output is that, by default, Spark Streaming starts reading from latest. So if you start the Spark Streaming application first and only then write data to Kafka, you will see output in the streaming job. See the documentation here:
By default, it will start consuming from the latest offset of each Kafka partition
But you can also read data from any specific offset of a topic. Take a look at the createDirectStream method here. It takes a dict parameter fromOffsets in which you can specify the starting offset for each partition of a topic.
I have tested the following code with Kafka 2.2.0, Spark 2.4.3 and Python 3.7.3:
Start the pyspark shell with the Kafka dependency:
pyspark --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.0
Run the following code:
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1)
topicPartition = TopicAndPartition('test', 0)   # partition 0 of topic 'test'
fromOffset = {topicPartition: 0}                # start from offset 0 in that partition
lines = KafkaUtils.createDirectStream(ssc, ['test'], {"bootstrap.servers": 'localhost:9092'}, fromOffsets=fromOffset)
lines.pprint()
ssc.start()
ssc.awaitTermination()
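Alternatively, if you just want every run to start from the earliest available offsets, without listing each partition yourself, the 0-8 direct stream also accepts the consumer setting auto.offset.reset in the Kafka parameters. A minimal sketch of that variant (same topic 'test' and broker as above, run inside the pyspark shell so sc already exists):

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
# "smallest" tells the 0.8 consumer to begin at the earliest available offset
# whenever no explicit fromOffsets are passed
lines = KafkaUtils.createDirectStream(
    ssc, ['test'],
    {"bootstrap.servers": 'localhost:9092',
     "auto.offset.reset": 'smallest'})
lines.pprint()
ssc.start()
ssc.awaitTermination()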
If your Kafka broker is version 0.10 or later, you should also consider using Structured Streaming instead of Spark Streaming. See the Structured Streaming documentation here and Structured Streaming with Kafka integration here.
Below is sample code that runs with Structured Streaming. Please use the jar version that matches your Kafka and Spark versions. I am using Spark 2.4.3 with Scala 2.11 and Kafka 0.10, so I use the jar spark-sql-kafka-0-10_2.11:2.4.3.
Start the pyspark shell:
pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test") \
.option("startingOffsets", "earliest") \
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream \
.format("console") \
.start()
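Inside the pyspark shell the console sink prints each micro-batch in the background. If you put the same logic in a standalone script (submitted with the same --packages option), you have to create the SparkSession yourself and block on the query, otherwise the driver exits immediately. A minimal sketch of such a script; the app name and topic are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredKafkaConsole").getOrCreate()

# read the topic from the earliest offset, exactly as in the shell example above
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .option("startingOffsets", "earliest") \
    .load()

query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()

# keep the driver alive until the streaming query is stopped
query.awaitTermination()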