How to parse a JSON string column and create a DataFrame from a pyspark DataStreamReader

Asked: 2019-02-15 01:21:49

Tags: pyspark pyspark-sql spark-structured-streaming spark-streaming-kafka

I am reading messages from a Kafka topic:

messageDFRaw = spark.readStream\
                    .format("kafka")\
                    .option("kafka.bootstrap.servers", "localhost:9092")\
                    .option("subscribe", "test-message")\
                    .load()

messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as dict")

When I print the DataFrame from the query above, I get the console output below.

|key|dict|
|#badbunny |{"channel": "#badbunny", "username": "mgat22", "message": "cool"}|
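(For reference: a streaming DataFrame cannot be printed with show(). A minimal sketch of a console sink that would produce output like the above; the sink options are assumptions, not part of the original question:)

query = messageDF.writeStream\
                 .format("console")\
                 .option("truncate", "false")\
                 .start()

query.awaitTermination()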

How can I create a DataFrame from the DataStreamReader so that its columns are |key|channel|username|message|?

I tried to follow the accepted answer in How to read records in JSON format from Kafka using Structured Streaming?:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

struct = StructType([
    StructField("channel", StringType()),
    StructField("username", StringType()),
    StructField("message", StringType()),
])

messageDFRaw.select(from_json("CAST(value AS STRING)", struct))

However, I got the warning Expected type 'StructField', got 'StructType' instead from from_json().

1 Answer:

Answer 0 (score: 0):

I ignored the warning Expected type 'StructField', got 'StructType' instead from from_json().

However, I had to cast the value from the Kafka message to a string first, and only then parse the JSON against the schema.

messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# struct_schema is the StructType defined in the question
messageParsedDF = messageDF.select(from_json("value", struct_schema).alias("message"))

messageFlattenedDF = messageParsedDF.selectExpr("message.channel", "message.username", "message.message")
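For completeness, here is a minimal end-to-end sketch of the working pipeline under the same assumptions as above (topic test-message on localhost:9092, console sink for verification); the app name and variable names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

# Illustrative app name; requires the spark-sql-kafka-0-10 package on the classpath.
spark = SparkSession.builder.appName("kafka-json-parse").getOrCreate()

struct_schema = StructType([
    StructField("channel", StringType()),
    StructField("username", StringType()),
    StructField("message", StringType()),
])

messageDFRaw = spark.readStream\
                    .format("kafka")\
                    .option("kafka.bootstrap.servers", "localhost:9092")\
                    .option("subscribe", "test-message")\
                    .load()

# Cast the Kafka key/value columns from binary to string first ...
messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# ... then parse the JSON payload with the schema and flatten the struct.
messageFlattenedDF = messageDF\
    .select("key", from_json("value", struct_schema).alias("message"))\
    .selectExpr("key", "message.channel", "message.username", "message.message")

query = messageFlattenedDF.writeStream\
                          .format("console")\
                          .option("truncate", "false")\
                          .start()
query.awaitTermination()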