I want to receive JSON strings from MQTT and parse them into a DataFrame df. How can I do that?
Here is an example of the JSON message I send to the MQTT queue for processing in Spark:
{
"id": 1,
"timestamp": 1532609003,
"distances": [2,5,7,8]
}
Here is my code:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Test") \
.master("local[4]") \
.getOrCreate()
# Custom Structured Streaming receiver
reader = spark\
.readStream\
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")\
.option("topic","uwb/distances")\
.option('brokerUrl', 'tcp://127.0.0.1:1883')\
.load()\
.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS STRING)")
df = spark.read.json(reader.select("value").rdd)
# Start running the query that prints the running counts to the console
query = df \
.writeStream \
.format('console') \
.start()
query.awaitTermination()
But this code fails:
py4j.protocol.Py4JJavaError: An error occurred while calling o45.javaToPython.
: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
mqtt
I tried adding start as follows:
df = spark.read.json(reader.select("value").rdd) \
.writeStream \
.format('console') \
.start()
But I got the same error. My goal is to obtain a DataFrame df that I can pass on to further ETL processing.
UPDATE:
The thread marked as the answer did not help me solve the problem. First, it gives a solution for Scala, while I am using PySpark. Second, I tested the solution proposed in the answer, and it returned an empty json column:
from pyspark.sql.functions import from_json, col
reader = spark\
.readStream\
.schema(spark.read.json("mqtt_schema.json").schema) \
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")\
.option("topic","uwb/distances")\
.option('brokerUrl', 'tcp://127.0.0.1:1883')\
.load()\
.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS STRING)")
json_schema = spark.read.json("mqtt_schema.json").schema
df = reader.withColumn('json', from_json(col('value'), json_schema))
query = df \
.writeStream \
.format('console') \
.start()
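For reference, a minimal debugging sketch that should expose the mismatch (assuming the reader defined above); from_json silently yields null for every row whose payload does not match the supplied schema, so an all-null json column usually means the schema in mqtt_schema.json does not line up with the incoming messages:
# Print the schema inferred from mqtt_schema.json so it can be
# compared against the actual payload structure.
json_schema = spark.read.json("mqtt_schema.json").schema
print(json_schema.simpleString())
# Dump the raw string values to the console to inspect what actually
# arrives on the topic before any parsing is applied.
debug_query = reader.select("value") \
    .writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()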
Answer 0 (score 0):
You have to use from_json or an equivalent method. If the documents have the structure shown in the question:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import *
schema = StructType([
StructField("id", LongType()),
StructField("timestamp", LongType()),
StructField("distances", ArrayType(LongType()))
])
ds = spark.readStream.load(...)
ds.withColumn("value", from_json(col("value").cast("string"), schema))
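As a rough sketch of how the parsed struct can then be flattened and consumed, building on the ds above (the field names are taken from the sample message in the question; a streaming DataFrame can only be materialized through writeStream.start(), never through spark.read or .rdd):
parsed = ds.withColumn("value", from_json(col("value").cast("string"), schema))
# Flatten the parsed struct into top-level columns and start the
# streaming query on the console sink.
query = parsed.select(
        col("value.id").alias("id"),
        col("value.timestamp").alias("timestamp"),
        col("value.distances").alias("distances")) \
    .writeStream \
    .format("console") \
    .start()
query.awaitTermination()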
Answer 1 (score 0):
I suppose it's because your df is not streaming. Try:
reader.select("value").writeStream
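Putting the two pieces together, a minimal end-to-end sketch, assuming the reader defined in the question and the schema from the other answer:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, LongType, ArrayType

schema = StructType([
    StructField("id", LongType()),
    StructField("timestamp", LongType()),
    StructField("distances", ArrayType(LongType()))
])
# Parse each MQTT payload with from_json and keep the whole pipeline
# streaming end to end; spark.read.json and .rdd cannot be applied to
# a streaming source.
df = reader.select(from_json(col("value"), schema).alias("json")) \
    .select("json.*")
query = df.writeStream \
    .format("console") \
    .start()
query.awaitTermination()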