Question

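I want to see the data available in a Spark Streaming DataFrame, and in a later step I want to perform business operations on that data.

So far, I have tried converting the streaming DataFrame to an RDD. Once the object is converted to an RDD, I want to apply a function that transforms the data and creates new columns using a schema (specific to each message).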

dsraw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_kafka_server) \
    .option("subscribe", topic) \
    .load() \
    .selectExpr("CAST(value AS STRING)")


print "type (df_stream)", type(dsraw)
print "schema (dsraw)", dsraw.printSchema()


def show_data_fun(dsraw, epoch_id):
    dsraw.show()

    row_rdd = dsraw.rdd.map(lambda row: literal_eval(dsraw['value']))
    json_data = row_rdd.collect()

    print "From rdd : ", type(json_data)
    print "From rdd : ", json_data[0]
    print "show_data_function_call"


jsonDataQuery = dsraw \
    .writeStream \
    .foreach(show_data_fun)\
    .queryName("df_value")\
    .trigger(continuous='1 second')\
    .start()

Expected result: print the first JSON message in the stream.

Why does the ForEach sink in Spark Structured Streaming not call the function (show_data_fun)?
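One possible explanation, offered as a sketch rather than a confirmed diagnosis: in PySpark, `foreach` expects a function that takes a single row (or an object with a `process` method), whereas a handler with the signature `(DataFrame, epoch_id)` matches what `foreachBatch` expects. `foreachBatch` relies on micro-batch execution, so it does not work with a continuous trigger. The lambda in the question also reads `dsraw['value']` (a Column) instead of the row's own value. The version below assumes Python 3 and Spark 2.4 or later, assumes each Kafka value is valid JSON (so `json.loads` is swapped in for `literal_eval`), and introduces two hypothetical helpers, `parse_values` and `start_query`, purely for illustration.

```python
import json


def parse_values(values):
    """Decode each raw Kafka value (assumed to be a JSON string)."""
    return [json.loads(value) for value in values]


def show_data_fun(batch_df, epoch_id):
    """Per-batch handler; batch_df is a regular (non-streaming) DataFrame."""
    # collect() pulls the whole micro-batch to the driver: debugging only.
    values = [row["value"] for row in batch_df.collect()]
    messages = parse_values(values)
    if messages:
        print("epoch", epoch_id, "first message:", messages[0])
    return messages


def start_query(dsraw):
    """Attach the handler with foreachBatch and a micro-batch trigger."""
    return (
        dsraw.writeStream
        .foreachBatch(show_data_fun)
        .queryName("df_value")
        .trigger(processingTime="1 second")  # replaces continuous='1 second'
        .start()
    )
```

With the streaming DataFrame from the question, this would be started as `query = start_query(dsraw)` followed by `query.awaitTermination()`. Because `foreachBatch` invokes the handler on the driver, the printed output should appear in the driver's console rather than in executor logs.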

0 answers: