I'm working with Structured Streaming. I have 15 topics, and I'm using readStream to read the JSON messages from them. Now I need to split the DataFrame into 15 different DataFrames using the topic column. I'm running into a problem with the sample code below, where I'm using multiprocessing.
import multiprocessing as mp
from pyspark.sql.functions import from_json, col

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2....") \
    .option("subscribe", "stream") \
    .option("startingOffsets", "earliest") \
    .load()

df1 = df.select(from_json(col("value").cast("string"), delta_schema).alias("data")).select("data.*")
df2 = df1.select("topic", "value.*")

# Filter the micro-batch down to the rows belonging to one topic/table
def func(df, table_name):
    df = df.filter(df.topic == table_name)
    df.show()

def foreach_batch_function(df, epoch_id):
    with mp.Pool(mp.cpu_count()) as p:
        p.starmap(func, [(df, 'x1'), (df, 'x2').....])

df2.writeStream.foreachBatch(foreach_batch_function).start()
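For context on what `mp.Pool` expects here: when work is sent to a pool, the worker function is serialized with pickle, and pickle can only store a *reference* (module plus qualified name) to a function, so the function must live at the top level of a module. A minimal Spark-free sketch (the function, names, and values are made up for illustration):

```python
import pickle

# Top-level function: pickle (and hence multiprocessing) records a reference
# to it by module and qualified name, so it round-trips successfully.
def func(value, table_name):
    return (table_name, value * 2)

# Unpickling the reference resolves back to the very same function object.
restored = pickle.loads(pickle.dumps(func))
print(restored(1, "x1"))  # → ('x1', 2)
```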
The error I get:
AttributeError: Can't pickle local object 'Streaming.main.<locals>.func'
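The error can be reproduced without Spark at all, which suggests it comes from where `func` is defined rather than from the streaming code. In the snippet above, `func` is defined inside another function (a "local object"), and pickle cannot serialize such functions by reference; a sketch of that failure mode (the inner function here is a stand-in, not the actual code):

```python
import pickle

# Reproduction of the error: a function defined inside another function
# is a "local object" that pickle cannot reference by qualified name,
# which is exactly what multiprocessing tries to do with the worker.
def main():
    def func(df, table_name):
        return table_name

    try:
        pickle.dumps(func)
        return None
    except (AttributeError, pickle.PicklingError) as e:
        return str(e)

print(main())
```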