Multithreading on Structured Streaming: PySpark

Time: 2019-07-25 15:01:52

Tags: apache-spark pyspark spark-structured-streaming

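I am working with Structured Streaming. I have 15 Kafka topics and I read their JSON messages with readStream. Now I need to split the resulting DataFrame into 15 separate DataFrames based on the topic column.

I am running into a problem with the sample code below.

I am using multiprocessing: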

import multiprocessing as mp
from pyspark.sql.functions import col, from_json

df = spark \
      .readStream \
      .format("kafka") \
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2....") \
      .option("subscribe", "stream") \
      .option("startingOffsets", "earliest")  \
      .load()
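# Keep Kafka's topic column next to the parsed JSON payload so it can be filtered on later.
# delta_schema (the schema of the JSON payload) is defined elsewhere.
df1 = df.select(col("topic"), from_json(col("value").cast("string"), delta_schema).alias("value"))

df2 = df1.select("topic", "value.*")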

def func(df,table_name):
    df = df.filter((df.topic == table_name))
    df.show()

def foreach_batch_function(df, epoch_id):
    with mp.Pool(mp.cpu_count()) as p:
        p.starmap(func, [(df,'x1'),(df,'x2').....])

df2.writeStream.foreachBatch(foreach_batch_function).start()

The error I get:

AttributeError: Can't pickle local object 'Streaming.main.<locals>.func'
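
For context on the message above: `multiprocessing.Pool` sends the target function to its worker processes by pickling it, and `pickle` stores a function only as a reference to its importable name. A function defined inside another function (here `func` inside `Streaming.main`) has no such name, so pickling fails. The following is a minimal sketch that reproduces this without Spark; `make_local_func` and `try_pickle` are hypothetical names introduced only for illustration.

```python
import pickle


def module_level_func(table_name):
    # Defined at module level, so pickle can refer to it by name.
    return f"processing {table_name}"


def make_local_func():
    # func is local to make_local_func, like func inside Streaming.main.
    def func(table_name):
        return f"processing {table_name}"
    return func


def try_pickle(obj):
    """Return "ok" if obj can be pickled, otherwise the error text."""
    try:
        pickle.dumps(obj)
        return "ok"
    except (AttributeError, pickle.PicklingError) as exc:
        return f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print(try_pickle(module_level_func))
    print(try_pickle(make_local_func()))
```

Moving `func` to module level would likely remove this particular error, but the DataFrame argument would then have to be pickled as well, and a DataFrame is a handle to state held by the driver rather than plain data, so it should not be expected to survive the trip to another process.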

0 Answers:

No answers yet.
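
One possible direction, offered as a sketch under stated assumptions rather than a confirmed fix: inside `foreachBatch` the callback receives an ordinary (non-streaming) DataFrame that lives in the driver process, so the per-topic work could be fanned out with threads instead of processes. Threads share the driver's memory, so nothing has to be pickled, and Spark's scheduler accepts jobs submitted from several threads. The names `fan_out`, `handle_topic`, and `TOPICS` are hypothetical, and counting rows stands in for whatever per-topic write is actually needed.

```python
from concurrent.futures import ThreadPoolExecutor

# Topic names taken from the question; the rest of the list is elided there.
TOPICS = ["x1", "x2"]


def fan_out(batch, topics, handle_topic, max_workers=4):
    """Call handle_topic(batch, topic) for every topic on a thread pool.

    Threads run inside the driver process, so nothing is pickled.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {topic: pool.submit(handle_topic, batch, topic) for topic in topics}
        # result() re-raises any exception raised in the worker thread.
        return {topic: future.result() for topic, future in futures.items()}


def handle_topic(batch_df, topic):
    # Hypothetical per-topic work: count the rows; a real job would write them out.
    return batch_df.filter(batch_df.topic == topic).count()


def foreach_batch_function(batch_df, epoch_id):
    # Cache the micro-batch so it is not recomputed for every topic.
    batch_df.persist()
    try:
        fan_out(batch_df, TOPICS, handle_topic)
    finally:
        batch_df.unpersist()
```

The callback would be registered exactly as in the question, with `df2.writeStream.foreachBatch(foreach_batch_function).start()`. Because `fan_out` only calls the handler it is given, it can be tried without Spark by passing a list of dictionaries and a plain Python handler.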