Question

Is it possible to use mapPartitions in Spark Structured Streaming?

I am running into these errors.

Option 1:

dataframe_python.mapPartitions(processfunction)

'DataFrame' object has no attribute 'mapPartitions'

Option 2:

dataframe_python.rdd.mapPartitions(processfunction)

'Queries with streaming sources must be executed with writeStream.start();'

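Is there a way to use mapPartitions in my scenario? My intent is to transform an existing DataFrame into another DataFrame while minimizing calls to an external resource API by sending rows in batches.

For example, processfunction would look like this: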

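from pyspark.sql import Row

def processfunction(rows):
    # Materialize the partition so it can be sent in a single call
    batch = list(rows)
    results = call_external_resource(batch)
    # Pair each row with its result (assumes results come back in input order)
    for row, result in zip(batch, results):
        tmp_row = row.asDict()
        tmp_row["new_column"] = result
        yield Row(**tmp_row)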

Answer 1

With the PySpark API you may not be able to call mapPartitions directly on a DataFrame, whereas the Spark Scala API allows it.

If you are using Spark 2.4 or later, you can do something similar with foreachBatch.

def map_partition_func(rows):
    row_list = list(rows)
    for row in row_list:
        yield row

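def foreach_batch_func(df, epoch_id):
    # Transform and write batchDF
    # df is a static DataFrame holding one micro-batch, so df.rdd is allowed here.
    # mapPartitions is lazy: write the result out (or run another action) to execute it.
    mapPartOutput = df.rdd.mapPartitions(map_partition_func)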


mapPartitionsOutput = (inputDF
                       .writeStream
                       .foreachBatch(foreach_batch_func)
                       .trigger(processingTime='<trigger time>')
                       .start()
                      )
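To connect this back to the question, here is a minimal sketch of how the batching could look inside foreachBatch. It is an illustration under assumptions, not a definitive implementation: call_external_resource is a hypothetical stand-in for the external API, chunk_size and the output path are made-up values, and the new column is assumed to be a string. Rows are emitted as plain tuples with the result appended, and the output schema is built explicitly so that an empty micro-batch does not break schema inference.

```python
from itertools import islice


def call_external_resource(batch):
    # Hypothetical stand-in for the external API from the question:
    # takes a list of rows and returns one result per row, in input order.
    return ["processed-%d" % i for i, _ in enumerate(batch)]


def process_partition(rows, chunk_size=500):
    """Yield each input row as a tuple with one extra trailing value."""
    rows = iter(rows)
    while True:
        # Take at most chunk_size rows so a large partition is never
        # sent to the external resource in a single oversized request.
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        results = call_external_resource(chunk)  # one call per chunk
        for row, result in zip(chunk, results):
            # pyspark's Row is a tuple subclass, so tuple(row) works for it too.
            yield tuple(row) + (result,)


def foreach_batch_func(df, epoch_id):
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql.types import StringType, StructField, StructType

    # Input columns plus the new one (assumed to be a string column).
    out_schema = StructType(
        df.schema.fields + [StructField("new_column", StringType())]
    )
    out_df = df.rdd.mapPartitions(process_partition).toDF(out_schema)
    # Hypothetical sink; replace the path and format with your own.
    out_df.write.mode("append").parquet("/tmp/mappartitions_demo_output")
```

The function is passed to the stream the same way as above, with inputDF.writeStream.foreachBatch(foreach_batch_func).start(). Chunking inside the partition is a design choice: it keeps the number of external calls low while still bounding the size of each request.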

Spark Structured Streaming: is mapPartitions supported?
