
I have a task that needs to run on the results of DataFrame aggregations. In pseudocode, it looks like this:

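from pyspark.sql import functions as F

def perform_task(param1, param2):
  # Filter each DataFrame on param1, then count rows per value of the param2 column
  counts1 = someDF1.where(F.col('param1') == param1).groupBy(param2).count()
  counts2 = someDF2.where(F.col('param1') == param1).groupBy(param2).count()
  # Collect both aggregates to the driver and run the algorithm locally
  return algorithmResult(counts1.toPandas(), counts2.toPandas())

for param_set in all_params:
  print(perform_task(*param_set))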

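How do I properly parallelize the code above in Spark? Converting param_set to a parallelized collection and calling .map() will not work, because I access DataFrames inside the map function. So what is the "right" way to do this?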

I am pretty new to Spark, so any suggestions are welcome. Thanks!
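
One direction that might fit, sketched under assumptions: Spark can run several jobs at the same time when they are submitted from separate threads of the driver, so the loop could go through a driver-side thread pool instead of .map(). The helper name run_all, the max_workers value, and the stand-in fake_task below are all hypothetical and not part of the original code. In practice the real perform_task would be passed in.

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(task, all_params, max_workers=4):
    """Call task(*params) for every parameter set on driver-side threads.

    Results come back in the same order as all_params.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task, *params) for params in all_params]
        # result() blocks until that task finishes and re-raises its exception
        return [f.result() for f in futures]

# Hypothetical stand-in so the sketch runs without a Spark cluster;
# replace it with the real perform_task, which triggers Spark jobs.
def fake_task(param1, param2):
    return (param1, param2)

results = run_all(fake_task, [(1, "a"), (2, "b")])
```

Each thread only submits jobs and waits for them, so the heavy work still happens on the executors. How many jobs actually overlap depends on free cluster resources and the scheduler mode.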

How to parallelize tasks in Apache Spark

0 answers: