Question

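Is it possible to run foreach on a DataFrame in such a way that I can get a Dataset back? I have a requirement that can only be met by processing the records in order, so I am using foreach on the DataFrame, but I need to create a new Dataset from the results so that I can write it to a Parquet output file. This pseudocode is what I am trying to accomplish: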

dataframe.foreachPartition(
  it => {
    // process records . . .
    // write the results from this partition to a file for aggregation later
    sparkSession.write . . .
  }
);
// read a dataframe containing all the data sets written by the tasks
sparkSession.read . . .

I know this is sparse, but it sums up what I need to do. Calling sparkSession.write inside foreach is not allowed, so I am wondering whether there is another way.

Answer 1

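In fact, you cannot access a DataFrame or Dataset inside foreachPartition. Like other Spark entities such as the session, DataFrames and Datasets are only available in driver code.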
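If the goal is a new Dataset built from the processed records, one option that stays within this constraint is mapPartitions, which runs a function over each partition's iterator on the executors and hands a Dataset back to driver code. This is not the approach described in this answer, only a minimal sketch under stated assumptions: the Event and Result case classes, the running-total logic, and the output path are all hypothetical stand-ins.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types, used only for illustration.
case class Event(id: Long, value: Double)
case class Result(id: Long, runningTotal: Double)

object SequentialPerPartition {
  // Plain Scala: walks one partition's records in iterator order,
  // carrying state (a running total) from one record to the next.
  def process(events: Iterator[Event]): Iterator[Result] = {
    var total = 0.0
    events.map { e =>
      total += e.value
      Result(e.id, total)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sequential-per-partition")
      .master("local[*]") // assumption: local run for the sketch
      .getOrCreate()
    import spark.implicits._

    val events: Dataset[Event] =
      Seq(Event(1L, 1.0), Event(2L, 2.5), Event(3L, 4.0)).toDS()

    // Executed on the executors, but the result is an ordinary Dataset.
    val results: Dataset[Result] = events.mapPartitions(process _)

    // The write is issued from driver code, where the session is available.
    results.write.mode("overwrite").parquet("/tmp/results.parquet") // placeholder path

    spark.stop()
  }
}
```

Records are only processed in order within each partition, so if the order matters across the whole data set, the data would need to be partitioned and sorted accordingly first (for example with sortWithinPartitions).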

That said, the partition's data is accessible inside foreachPartition, so you can generate the Parquet files directly there with the Hadoop API:

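import org.apache.spark.sql.Row

dfB.repartition(2).foreachPartition { (iter: Iterator[Row]) =>
  // runs on the executors: the partition's rows are available here
  iter.foreach(row => println(row))
}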
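The snippet above only prints the rows. A rough sketch of what writing from inside foreachPartition through the Hadoop FileSystem API might look like is given below. To keep it short it writes plain text lines rather than Parquet (a real Parquet file would need a Parquet writer library on top of this), and the output directory, the partFileName helper, and the sample data are hypothetical. It also builds a fresh Hadoop Configuration on the executor, which may differ from the configuration your cluster actually uses.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.TaskContext
import org.apache.spark.sql.{Row, SparkSession}

object WritePerPartition {
  // Hypothetical helper: one output file per partition id.
  def partFileName(dir: String, partitionId: Int): String =
    s"$dir/part-$partitionId.txt"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-per-partition")
      .master("local[2]") // assumption: local run for the sketch
      .getOrCreate()
    import spark.implicits._

    val outDir = "/tmp/per-partition-output" // placeholder path
    val dfB = Seq(1, 2, 3, 4).toDF("n")      // stand-in for the answer's dfB

    dfB.repartition(2).foreachPartition { (iter: Iterator[Row]) =>
      // Executor side: no SparkSession here, only this partition's rows.
      val path = new Path(partFileName(outDir, TaskContext.getPartitionId()))
      val fs = path.getFileSystem(new Configuration())
      val out = fs.create(path, true) // overwrite, so a retried task replaces its file
      try {
        iter.foreach { row =>
          out.write((row.mkString(",") + "\n").getBytes(StandardCharsets.UTF_8))
        }
      } finally {
        out.close()
      }
    }

    // Driver side: read everything the tasks wrote into one DataFrame.
    val combined = spark.read.text(outDir)
    combined.show()

    spark.stop()
  }
}
```

Naming each file after the partition id keeps tasks from overwriting one another, and the driver can then read the whole directory back once foreachPartition has finished.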

Another thread describes this problem and its solution.

Good luck.
