Question

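We are using Spark for file processing. The files are fairly large: each is about 30 GB and has roughly 40 to 50 million lines. The files are formatted, and we load them into a DataFrame. The initial requirement was to identify the records that match a criteria and load them into MySQL. We were able to do that.

The requirement changed recently. Records that do not meet the criteria now have to be stored in an alternate database. This causes problems because the collection is too large. We tried to collect each partition independently and merge the results into a list, as suggested here

We are not familiar with Scala, so we were unable to convert that to Java. How can we iterate over the partitions one by one and collect?

Thanks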

Answer 1

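Please use df.foreachPartition. It runs independently for each partition and does not return anything to the driver, so you can save the matching results to the DB directly at the executor level. If you want to collect the results in the driver, use mapPartitions, which is not recommended in your case.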

Please refer to the link below

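// Needs: java.util.Iterator, org.apache.spark.sql.Row,
// org.apache.spark.api.java.function.ForeachPartitionFunction
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
            @Override
            public void call(Iterator<Row> r) throws Exception {
                // Runs on the executor, once per partition.
                while (r.hasNext()) {

                    Row row = r.next();
                    System.out.println(row.getString(1));

                }
                // do your business logic and load into MySQL.
            }
        });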
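For the changed requirement (matching rows go to MySQL, non-matching rows go to the alternate database), one option is to split the rows inside each partition and write them out in batches, so nothing has to be collected on the driver. Below is a minimal sketch of that idea in plain Java. PartitionRouter, route, and the two sinks are hypothetical names of mine, not part of Spark; the actual database writes are left to the sinks.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical helper, not part of Spark: walks one partition's iterator once
// and hands matching / non-matching rows to two sinks in fixed-size batches,
// so the whole partition is never held in memory.
final class PartitionRouter {

    private PartitionRouter() {}

    static <T> void route(Iterator<T> rows,
                          Predicate<T> matches,
                          int batchSize,
                          Consumer<List<T>> matchedSink,
                          Consumer<List<T>> unmatchedSink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> matched = new ArrayList<>();
        List<T> unmatched = new ArrayList<>();
        while (rows.hasNext()) {
            T row = rows.next();
            boolean isMatch = matches.test(row);
            List<T> buffer = isMatch ? matched : unmatched;
            buffer.add(row);
            if (buffer.size() == batchSize) {
                // Pass a copy so the sink may keep the list after we clear the buffer.
                (isMatch ? matchedSink : unmatchedSink).accept(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
        // Flush whatever is left in the partially filled buffers.
        if (!matched.isEmpty()) {
            matchedSink.accept(new ArrayList<>(matched));
        }
        if (!unmatched.isEmpty()) {
            unmatchedSink.accept(new ArrayList<>(unmatched));
        }
    }
}
```

Inside call(Iterator<Row> r) you could then call PartitionRouter.route(r, yourCriteria, 1000, mysqlBatchWriter, alternateDbBatchWriter), where each writer opens its own connection on the executor and does a batched insert. The batch size of 1000 is an arbitrary assumption, and the criteria and writers are placeholders for your own code.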

For mapPartitions:

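// Needs: java.io.Serializable, java.util.ArrayList, java.util.Iterator, java.util.List,
// org.apache.spark.api.java.function.MapPartitionsFunction,
// org.apache.spark.sql.Dataset, org.apache.spark.sql.Encoders, org.apache.spark.sql.Row

// You could keep using Row; a separate class is defined here only for clarity.
public class ResultEntry implements Serializable {
  // define your df properties ..
}


Dataset<ResultEntry> mappedData = data.mapPartitions(new MapPartitionsFunction<Row, ResultEntry>() {

  @Override
  public Iterator<ResultEntry> call(Iterator<Row> it) {
    // Holds the filtered rows of this one partition.
    List<ResultEntry> filteredResult = new ArrayList<ResultEntry>();
    while (it.hasNext()) {
      Row row = it.next();
      // somecondition is a placeholder for your matching criteria.
      if (somecondition) {
        filteredResult.add(convertToResultEntry(row));
      }
    }
    return filteredResult.iterator();
  }
}, Encoders.javaSerialization(ResultEntry.class));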
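One thing to watch with the mapPartitions version: it buffers all filtered rows of a partition in an ArrayList before returning. With 40 to 50 million rows per file that list can get large. A possible alternative is to return an iterator that filters and converts lazily, so rows are produced only as they are consumed. The sketch below is plain Java; FilteringIterator is a hypothetical helper of mine, not a Spark class.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical helper, not part of Spark: wraps a source iterator and yields
// only the converted elements that pass the predicate, one at a time.
final class FilteringIterator<I, O> implements Iterator<O> {
    private final Iterator<I> source;
    private final Predicate<I> keep;
    private final Function<I, O> convert;
    private I pending;          // next element that passed the filter, if any
    private boolean hasPending; // true when 'pending' holds a valid element

    FilteringIterator(Iterator<I> source, Predicate<I> keep, Function<I, O> convert) {
        this.source = source;
        this.keep = keep;
        this.convert = convert;
    }

    @Override
    public boolean hasNext() {
        // Advance the source until an element passes the filter or the source ends.
        while (!hasPending && source.hasNext()) {
            I candidate = source.next();
            if (keep.test(candidate)) {
                pending = candidate;
                hasPending = true;
            }
        }
        return hasPending;
    }

    @Override
    public O next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        I row = pending;
        pending = null;
        hasPending = false;
        return convert.apply(row);
    }
}
```

In the mapPartitions call above, the body of call could then become return new FilteringIterator<>(it, row -> somecondition, row -> convertToResultEntry(row)); with somecondition and convertToResultEntry still being the placeholders from the answer.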

Hope this helps.

Ravi

Spark - Collect partitions using foreachPartition
