Question

PCollection<String> p1 = {"a","b","c"}

PCollection<KV<Integer,String>> p2 = p1.apply("some operation")
// {(1,"a"),(2,"b"),(3,"c")}

I need this to scale to large files, the way it does in Apache Spark, where it works like this:

sc.textFile("./filename").zipWithIndex

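My goal is to preserve the order of the lines in a large file by assigning line numbers in a scalable way.

How can I get this result with Apache Beam?

Some related posts: zipWithIndex on Apache Flink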

Ranking pcollection elements

Answer 1

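There is no built-in way to do this. PCollections in Beam are unordered, may be unbounded, and are processed in parallel on multiple workers. The fact that a PCollection comes from a source with a known order cannot be observed directly in the Beam model. I think the simpler approach is to preprocess the file before consuming it in the Beam pipeline.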
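As a rough illustration of that preprocessing step, here is a minimal sketch in plain Java (no Beam API involved). The class name `LineNumberPreprocessor`, the method `addLineNumbers`, and the tab-separated output format are all assumptions made for this example; the numbering is 1-based to match the example in the question. It reads the file sequentially on one machine, so it trades scalability for simplicity.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Beam: stamps each line with its 1-based number
// so that the order survives once the file is read into an unordered PCollection.
final class LineNumberPreprocessor {
    private LineNumberPreprocessor() {}

    /** Copies input to output, prefixing each line with "<lineNumber>\t". Returns the line count. */
    static long addLineNumbers(Path input, Path output) throws IOException {
        long lineNumber = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            // Single sequential pass: the file order is the only source of truth here.
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                writer.write(Long.toString(lineNumber));
                writer.write('\t');
                writer.write(line);
                writer.newLine();
            }
        }
        return lineNumber;
    }
}
```

The pipeline can then read the numbered file and split each line on the first tab to recover a (line number, text) pair.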

Answer 2

(Copying my response from user@beam.apache.org)

This is interesting. So if I understand your algorithm, it would be something like (pseudocode):

A = ReadWithShardedLineNumbers(myFile) : output K<ShardOffset+LocalLineNumber>, V<Data>
B = A.ExtractShardOffsetKeys() : output K<ShardOffset>, V<LocalLineNumber>
C = B.PerKeySum() : output K<ShardOffset>, V<ShardTotalLines>
D = C.GlobalSortAndPrefixSum() : output K<ShardOffset> V<ShardLineNumberOffset>
E = [A,D].JoinAndCalculateGlobalLineNumbers() : output V<GlobalLineNumber+Data>

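This rests on two assumptions:

1. ReadWithShardedLineNumbers: the source can output its shard offset, and the offsets are globally ordered
2. GlobalSortAndPrefixSum: the totals for all read shards fit in memory, so that a total sort can be performed

Assumption 2 will not hold for all data sizes, and how well it holds depends on the granularity of the read shards. But it seems feasible for some practical subset of file sizes.

Also, I believe the pseudocode above can be expressed in Beam and does not require SplittableDoFn.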
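To make the data flow of the pseudocode concrete, here is a minimal in-memory sketch in plain Java. It is not Beam code and does not use any Beam API: the `Shard` type, the class name `ShardedLineNumbers`, and the method `assignGlobalLineNumbers` are hypothetical names introduced for this example, and the shards are passed in as a list rather than read from a real source. Line numbers are 1-based, as in the question's example.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of steps A to E from the pseudocode; all names are hypothetical.
final class ShardedLineNumbers {
    private ShardedLineNumbers() {}

    /** Step A output for one read shard: its starting offset plus its lines in local order. */
    static final class Shard {
        final long offset;
        final List<String> lines;

        Shard(long offset, List<String> lines) {
            this.offset = offset;
            this.lines = lines;
        }
    }

    /** Shards may arrive in any order; each result pairs a 1-based global line number with a line. */
    static List<Map.Entry<Long, String>> assignGlobalLineNumbers(List<Shard> shards) {
        // Steps B and C: total line count per shard offset.
        // The TreeMap keeps keys sorted, which stands in for the global sort (assumption 2).
        TreeMap<Long, Long> countByOffset = new TreeMap<>();
        for (Shard shard : shards) {
            countByOffset.merge(shard.offset, (long) shard.lines.size(), Long::sum);
        }

        // Step D: prefix sum over the sorted offsets gives the number of lines before each shard.
        Map<Long, Long> linesBeforeShard = new HashMap<>();
        long running = 0;
        for (Map.Entry<Long, Long> entry : countByOffset.entrySet()) {
            linesBeforeShard.put(entry.getKey(), running);
            running += entry.getValue();
        }

        // Step E: join each shard with its base and add the local line position.
        List<Map.Entry<Long, String>> numbered = new ArrayList<>();
        for (Shard shard : shards) {
            long base = linesBeforeShard.get(shard.offset);
            for (int local = 0; local < shard.lines.size(); local++) {
                numbered.add(new SimpleImmutableEntry<Long, String>(
                        base + local + 1, shard.lines.get(local)));
            }
        }
        return numbered;
    }
}
```

In an actual pipeline, the per-shard counts would presumably come from a per-key combine and the small offset table would be shared with the join step as a side input, but that mapping is my reading of the pseudocode rather than something the answer spells out.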

How can I implement something like Spark's zipWithIndex in Apache Beam?
