How to customize splits and partitioning in Spark

Date: 2019-02-04 01:11:37

Tags: apache-spark hadoop rdd

How do I customize splits when repartitioning in Spark? I am looking for the Spark equivalent of MapReduce's hasMoreKeyValue() / nextKeyValue(): a way to customize the split boundaries in parallel processing across transformations, and to customize the iteration over the next value fed to a transformation.
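Spark's RDD API has no direct hasMoreKeyValue()/nextKeyValue() hook; record boundaries are decided at read time by the Hadoop InputFormat, and for simple cases the Hadoop configuration key `textinputformat.record.delimiter` changes where a record ends. A minimal sketch of the idea in plain Python follows; the `split_records` helper, the file name, and the `@closing` delimiter are illustrative assumptions, not Spark APIs:

```python
# Changing the record boundary at read time (untested PySpark sketch):
#
#   rdd = sc.newAPIHadoopFile(
#       "tickets.txt",
#       "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
#       "org.apache.hadoop.io.LongWritable",
#       "org.apache.hadoop.io.Text",
#       conf={"textinputformat.record.delimiter": "@closing"})
#
# The splitting rule that conf key applies is essentially this:

def split_records(text, delimiter):
    """Split raw text into records on a custom delimiter, dropping empties."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]

raw = "t1 @Problem ... @Action ... @closing t2 @Problem ... @Action ... @closing"
records = split_records(raw, "@closing")
# records[0] is the whole first ticket, records[1] the whole second ticket.
```

With the delimiter set to a closing annotation, each record handed to a transformation is a complete ticket rather than a single line.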

Example scenario:
The data is like ticket threads: each thread is a chain of elements, and each element has 3 parts.

1. Problem OR hand-over recommendation, annotated as @Problem or @HandOver
2. Action taken so far, annotated as @Action
3. Hand-over recommendation OR closing note, annotated as @HandOver or @closing
The tail of an element, if it is not a closing note, is the head of the next element. Similarly, the head of an element, if it is not annotated as @Problem, is the tail of the previous element.

In other words, a @HandOver part can act as the tail of one element and the head of the next.
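The chaining rule above can be sketched as a plain function that walks the annotated parts of one thread and emits elements of (head, action, tail), advancing by two parts when the tail is a shared @HandOver. The function name and the (annotation, text) pair representation are illustrative assumptions:

```python
def build_elements(parts):
    """Group one thread's annotated parts into elements of (head, action, tail).

    `parts` is a list of (annotation, text) pairs in thread order.  A
    "@HandOver" tail is shared: it is also the head of the next element,
    so the walk advances by 2 parts instead of 3.  A "@closing" tail
    ends the thread.
    """
    elements = []
    head = 0
    while head + 3 <= len(parts):
        elements.append(tuple(parts[head:head + 3]))
        tail_tag = parts[head + 2][0]
        if tail_tag == "@HandOver":
            head += 2   # shared part: this tail is the next element's head
        else:
            break       # "@closing" ends the thread
    return elements

thread = [("@Problem", "p1"), ("@Action", "a1"), ("@HandOver", "h1"),
          ("@Action", "a2"), ("@closing", "c1")]
# Yields two elements; the @HandOver part appears in both.
```

Applied per thread (for example inside mapPartitions once each partition holds whole threads), this gives one record per complete element.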

I have already repartitioned the data that was loaded from a text file into an RDD. Each partition may contain multiple elements, but it should not contain partial elements.
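One way to keep partial elements out of a partition is to key each element by its thread id and partition by that key, so every element of a thread lands in the same partition. A minimal sketch of such a partition function, with the function name and usage hypothetical:

```python
def thread_partition(thread_id, num_partitions):
    """Map a thread id to a partition index.

    Every element of the same thread gets the same index, so no
    partition ever sees only part of a thread's element chain.
    """
    return hash(thread_id) % num_partitions

# Hypothetical usage once the RDD holds (thread_id, element) pairs:
#   n = 8
#   partitioned = rdd.partitionBy(n, lambda tid: thread_partition(tid, n))
```

The choice to partition by thread id (rather than by raw line count) is what guarantees whole elements per partition, at the cost of possible skew if threads vary greatly in length.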

0 Answers:

There are no answers yet.