假设我有一个RDD,其密钥类似于[1, 2, 3, 4, 5...]
,现在我想将记录分组成几个区间来并行处理它们,但我想要将它们分开的组相互交叉。
例如,我想将它们分组为以下几组:[1~10],[5~15],[10~20] ......,这样[1~10]和[5~ 15]将需要记录7
进行处理。
我该怎么做?
答案 0 :(得分:1)
也许重新计算密钥?
class Bucket:
def __init__(self, min_, max_):
self._min = min_
self._max = max_
def __str__(self):
return "{0}~{1}".format(self._min, self._max)
def get_bucket_or_none(self, key):
if self._min <= key <= self._max:
return str(self)
else:
return None
def make_new_key_using_bucket_list(buckets_list, x):
return_list = []
for bucket in buckets_list:
new_key = bucket.get_bucket_or_none(x[0])
if new_key is not None:
return_list.append( (new_key, x[1]))
return return_list
rdd = sc.parallelize([(1, "A"), (5, "B"), (10, "C"), (10, "D"), (5, "E"),
(10, "F"), (7, "G"), (14, "H"), (18, "I"), (23, "J")])
buckets_list = [Bucket(1, 10), Bucket(5, 15), Bucket(10, 20)]
rdd_new_keys = rdd.flatMap(lambda x: make_new_key_using_bucket_list(buckets_list, x))
print rdd_new_keys.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('10~20',['C','D','F','H','I']),('5~15',['B','C', 'D','E','F','G','H']),('1~10',['A','B','C','D','E', 'F','G'])]