I am fairly new to Apache Spark and Python, and I'm wondering whether what I'm about to describe is feasible.
I have an RDD of the form [m1, m2, m3, m4, m5, m6, ..., mn] (this is what you get when you run rdd.collect()). I would like to know whether it is possible to transform this RDD into another RDD of the form [(m1, m2, m3), (m4, m5, m6), ..., (mn-2, mn-1, mn)]. The inner tuples should be of size k, and if n is not divisible by k, one of the tuples should have fewer than k elements.
I tried using the map function but could not get the desired output. It seems that map can only return an RDD with the same number of elements as the RDD it was given.
UPDATE: I tried using partitions and was able to get it to work.
rdd.map(lambda l: (l, l)).partitionBy(int(n/k)).glom().map(lambda ll: [x[0] for x in ll])
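For reference, a self-contained sketch of that snippet; the input data, n and k below are hypothetical placeholders, and because partitionBy distributes records by key hash, the resulting groups are not guaranteed to follow the original order or to contain exactly k elements.
rdd = sc.parallelize(["m1", "m2", "m3", "m4", "m5", "m6", "m7"])  # hypothetical input
k = 3                  # desired group size
n = rdd.count()        # total number of elements
groups = (rdd.map(lambda l: (l, l))                 # key every element by itself
             .partitionBy(int(n / k))               # hash into roughly n/k partitions
             .glom()                                # one list per partition
             .map(lambda ll: [x[0] for x in ll]))   # drop the duplicated keys
print(groups.collect())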
Answer 0 (score: 3)
Olologin's answer is almost there, but I believe what you want to do is group your RDD into tuples of size 3 rather than group your RDD into 3 groups of tuples. To do the former, try the following:
rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
transformed = rdd.zipWithIndex().groupBy(lambda (_, i): i / 3) \
                 .map(lambda (_, list): tuple([elem[0] for elem in list]))
When I run this in pyspark, I get the following:
>>> from __future__ import print_function
>>> rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
>>> transformed = rdd.zipWithIndex().groupBy(lambda (_, i): i / 3).map(lambda (_, list): tuple([elem[0] for elem in list]))
>>> transformed.foreach(print)
...
('e4', 'e5', 'e6')
('e10',)
('e7', 'e8', 'e9')
('e1', 'e2', 'e3')
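As a side note, the tuple-unpacking lambdas used above are Python 2-only syntax. A rough Python 3-compatible sketch of the same zipWithIndex/groupBy idea (with k as the group size) could look like this:
k = 3
rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
transformed = (rdd.zipWithIndex()                               # (element, index) pairs
                  .groupBy(lambda pair: pair[1] // k)           # bucket = index // k
                  .map(lambda kv: tuple(e for e, _ in kv[1])))  # keep the elements only
print(transformed.collect())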
Answer 1 (score: 2)
I assume you are using the pyspark API. I don't know whether this is the best solution, but I think it can be done with zipWithIndex, groupBy, and a simple map.
# 3 - your grouping k
# ci - list of tuples (char, idx)
rdd = sc.parallelize(["a", "b", "c", "d", "e"]).zipWithIndex()\
.groupBy(lambda (char, idx): idx/3 )\
.map(lambda (remainder, ci):tuple([char for char, idx in ci]))\
.collect()
print rdd
Output:
[('a', 'b', 'c'), ('d', 'e')]
UPD: Thanks to @Rohan Aletty for correcting me.
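One caveat: as far as I know, groupBy does not guarantee the order of elements inside a group, so if the original order matters it is safer to sort each group by its index before building the tuple. A hedged sketch using the same input as above:
k = 3
result = (sc.parallelize(["a", "b", "c", "d", "e"])
            .zipWithIndex()
            .groupBy(lambda pair: pair[1] // k)
            .map(lambda kv: tuple(c for c, _ in sorted(kv[1], key=lambda p: p[1])))
            .collect())
print(result)  # expected: [('a', 'b', 'c'), ('d', 'e')] (group order may vary)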
Answer 2 (score: 1)
It is possible to handle this without a shuffle (groupBy), but it requires more code compared to the solutions from Olologin and Rohan Aletty. The whole idea is to transfer only the parts needed to keep continuity between partitions:
from functools import reduce  # built in on Python 2, explicit import for Python 3
from toolz import partition, drop, take, concatv


def grouped(self, n, pad=None):
    """
    Group RDD into tuples of size n
    >>> rdd = sc.parallelize(range(10))
    >>> grouped(rdd, 3).collect()
    [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
    """
    assert isinstance(n, int)
    assert n > 0

    def _analyze(i, iter):
        """
        Given partition idx and iterator return a tuple
        (idx, number-of-elements, prefix-of-size-(n-1))
        """
        xs = [x for x in iter]
        return [(i, len(xs), xs[:n - 1])]

    def _compact(prefixes, prefix):
        """
        'Compact' a list of prefixes to compensate for
        partitions with less than (n-1) elements
        """
        return prefixes + [(prefix + prefixes[-1])[:n - 1]]

    def _compute(prvs, cnt):
        """
        Compute number of elements to drop from current and
        take from the next partition given previous state
        """
        left_to_drop, _to_drop, _to_take = prvs[-1]
        diff = cnt - left_to_drop
        if diff <= 0:
            return prvs + [(-diff, cnt, 0)]
        else:
            to_take = (n - diff % n) % n
            return prvs + [(to_take, left_to_drop, to_take)]

    def _group_partition(i, iter):
        """
        Return grouped entries for a given partition
        """
        (_, to_drop, to_take), next_head = heads_bd.value[i]
        return partition(n, concatv(
            drop(to_drop, iter), take(to_take, next_head)), pad=pad)

    if n == 1:
        return self.map(lambda x: (x, ))

    # For every partition collect (index, element count, first n - 1 elements) ...
    idxs, counts, prefixes = zip(
        *self.mapPartitionsWithIndex(_analyze).collect())

    # ... then broadcast, per partition, how many elements to drop locally and
    # how many to borrow from the prefix of the following partition.
    heads_bd = self.context.broadcast({x[0]: (x[1], x[2]) for x in zip(idxs,
        reduce(_compute, counts, [(0, None, None)])[1:],
        reduce(_compact, prefixes[::-1], [[]])[::-1][1:])})

    return self.mapPartitionsWithIndex(_group_partition)
It relies heavily on the excellent toolz library, but if you want to avoid the external dependency you can easily rewrite it using the standard library.
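For what it's worth, a rough sketch of what such standard-library replacements could look like; they mirror the toolz helpers only as far as this snippet needs them (in particular, partition here always pads a short final chunk):
from itertools import chain, islice

def drop(n, seq):
    return islice(seq, n, None)   # skip the first n items

def take(n, seq):
    return islice(seq, n)         # keep at most the first n items

def concatv(*seqs):
    return chain(*seqs)           # lazily concatenate iterables

def partition(n, seq, pad=None):
    # Yield n-sized tuples, padding a short final chunk with `pad`.
    it = iter(seq)
    while True:
        chunk = tuple(islice(it, n))
        if not chunk:
            return
        yield chunk + (pad,) * (n - len(chunk))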
Usage example:
>>> rdd = sc.parallelize(range(10))
>>> grouped(rdd, 3).collect()
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
If you want to keep a consistent API, you can monkey-patch the RDD class:
>>> from pyspark.rdd import RDD
>>> RDD.grouped = grouped
>>> rdd.grouped(4).collect()
[(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, None, None)]
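The pad argument can also be set to something other than None; assuming the same 10-element RDD, a call along these lines:
>>> rdd.grouped(3, pad=0).collect()   # should give [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 0, 0)]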
You can find basic tests on GitHub.