I used this PySpark code:
signatures = signatures.groupByKey().map(lambda x: (x[0], list(x[1]))).cache().collect()
I got this output (the collected rows of the RDD):
(1, [[1, 31891011288540205849559551829790241508456516432], [1, 28971434183002082500813759681605076406295898007], [1, 84354247191629359011614612371642003229438145118], [1, 14879564999779411946535520329978444194295073263], [1, 28999405396879353085885150485918753398187917441], [3, 274378016236983705444587880288109426115402687], [3, 120052627645426913871540455290804229381930764767], [3, 113440107283022891200151098422815365240954899060], [3, 95554518001487118601391311753326782629149232562], [3, 84646902172764559093309166129305123869359546269], [5, 6236085341917560680351285350168314740288121088], [5, 28971434183002082500813759681605076406295898007], [5, 47263781832612219468430472591505267902435456768], [5, 48215701840864104930367382664962486536872207556], [5, 28999405396879353085885150485918753398187917441]])
(0, [[2, 6236085341917560680351285350168314740288121088], [2, 28971434183002082500813759681605076406295898007], [2, 47263781832612219468430472591505267902435456768], [2, 48215701840864104930367382664962486536872207556], [2, 28999405396879353085885150485918753398187917441], [4, 6236085341917560680351285350168314740288121088], [4, 28971434183002082500813759681605076406295898007], [4, 47263781832612219468430472591505267902435456768], [4, 48215701840864104930367382664962486536872207556], [4, 28999405396879353085885150485918753398187917441]])
Now I need to reduce this RDD so that, for each row, I get output like the following:
(1, [[1, 31891011288540205849559551829790241508456516432, 28971434183002082500813759681605076406295898007, ...], [3, 274378016236983705444587880288109426115402687, 120052627645426913871540455290804229381930764767, ...]])
And likewise for the second row.
Basically: the first element of each row should stay the same, while the second element (the list of [id, value] pairs) should be regrouped into a list of lists, where each inner list starts with a shared id and is followed by the second elements of all the original pairs that have that id.
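In plain Python terms, the regrouping I am after looks like this on a single row (a minimal sketch with made-up small values; sorting first is needed because itertools.groupby only merges adjacent items):

import itertools
import operator

# One row of the RDD: (key, list of [signature_id, value] pairs).
# The small values are hypothetical placeholders for the big integers above.
row = (1, [[1, 10], [3, 30], [1, 20], [3, 40]])

# Sort the pairs by signature_id so groupby sees equal ids adjacently,
# then turn each group into [signature_id, value1, value2, ...].
grouped = [[k] + [pair[1] for pair in group]
           for k, group in itertools.groupby(sorted(row[1], key=operator.itemgetter(0)),
                                             operator.itemgetter(0))]

print((row[0], grouped))  # -> (1, [[1, 10, 20], [3, 30, 40]])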
I tried the following code:
signatures = signatures.map(lambda x: (x[0],
    [k.append(x[1][1]) for x[1] in g
     for k, g in itertools.groupby(sorted(itertools.chain.from_iterable(x[1])),
                                   operator.itemgetter(0))])).collect()
But the result was just the original list.
I need a solution that performs this reduction with a comprehension inside a map, rather than resorting to filter operations in PySpark, along the lines of what I tried above.
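For reference, here is a minimal sketch of how that comprehension could be written inside map(). Two assumptions: signatures must still be an RDD at this point (collect() returns a plain Python list, so this step has to run before collecting), and sorted() is added because itertools.groupby only groups adjacent keys:

import itertools
import operator

# Sketch: regroup each row's [signature_id, value] pairs by signature_id,
# producing [signature_id, value1, value2, ...] per group.
signatures = signatures.map(
    lambda x: (x[0],
               [[k] + [pair[1] for pair in group]
                for k, group in itertools.groupby(sorted(x[1], key=operator.itemgetter(0)),
                                                  operator.itemgetter(0))])
).collect()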