I have two RDDs, one that is very large and another that is much smaller. I want to find all unique tuples in the large RDD whose keys appear in the small RDD.
For example:
large_rdd = sc.parallelize([('abcdefghij'[i%10], i) for i in range(100)] * 5)
small_rdd = sc.parallelize([('zab'[i%3], i) for i in range(10)])
expected_rdd = [
    ('a', [1, 4, 7, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90]),
    ('b', [2, 5, 8, 1, 11, 21, 31, 41, 51, 61, 71, 81, 91])]
There are two expensive operations in my solution - join and distinct. I assume that both cause a full shuffle and leave the child RDD hash partitioned. Given that, is the following the best I can do?
keys = sc.broadcast(small_rdd.keys().distinct().collect())
filtered_unique_large_rdd = (large_rdd
    .filter(lambda kv: kv[0] in keys.value)
    .distinct()
    .groupByKey())

(filtered_unique_large_rdd
    .join(small_rdd.groupByKey())
    .mapValues(lambda x: sum([list(i) for i in x], []))
    .collect())
Basically, I filter the tuples explicitly, pick the distinct ones and then join with small_rdd. I am hoping that the distinct operation will place the keys in a hash partitioner and will not cause another shuffle during the subsequent join.
Thanks in advance for any suggestions/ideas.
PS: It is not a duplicate of Which function in spark is used to combine two RDDs by keys, because there join (a full shuffle) is an option.

Answer (score: 1):
There are two expensive operations in my solution - join and distinct.
Actually, there are three expensive operations here. You should add groupByKey to the list.
I am hoping that the distinct operation will place the keys in a hash partitioner and will not cause another shuffle during the subsequent join.
distinct won't, but the subsequent groupByKey will. The problem is that it requires your data to be shuffled twice - once for distinct and once for groupByKey.
filtered_unique_large_rdd.toDebugString()
## (8) PythonRDD[27] at RDD at PythonRDD.scala:43 []
## | MapPartitionsRDD[26] at mapPartitions at PythonRDD.scala:374 []
## | ShuffledRDD[25] at partitionBy at NativeMethodAccessorImpl.java:-2 []
## +-(8) PairwiseRDD[24] at groupByKey at <ipython-input-11-8a3af1a8d06b>:2 []
## | PythonRDD[23] at groupByKey at <ipython-input-11-8a3af1a8d06b>:2 []
## | MapPartitionsRDD[22] at mapPartitions at PythonRDD.scala:374 []
## | ShuffledRDD[21] at partitionBy at NativeMethodAccessorImpl.java:-2 []
## +-(8) PairwiseRDD[20] at distinct at <ipython-input-11-8a3af1a8d06b>:2 []
## | PythonRDD[19] at distinct at <ipython-input-11-8a3af1a8d06b>:2 []
## | ParallelCollectionRDD[2] at parallelize at PythonRDD.scala:423 []
You can try to replace distinct followed by groupByKey with a single aggregateByKey:
# Collect unique values per key into a set in a single shuffle.
zeroValue = set()

def seqFunc(acc, x):
    # Add one value to the per-partition accumulator.
    acc.add(x)
    return acc

def combFunc(acc1, acc2):
    # Merge accumulators for the same key across partitions.
    acc1.update(acc2)
    return acc1

grouped_by_aggregate = (large_rdd
    .filter(lambda kv: kv[0] in keys.value)
    .aggregateByKey(zeroValue, seqFunc, combFunc))
Compared to your current solution, it has to shuffle large_rdd only once:
grouped_by_aggregate.toDebugString()
## (8) PythonRDD[54] at RDD at PythonRDD.scala:43 []
## | MapPartitionsRDD[53] at mapPartitions at PythonRDD.scala:374
## | ShuffledRDD[52] at partitionBy at NativeMethodAccessorImpl.java:-2 []
## +-(8) PairwiseRDD[51] at aggregateByKey at <ipython-input-60-67c93b2860a0 ...
## | PythonRDD[50] at aggregateByKey at <ipython-input-60-67c93b2860a0> ...
## | ParallelCollectionRDD[2] at parallelize at PythonRDD.scala:423 []
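As a minimal sketch, grouped_by_aggregate can then take the place of filtered_unique_large_rdd in the question's join, reusing the same flattening step; because the values are already deduplicated sets, no separate distinct is needed:

(grouped_by_aggregate
    .join(small_rdd.groupByKey())
    .mapValues(lambda x: sum([list(i) for i in x], []))  # flatten the (set, iterable) pair into one list
    .collect())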
One more possible improvement is to convert the keys to a set before broadcasting:
keys = sc.broadcast(set(small_rdd.keys().distinct().collect()))
Right now your code performs a linear search over the list for each element it filters.
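The filter itself stays the same - only the type of the broadcast value changes. As a rough sketch, with a set each membership test is an average O(1) hash lookup instead of an O(n) scan of the list:

# keys.value is now a set, so `in` is a hash lookup rather than a list scan.
large_rdd.filter(lambda kv: kv[0] in keys.value)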