Question

I am working in Spark using PySpark. I have an RDD of the form [(key, (num, (min, max, count))),....].

When I use the following lambda with filter:

t = fullBids.filter(lambda (value, stats): (stats[2] > 10 and stats[0] < value and value < stats[1]))

it fails with this error:

tuple index out of range

But when I use the same lambda in mapValues, it runs successfully and correctly returns True or False.

ti = fullBids.mapValues(lambda (value, stats): (stats[2] > 10 and stats[0] < value and value < stats[1]))

I expected the filter to work, but it does not. Can someone explain what I am missing here?

Answer 1

Break down the format of each RDD element the way the lambda in filter unpacks it:

(key, (num, (min, max, count)))

key = value
(num, (min, max, count)) = stats
num = stats[0]
(min, max, count) = stats[1]
min = stats[1][0]
max = stats[1][1]
count = stats[1][2]

stats has only two elements, so stats[2] is out of range.
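The breakdown above can be reproduced in plain Python, without Spark. This is a minimal sketch; the sample record and its numbers are made up for illustration and only follow the (key, (num, (min, max, count))) layout from the question.

```python
# Hypothetical sample element in the (key, (num, (min, max, count))) layout.
record = ("item-1", (5, (1, 9, 12)))

# This is the unpacking that `lambda (value, stats)` performs inside filter,
# because filter hands the whole (key, value) pair to the function.
value, stats = record

print(value)        # the key, not num
print(len(stats))   # stats is (num, (min, max, count)), so it has 2 elements
print(stats[1][2])  # count sits one level deeper, at stats[1][2]

# Indexing stats[2] therefore fails exactly as in the question.
try:
    stats[2]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)
```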

Answer 2

When you call filter, value is the key of the key-value pair RDD and stats is its value, (num, (min, max, count)), which is why you get tuple index out of range.

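When you call mapValues, value is num and stats is (min, max, count). The mapValues transformation passes only the value of each key-value pair to the function, not the whole pair.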
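One possible way to get the intended filter is to unpack the full pair inside the predicate. The sketch below assumes the goal is the same test that mapValues was computing (count > 10 and min < num < max); the function names in_range and in_range_value and the sample data are hypothetical. It uses named functions because tuple parameters in a lambda, as written in the question, are Python 2 syntax and are not accepted by Python 3. Plain lists stand in for the RDD so the example runs without a Spark installation.

```python
def in_range(record):
    """Predicate for filter: receives the whole (key, value) pair."""
    _key, (num, (lo, hi, count)) = record
    return count > 10 and lo < num < hi


def in_range_value(value):
    """Function for mapValues: receives only the value part of the pair."""
    num, (lo, hi, count) = value
    return count > 10 and lo < num < hi


# Made-up records in the (key, (num, (min, max, count))) layout.
sample = [
    ("a", (5, (1, 9, 12))),   # passes both conditions
    ("b", (5, (1, 9, 3))),    # count is not greater than 10
    ("c", (20, (1, 9, 12))),  # num is outside (min, max)
]

# Stand-in for fullBids.filter(in_range): keeps whole records.
kept = [record for record in sample if in_range(record)]

# Stand-in for fullBids.mapValues(in_range_value): keeps keys, maps values.
flags = [(key, in_range_value(value)) for key, value in sample]

print(kept)
print(flags)
```

With an actual RDD, the corresponding calls would be fullBids.filter(in_range) and fullBids.mapValues(in_range_value).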

PySpark functions in mapValues and filter
