PySpark - Filtering an RDD based on another RDD - broadcasting an RDD

Date: 2018-12-07 18:54:05

Tags: apache-spark filter pyspark

I have two RDDs: contents & remove

Both are RDDs containing words. What I want is to filter out of contents every word that appears in remove. I am trying:

filter = contents.filter(lambda line: line[0] not in remove.collect()).collect()

But this gives me:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

Is it not possible to do this with filter and the not in operator?

Thanks!

1 Answer:

Answer 0 (score: 1)

As I recall, you cannot broadcast an RDD; it is already distributed across the cluster. Your error confirms this.

You do not need to parallelize the list of words to remove. Keep it as a plain Python collection, which you can broadcast (or not). For example:

rdd = sc.parallelize(range(10))
remove = [5, 6]                   # a plain Python list, not an RDD
broadcast = sc.broadcast(remove)  # shipped once to every executor
rdd.filter(lambda x: x not in broadcast.value).collect()
# [0, 1, 2, 3, 4, 7, 8, 9]