PySpark: intersecting a list with nested lists in an RDD

Date: 2018-10-21 19:04:05

Tags: apache-spark pyspark

I have a list of lists in an RDD, and a separate list to intersect with. B needs to be intersected with each list in A.

A = [[a,b,c,d],[e,f,g,h]....]
B = [a,b,c,d,e,f,g,h]

I need to intersect these two to get the common letters. I used the following, but it failed with a TypeError:

pwords = A.intersection(B)

Then, following some suggestions on Stack Overflow, I tried using parallelize, but got an error:

text_words = sc.parallelize(A)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/context.py", line 501, in parallelize
    c = list(c)    # Make it a list so we can compute its length
TypeError: 'PipelinedRDD' object is not iterable

When I tried converting to a list as suggested in the error message, I ran into the same error again:

TypeError: 'PipelinedRDD' object is not iterable 

I also tried following Find intersection of two nested lists? and received this error:

TypeError: 'PipelinedRDD' object is not iterable

1 Answer:

Answer 0 (score: 0)

I'm not 100% sure about your question, but I assume you have the nested lists as an RDD and want to intersect them with the static list B. In that case, each item in every nested list should be checked for membership in B and kept if it is present.

If the order of the elements does not matter, you can use the following code:

A = [["a","b","c","d"],["e","f","g","h"],["i","j","k","l"],["a","b","x","f","y"]]
B = ["a","b","c","d","e","f","g","h"]

text_words = sc.parallelize(A)
text_words.map(lambda x: list(set(x) & set(B))).collect()

Output:

[['a', 'c', 'b', 'd'], ['h', 'e', 'g', 'f'], [], ['a', 'b', 'f']]
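Note that `set(x) & set(B)` does not preserve the order of elements within each sublist, which is why the output above is shuffled. If order matters, a list comprehension that filters each sublist against a precomputed set of B keeps the original ordering. The sketch below shows the same per-list logic in plain Python; in Spark you would put the comprehension inside the `map` lambda instead (and for a large B, wrapping `B_set` in `sc.broadcast(...)` would avoid reshipping it with every task):

```python
A = [["a", "b", "c", "d"], ["e", "f", "g", "h"],
     ["i", "j", "k", "l"], ["a", "b", "x", "f", "y"]]
B = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Precompute a set once so membership tests are O(1) per element.
B_set = set(B)

# Order-preserving intersection: keep each letter of a sublist that appears in B.
result = [[w for w in row if w in B_set] for row in A]
print(result)
# → [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h'], [], ['a', 'b', 'f']]

# Equivalent Spark version (assuming an existing SparkContext `sc`):
# text_words = sc.parallelize(A)
# text_words.map(lambda row: [w for w in row if w in B_set]).collect()
```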