I have a list of lists in an RDD and a single list to intersect with it. B needs to be intersected with each list in A.
A = [[a,b,c,d],[e,f,g,h]....]
B = [a,b,c,d,e,f,g,h]
I need to intersect the two to get the letters they have in common. I used the following, but it fails with a TypeError:
pwords = A.intersection(B)
Then, following some suggestions on Stack Overflow, I tried using parallelize, but that also raised an error:
text_words = sc.parallelize(A)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/context.py", line 501, in parallelize
    c = list(c)    # Make it a list so we can compute its length
TypeError: 'PipelinedRDD' object is not iterable
When I tried converting it to a list, as shown in the error message, I got the error again:
TypeError: 'PipelinedRDD' object is not iterable
I tried following "Find intersection of two nested lists?" and got this error:
TypeError: 'PipelinedRDD' object is not iterable
Answer 0 (score: 0)
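I am not 100% sure I understand your question, but I assume you have the nested list as an RDD and want to intersect it with the static list B. In that case, each item in each nested list should be checked for membership in B and kept only if it is present.
If the order of the elements does not matter, you can use the following code: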
A = [["a","b","c","d"],["e","f","g","h"],["i","j","k","l"],["a","b","x","f","y"]]
B = ["a","b","c","d","e","f","g","h"]
text_words = sc.parallelize(A)
text_words.map(lambda x: list(set(x) & set(B))).collect()
Output:
[['a', 'c', 'b', 'd'], ['h', 'e', 'g', 'f'], [], ['a', 'b', 'f']]
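If the order of the elements does matter, a list comprehension can replace the set intersection. The following is only a minimal sketch, not something from the original post: `keep_common` and `b_set` are hypothetical names, and the built-in `map` stands in for `RDD.map` so the snippet runs without Spark.

```python
# Plain-Python sketch; the sample data is the same as in the answer above.
A = [["a", "b", "c", "d"], ["e", "f", "g", "h"],
     ["i", "j", "k", "l"], ["a", "b", "x", "f", "y"]]
B = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Build the lookup set once instead of once per sublist.
b_set = set(B)

def keep_common(sublist, allowed=b_set):
    """Return the items of sublist that are in allowed, in original order."""
    return [item for item in sublist if item in allowed]

# The built-in map stands in for RDD.map here.
result = list(map(keep_common, A))
print(result)

# With Spark, assuming text_words is already an RDD of lists:
#   text_words.map(keep_common).collect()
```

Note that the traceback in the question says a 'PipelinedRDD' was passed to sc.parallelize, which suggests A was already an RDD there. If so, map can probably be called on it directly, without parallelize.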