Question

I have an RDD of (adj, noun) pairs, and I want to use mapPartitions to find the first (adj, noun) pair whose noun is "unification".

numPartitions = 10
lines = sc.textFile('adj_noun_pairs.txt', numPartitions)
pairs = lines.map(lambda l: tuple(l.split())).filter(lambda p: len(p)==2)
pairs.cache()

def f(iterator):
    for i in iterator:
    if (lambda x: 'unification' in x):
         yield i
    i+=1


result = lines.mapPartitions(f).collect()

File "", line 9, in TypeError: coercing to Unicode: need string or buffer, int found

Answer 1

Here is a working version:

numPartitions = 10
lines = sc.textFile('adj_noun_pairs.txt', numPartitions)
pairs = lines.map(lambda l: tuple(l.split())).filter(lambda p: len(p)==2)
pairs.cache()

def f(iterator):
    for i in iterator:
        if u'unification' == tuple(i)[1]:
            yield i


pairs.mapPartitions(f).first()

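I use the pairs RDD, which you defined but never used. I also compare against u'unification' rather than 'unification', because every word is of type unicode (I tested this on Python 2; I don't think the prefix is needed on Python 3). Inside f, iterator is the collection of pairs in one partition, so I iterate over it and yield each pair whose noun is "unification". pairs.mapPartitions(f) is then a collection of the matching tuples from all partitions, so I take the first one with first().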
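If only the first match is needed, the generator can stop scanning a partition as soon as it finds one. Below is a minimal sketch of that variant, runnable without Spark: plain Python lists stand in for partitions, and both the sample pairs and the name first_match_per_partition are made up for illustration.

```python
def first_match_per_partition(iterator, noun=u'unification'):
    """Yield at most one (adj, noun) pair: the first one whose noun matches."""
    for pair in iterator:
        if pair[1] == noun:
            yield pair
            return  # stop scanning this partition after its first match


# Hypothetical sample data: each inner list stands in for one partition.
partitions = [
    [(u'big', u'house'), (u'european', u'unification')],
    [(u'red', u'car')],
    [(u'german', u'unification'), (u'political', u'unification')],
]

# Emulate mapPartitions by applying the function to each partition in order.
matches = [pair
           for part in partitions
           for pair in first_match_per_partition(iter(part))]

# The first element plays the role of .first() on the resulting RDD.
print(matches[0])
```

With a real RDD the same function should drop in as pairs.mapPartitions(first_match_per_partition).first(). Returning after the first yield keeps each partition from emitting more than one tuple, which matters mostly when matches are frequent.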
