Question

I have an ordered list of the individual words from a document, like so:

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]

I have a second list of tuples of significant bigrams/collocations, like so:

bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]

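I would like to iterate through the list of individual words and replace adjacent words with an underscore-separated bigram, ending up with a list like this: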

words_fixed = ['apple_orange', 'boat', 'car', 'happy_day', 'cow', ...]

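I have considered flattening both words and bigrams into strings (" ".join(words), etc.) and then using a regex to find and replace the adjacent words, but that seems horribly inefficient and unpythonic.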

What is the best way to quickly match and combine adjacent list elements against a list of tuples?

Answer 1

Not as flashy as @inspectorG4dget's:

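words_fixed = []
last = None  # previous word, held back until we know whether it starts a bigram
for word in words:
    if (last, word) in bigrams:
        words_fixed.append("%s_%s" % (last, word))
        last = None  # both words are consumed
    else:
        if last is not None:
            words_fixed.append(last)
        last = word
if last is not None:
    words_fixed.append(last)  # flush the final pending word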
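
One thing to note about the loop above: `(last, word) in bigrams` scans the whole list of tuples on every step. Below is a minimal sketch of the same idea wrapped in a function, with the bigrams converted to a set first so that each lookup takes constant time on average. The function name `merge_bigrams` is made up for illustration and is not part of the original answer.

```python
def merge_bigrams(words, bigrams):
    """Join adjacent words that form a known bigram with an underscore."""
    bigram_set = set(bigrams)  # average O(1) membership tests
    merged = []
    last = None  # word waiting to see whether its successor completes a bigram
    for word in words:
        if last is not None and (last, word) in bigram_set:
            merged.append("%s_%s" % (last, word))
            last = None  # both words consumed
        else:
            if last is not None:
                merged.append(last)
            last = word
    if last is not None:
        merged.append(last)  # flush the final pending word
    return merged


words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(merge_bigrams(words, bigrams))
# ['apple_orange', 'boat', 'car', 'happy_day', 'cow']
```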

Answer 2

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]

First, a bit of optimization:

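import collections
bigram_pairs = bigrams  # keep a handle on the original list of tuples
bigrams = collections.defaultdict(set)
for w1, w2 in bigram_pairs:
    bigrams[w1].add(w2)  # first word -> set of words that may follow it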
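
To make the shape of that lookup table concrete, here is a small sketch on made-up data. The extra pair `('apple', 'pie')` and the names `pairs` and `index` are assumptions for illustration only. Each first word maps to the set of words that may follow it, so one first word can have several partners:

```python
import collections

pairs = [('apple', 'orange'), ('apple', 'pie'), ('happy', 'day')]

index = collections.defaultdict(set)
for w1, w2 in pairs:
    index[w1].add(w2)

# Prefer .get() for lookups: index[w] on a defaultdict would insert an empty set.
print('pie' in index.get('apple', ()))   # True
print('day' in index.get('apple', ()))   # False
print('boat' in index)                   # False
```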

Now, on to the fun part:

words_fixed = []
# Pair every word with its successor; zip stops at the shorter input.
for w1, w2 in zip(words, words[1:]):
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))

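If you also want to keep the words that are not part of any bigram, in addition to the pairs recorded in your bigrams, then this should do the trick: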

words_fixed = []
i = 0
while i < len(words):
    # Merge words[i] with its successor when the pair is a known bigram.
    if i + 1 < len(words) and words[i + 1] in bigrams.get(words[i], ()):
        words_fixed.append("%s_%s" % (words[i], words[i + 1]))
        i += 2  # skip the word that was just consumed
    else:
        words_fixed.append(words[i])
        i += 1

Answer 3

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

bigrams_dict = dict(item for item in bigrams)
bigrams_dict.update(item[::-1] for item in bigrams)

words_fixed = ["{}_{}".format(word, bigrams_dict[word]) 
    if word in bigrams_dict else word
    for word in words]

[edit] Another way to build the dictionary:

from itertools import chain
bigrams_rev = (reversed(x) for x in bigrams)
bigrams_dict = dict(chain(bigrams, bigrams_rev))
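
Note that the comprehension above looks up each word on its own: it does not check that the partner word is actually the next element, and the partner stays in the list as a separate entry. If only truly adjacent pairs should be merged, one possible sketch reuses the same two-way `bigrams_dict` and walks the list pairwise. The helper name `join_adjacent` is hypothetical, and the sketch assumes `words` is a list and contains no `None` entries.

```python
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

# Same two-way lookup table as in the answer above.
bigrams_dict = dict(bigrams)
bigrams_dict.update(pair[::-1] for pair in bigrams)


def join_adjacent(words, bigrams_dict):
    """Merge a word with its successor only when the two are partners."""
    result = []
    skip = False  # True when the current word was merged into the previous token
    for word, nxt in zip(words, words[1:] + [None]):
        if skip:
            skip = False
        elif nxt is not None and bigrams_dict.get(word) == nxt:
            result.append("{}_{}".format(word, nxt))
            skip = True
        else:
            result.append(word)
    return result


print(join_adjacent(words, bigrams_dict))
# ['apple_orange', 'boat', 'car', 'happy_day', 'cow']
```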

Answer 4

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print('words   :', words)
print('bigrams :', bigrams)
print()

def zwii(words, bigrams):
    dict_bigrams = dict(bigrams)  # first word -> second word
    it = iter(words)
    x = next(it, None)
    while x is not None:
        y = next(it, None)  # look ahead; None once the words run out
        if y is not None and dict_bigrams.get(x) == y:
            yield '-'.join((x, y))
            x = next(it, None)
        else:
            yield x
            x = y  # y may itself start a bigram, so examine it next

print(list(zwii(words, bigrams)))

Result

words   : ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams : [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

['apple-orange', 'boat', 'car', 'happy-day', 'cow', 'big']

Matching adjacent list elements against a list of tuples in Python

4 answers: