I have an ordered list of the individual words from a document, like so:
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
I have a second list of significant bigrams/collocations, stored as tuples, like so:
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]
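I would like to iterate through the list of individual words and replace adjacent words with an underscore-separated bigram, ending up with a list like this: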
words_fixed = ['apple_orange', 'boat', 'car', 'happy_day', 'cow', ...]
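I considered flattening words and bigrams into strings (" ".join(words), etc.) and then using a regular expression to find and replace the adjacent words, but that seems very inefficient and unpythonic.
What is the best way to quickly match and combine adjacent list elements against a list of tuples?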
Answer 0 (score: 2)
Not as flashy as @inspectorG4dget's:
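words_fixed = []
last = None  # previous word that has not been emitted yet
for word in words:
    if (last, word) in bigrams:
        # The previous and current word form a bigram: emit them joined
        words_fixed.append("%s_%s" % (last, word))
        last = None
    else:
        if last:
            words_fixed.append(last)
        last = word
# Emit the final word if it was not consumed by a bigram
if last:
    words_fixed.append(last)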
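The test `(last, word) in bigrams` scans the whole list on every word. Converting the bigrams to a set first should make each lookup constant-time on average. Here is a minimal sketch of the same idea with an index-based loop; the function name `merge_bigrams` is made up for illustration and is not part of the answer above.

```python
def merge_bigrams(words, bigrams):
    """Join adjacent words that form a known bigram with an underscore."""
    bigram_set = set(bigrams)  # set membership is O(1) on average
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigram_set:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2  # skip the word that was just consumed
        else:
            merged.append(words[i])
            i += 1
    return merged


words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(merge_bigrams(words, bigrams))
# ['apple_orange', 'boat', 'car', 'happy_day', 'cow']
```

A word is consumed at most once, so overlapping candidates are resolved left to right.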
Answer 1 (score: 1)
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]
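First, some optimization:
import collections
bigram_pairs = bigrams  # keep a reference to the original list of tuples
# Map each first word to the set of words that may follow it
bigrams = collections.defaultdict(set)
for w1, w2 in bigram_pairs:
    bigrams[w1].add(w2)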
Now, on to the fun stuff:
words_fixed = []
# Pair each word with its successor and keep only the pairs that are known bigrams
for w1, w2 in zip(words, words[1:]):
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))
If you also want to keep the words that are not part of any bigram, in addition to the bigrams themselves, then this should do the trick:
words_fixed = []
skip = False  # True when the current word was already merged into a bigram
# Pad the successor list with None so the last word is also visited
for w1, w2 in zip(words, words[1:] + [None]):
    if skip:
        skip = False
        continue
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))
        skip = True
    else:
        words_fixed.append(w1)
Answer 2 (score: 1)
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
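# Map every bigram word to its partner, in both directions
bigrams_dict = dict(bigrams)
bigrams_dict.update(item[::-1] for item in bigrams)
# Note: this looks up each word on its own and does not check its neighbour,
# so both halves of a pair are rewritten (e.g. 'apple_orange', 'orange_apple')
words_fixed = ["{}_{}".format(word, bigrams_dict[word])
               if word in bigrams_dict else word
               for word in words]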
[edit] Another way to create the dictionary:
from itertools import chain
bigrams_rev = (reversed(x) for x in bigrams)
bigrams_dict = dict(chain(bigrams, bigrams_rev))
Answer 3 (score: 1)
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print('words :', words)
print('bigrams :', bigrams)
print()
def zwii(words, bigrams):
    it = iter(words)
    dict_bigrams = dict(bigrams)  # first word -> second word
    for x in it:
        if x in dict_bigrams:
            try:
                y = next(it)
            except StopIteration:
                # x was the last word, so there is nothing to pair it with
                yield x
            else:
                if dict_bigrams[x] == y:
                    yield '-'.join((x, y))
                else:
                    yield x
                    yield y
        else:
            yield x
print(list(zwii(words, bigrams)))
Result
words : ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams : [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
['apple-orange', 'boat', 'car', 'happy-day', 'cow', 'big']
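The generator above uses `dict(bigrams)`, which keeps only one second word per first word. It also emits the look-ahead word without checking whether that word starts a bigram of its own. The following is a rough sketch of one way to handle both cases, not a replacement for the answer. The names `join_bigrams`, `followers` and `_END` are invented for this example.

```python
from collections import defaultdict

_END = object()  # sentinel marking the end of the word stream


def join_bigrams(words, bigrams, sep='-'):
    """Yield words, joining adjacent pairs that appear in bigrams."""
    # Map each first word to every word that may follow it
    followers = defaultdict(set)
    for first, second in bigrams:
        followers[first].add(second)

    it = iter(words)
    current = next(it, _END)
    while current is not _END:
        nxt = next(it, _END)
        if nxt is not _END and nxt in followers.get(current, ()):
            yield sep.join((current, nxt))
            current = next(it, _END)
        else:
            yield current
            current = nxt  # re-examine the look-ahead word as a possible first word


words = ['happy', 'apple', 'orange', 'big']
bigrams = [('happy', 'day'), ('apple', 'pie'), ('apple', 'orange'), ('big', 'house')]
print(list(join_bigrams(words, bigrams)))
# ['happy', 'apple-orange', 'big']
```

Because it only advances an iterator, this version also works on word streams that are not lists.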