Finding bigrams in a list of words

Time: 2016-03-20 22:44:58

Tags: python python-3.x nltk corpus

How can I find a bigram in a list? For example, if I want to find

bigram = list(nltk.bigrams("New York"))

in the list of words

words = nltk.corpus.brown.words(fileids=["ca44"])

I tried

for t in bigram:
    if t in words:
        pass  # do something

as well as

if bigram in words:
    pass  # do something

2 answers:

Answer 0: (score: 2)

.bigrams() returns a generator of tuples. You should convert the tuples to strings first. For example:

bigram_strings = [''.join(t) for t in bigram]

Then you can do

for t in bigram_strings:
    if t in words:
        pass  # do something
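
One thing worth checking before joining: nltk.bigrams pairs up the items of whatever sequence it receives, so a plain string such as "New York" is split into character pairs rather than word pairs. The sketch below illustrates the difference without needing NLTK; the local bigrams helper is a hypothetical stand-in built on zip, assumed here to pair adjacent items the same way nltk.bigrams does.

```python
def bigrams(sequence):
    """Hypothetical stand-in for nltk.bigrams: adjacent pairs of a sequence."""
    seq = list(sequence)
    return list(zip(seq, seq[1:]))

char_pairs = bigrams("New York")          # a string is iterated character by character
word_pairs = bigrams("New York".split())  # a list of words is iterated word by word

print(char_pairs[:3])  # [('N', 'e'), ('e', 'w'), ('w', ' ')]
print(word_pairs)      # [('New', 'York')]

# Joining the character pairs gives two-character fragments, not words.
joined = ["".join(t) for t in char_pairs]
print(joined)  # ['Ne', 'ew', 'w ', ' Y', 'Yo', 'or', 'rk']
```

If word pairs are what you are after, splitting the phrase first gives a single ('New', 'York') tuple, which can then be compared against adjacent word pairs from the corpus.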

Answer 1: (score: 1)

You can write a generator that produces the bigrams of your word list:

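from itertools import tee

def pairwise(iterable):
    """Iterate over adjacent pairs of an iterable."""
    # tee gives two independent iterators, even when the input is itself an iterator
    i, j = tee(iterable)
    # Advance the second iterator by one item; the default avoids an error on empty input
    next(j, None)
    yield from zip(i, j)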

(For example, list(pairwise(["this", "is", "a", "test"])) returns [('this', 'is'), ('is', 'a'), ('a', 'test')].)

Then compare each pair it yields with the result of .bigrams():

for pair in pairwise(words):
    # Split the phrase first: passing the raw string would yield character pairs
    for bigram in nltk.bigrams("New York".split()):
        if bigram == pair:
            pass  # found
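
Putting the pieces together, here is a minimal self-contained sketch of the lookup. The find_bigram helper and the sample token list are hypothetical, made up for illustration; with NLTK installed, tokens would instead be something like the Brown corpus words from the question.

```python
def pairwise(iterable):
    """Yield adjacent pairs of any iterable (yields nothing for fewer than two items)."""
    it = iter(iterable)
    prev = next(it, None)
    for item in it:
        yield prev, item
        prev = item

def find_bigram(tokens, target):
    """Return every index i where (tokens[i], tokens[i + 1]) equals target."""
    return [i for i, pair in enumerate(pairwise(tokens)) if pair == target]

# Hypothetical sample data standing in for a tokenized corpus.
tokens = ["He", "moved", "to", "New", "York", "and", "left", "New", "York", "later"]
target = tuple("New York".split())  # ('New', 'York')

positions = find_bigram(tokens, target)
print(positions)  # [3, 7]
```

Returning the positions rather than a plain yes/no makes it easy to inspect the surrounding context of each match, and bool(positions) still answers whether the bigram occurs at all.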