How do I find bigrams that contain predefined words?

Time: 2018-12-18 19:22:10

Tags: python nlp nltk

I know that bigrams containing one specific word can be found with the example from the link below:

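import nltk
from nltk.collocations import BigramCollocationFinder

finder = BigramCollocationFinder.from_words(text.split())
# apply_ngram_filter drops every bigram for which the filter returns True,
# so only bigrams containing "man" survive
word_filter = lambda w1, w2: "man" not in (w1, w2)
finder.apply_ngram_filter(word_filter)

bigram_measures = nltk.collocations.BigramAssocMeasures()
raw_freq_ranking = finder.nbest(bigram_measures.raw_freq, 10)  # top 10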

nltk: how to get bigrams containing a specific word

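But I am not sure how to apply this when I need bigrams in which both words are predefined.

Example:

My sentence: "hello, yesterday I have seen a man walking. On the other side there was another man yelling: "who are you, man?""

Given the list ["yesterday", "other", "I", "side"], how can I get the list of bigrams made up of the given words, i.e. [("yesterday", "I"), ("other", "side")]?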

2 answers:

Answer 0 (score: 1)

What you want is probably a word_filter function that returns False only when all the words in a given bigram are in the list:

def word_filter(x, y):
    if x in lst and y in lst:
        return False
    return True

where lst = ["yesterday", "I", "other", "side"]

Note that this function reads lst from the enclosing scope. That is risky, so make sure nothing modifies lst inside the word_filter function.
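One way to sidestep the outer-scope caveat is to build the filter from a private copy of the word list. The following is a minimal sketch, not the answerer's code: make_word_filter is a hypothetical helper name, and the tokenization is plain text.split(), so punctuation stays attached to tokens such as "hello,".

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def make_word_filter(allowed_words):
    """Build a filter that rejects bigrams not made up entirely of allowed words."""
    # Private, immutable copy: later changes to the caller's list have no effect
    allowed = frozenset(allowed_words)

    def word_filter(w1, w2):
        # apply_ngram_filter drops every bigram for which this returns True
        return not (w1 in allowed and w2 in allowed)

    return word_filter


text = ('hello, yesterday I have seen a man walking. On the other side '
        'there was another man yelling: "who are you, man?"')

finder = BigramCollocationFinder.from_words(text.split())
finder.apply_ngram_filter(make_word_filter(["yesterday", "other", "I", "side"]))

# Rank the surviving bigrams by raw frequency and keep at most 10
matches = finder.nbest(BigramAssocMeasures().raw_freq, 10)
print(sorted(matches))  # [('other', 'side'), ('yesterday', 'I')]
```

The filter is order-agnostic, so a bigram such as ("I", "yesterday") would also be kept if it occurred in the text.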

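Answer 1 (score: 0)

First, create all possible bigrams from the vocabulary and pass them to CountVectorizer as its vocabulary, which turns the given text into bigram counts.

Then filter the generated bigrams by the counts that CountVectorizer returns.

Note: I have changed the token pattern to account for single-character tokens. By default, single characters are skipped.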
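The effect of the token pattern can be checked with the standard re module alone. This sketch assumes the default CountVectorizer pattern is r"(?u)\b\w\w+\b" and that tokens are extracted with findall; the sample string is made up for illustration.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # assumed CountVectorizer default: 2+ word characters
RELAXED_PATTERN = r"(?u)\b\w+\b"    # relaxed pattern: 1+ word characters

sample = "yesterday i have seen a man"

default_tokens = re.findall(DEFAULT_PATTERN, sample)
relaxed_tokens = re.findall(RELAXED_PATTERN, sample)

print(default_tokens)  # ['yesterday', 'have', 'seen', 'man']
print(relaxed_tokens)  # ['yesterday', 'i', 'have', 'seen', 'a', 'man']
```

With the default pattern the token "i" disappears, so the bigram "yesterday i" could never be counted.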

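from sklearn.feature_extraction.text import CountVectorizer
import itertools

corpus = ["hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"]
unigrams = ["yesterday", "other", "I", "side"]
# Candidate bigrams: every pair of vocabulary words in list order, lowercased
# (itertools.permutations would also cover the reversed order of each pair)
bi_grams = [' '.join(bi_gram).lower() for bi_gram in itertools.combinations(unigrams, 2)]
# token_pattern is relaxed so that one-character tokens such as "I" are kept
vectorizer = CountVectorizer(vocabulary=bi_grams, ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print([word for count, word in zip(X.sum(0).tolist()[0], vectorizer.get_feature_names_out()) if count])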

Output:

['yesterday i', 'other side']

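This approach works better when you have many documents and only a few words in the vocabulary. If it is the other way around, you can first find all the bigrams in the document and then filter them against your vocabulary.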
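The alternative of extracting all bigrams first and filtering afterwards can be sketched with the standard library alone. The function name bigrams_from_vocabulary and the simple \w+ tokenization are assumptions made for illustration, not part of the answer.

```python
import re


def bigrams_from_vocabulary(text, vocabulary):
    """Return adjacent word pairs in which both words belong to the vocabulary."""
    vocab = {word.lower() for word in vocabulary}
    # Lowercase and keep runs of word characters, including single letters
    tokens = re.findall(r"\w+", text.lower())
    # Pair each token with its successor, then keep pairs fully inside the vocabulary
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if a in vocab and b in vocab]


text = ("hello, yesterday I have seen a man walking. On the other side "
        "there was another man yelling: who are you, man?")

result = bigrams_from_vocabulary(text, ["yesterday", "other", "I", "side"])
print(result)  # [('yesterday', 'i'), ('other', 'side')]
```

Because punctuation is discarded, this sketch also pairs words across sentence boundaries; split the text into sentences first if that matters.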