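From a document, I want to generate all the n-grams that contain a given word.
Example:
document: i am 50 years old, my son is 20 years old
word: years
n: 2
Output:
[(50, years), (years, old), (20, years), (years, old)]
I know we can generate all possible n-grams and then keep only the ones that contain the word, but I would like to know whether there is a more efficient way to do this. I plan to use PySpark to generate them.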
Answer 0 (score: 0)
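from nltk.util import ngrams

DOC = 'i am 50 years old, my son is 20 years old'

def ngram_filter(doc, word, n):
    # Whitespace tokenization; punctuation stays attached (e.g. 'old,').
    tokens = doc.split()
    # Lazily generate every n-gram, then keep only those containing the word.
    all_ngrams = ngrams(tokens, n)
    filtered_ngrams = [x for x in all_ngrams if word in x]
    return filtered_ngrams

ngram_filter(DOC, 'years', 2)
# [('50', 'years'), ('years', 'old,'), ('20', 'years'), ('years', 'old')]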
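
The answer above still builds every n-gram before filtering. As a rough sketch of the more direct approach the question asks about, one could locate the positions of the word first and slice out only the windows that cover them. The names `tokenize` and `ngrams_containing` are hypothetical helpers introduced for illustration, and the regex tokenization (which drops punctuation) is an assumption, not something the question specifies.

```python
import re


def tokenize(doc):
    # Assumed tokenization: runs of word characters, so 'old,' becomes 'old'.
    return re.findall(r"\w+", doc)


def ngrams_containing(tokens, word, n):
    """Return the n-grams (as tuples) that contain `word`, ordered by start index.

    Only windows overlapping an occurrence of `word` are built; each window
    is emitted once even if it contains the word several times.
    """
    result = []
    last_start = -1  # start index of the last window already emitted
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        # Windows of length n that cover position i start at i-n+1 .. i,
        # clipped to the token list and to windows not yet emitted.
        lo = max(i - n + 1, last_start + 1, 0)
        hi = min(i, len(tokens) - n)
        for start in range(lo, hi + 1):
            result.append(tuple(tokens[start:start + n]))
        last_start = max(last_start, hi)
    return result


DOC = 'i am 50 years old, my son is 20 years old'
print(ngrams_containing(tokenize(DOC), 'years', 2))
# [('50', 'years'), ('years', 'old'), ('20', 'years'), ('years', 'old')]
```

This still scans the token list once, so the saving is in the number of n-grams materialized: at most n per occurrence of the word rather than one per token. Since it is a plain Python function, it could presumably be applied per document in PySpark (for example through `rdd.flatMap`), though that is not tested here.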