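I am trying to generate bigrams with gensim, but gensim relies on the concept of collocations, which is based mainly on how often certain words co-occur as phrases.
I am simply looking for plain bigrams, i.e. every pair of adjacent tokens, as in the following example (tokens on the first line, desired bigrams on the second).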
"I", "read", "a", "book", "about", "the", "history", "of", "America"
"I read", "read a", "a book", "book about", "about the", "the history", "history of", "of America"
Reference code that can be used as a starting point:
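from gensim.test.utils import datapath
from gensim.models.word2vec import Text8Corpus
from gensim.models.phrases import Phrases, Phraser

# Load the small test corpus bundled with gensim
sentences = Text8Corpus(datapath('testcorpus.txt'))
# Train the collocation (phrase) model
phrases = Phrases(sentences, min_count=1, threshold=1)
# Apply the model to a tokenized sentence; only detected collocations are joined
print(phrases[['trees', 'graph', 'minors']])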
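By contrast, the output I am after does not depend on co-occurrence statistics at all. To make that concrete, here is a minimal sketch in plain Python (no gensim) that produces exactly the bigram list from my example. The helper name `plain_bigrams` is hypothetical, chosen only for illustration; it is not part of gensim or any other library.

```python
def plain_bigrams(tokens):
    """Return every pair of adjacent tokens, joined by a single space."""
    # zip pairs each token with its successor; the last token has no
    # successor, so a list of n tokens yields n - 1 bigrams.
    return [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]


tokens = ["I", "read", "a", "book", "about", "the", "history", "of", "America"]
bigrams = plain_bigrams(tokens)
print(bigrams)
```

Every adjacent pair is kept regardless of how often it occurs, which is the difference from the `Phrases` model above: that one only joins pairs whose score passes `threshold`.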