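I am currently trying to create bigrams and trigrams to rebuild my corpus, going from single words to words and phrases, using this Notebook as my reference. However, the phrases I think the code should produce are not being formed.
Below is the code I am using: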
# imports assumed from the reference notebook
import codecs
import itertools as it
import os

from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

unigram_sentences = LineSentence("*.csv")

for unigram_sentence in it.islice(unigram_sentences, 1, 5):
    print(u' '.join(unigram_sentence))
    print(u'')

intermediate_directory = os.path.join('.../2015/TEMP')
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

%%time
bigram_model = Phrases(unigram_sentences)
bigram_model.save(bigram_model_filepath)

# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

bigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'bigram_sentences_all.txt')

%%time
with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for unigram_sentence in unigram_sentences:
        bigram_sentence = u' '.join(bigram_model[unigram_sentence])
        f.write(bigram_sentence + '\n')

bigram_sentences = LineSentence(bigram_sentences_filepath)

for bigram_sentence in it.islice(bigram_sentences, 1, 5):
    print(u' '.join(bigram_sentence))
    print(u'')
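My actual input (the unigram sentences) is:
while my output (the bigram sentences) is:
Although the code does join phrases such as bbc_news and the_rise, what I really expected was to see mental_health joined together.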
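One thing I wondered about is whether mental_health simply scores too low to be joined. As far as I understand it, Phrases only merges a pair when its score exceeds `threshold`, and the default scorer also subtracts `min_count` from the pair count. Below is a minimal pure-Python sketch of that rule, assuming the default formula and the defaults `min_count=5` and `threshold=10.0`. The function names and all the counts are hypothetical and are not taken from my corpus.

```python
def default_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Bigram score as I understand the default scorer:
    (count(a b) - min_count) * vocab_size / (count(a) * count(b))."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


def would_merge(count_a, count_b, count_ab, vocab_size,
                min_count=5, threshold=10.0):
    """True if the pair's score is above the threshold (assumed merge rule)."""
    score = default_score(count_a, count_b, count_ab, vocab_size, min_count)
    return score > threshold


# Hypothetical counts for illustration only.
# A pair seen just 6 times scores low once min_count is subtracted.
rare_pair = would_merge(count_a=40, count_b=60, count_ab=6, vocab_size=5000)

# The same words co-occurring 30 times score well above the threshold.
frequent_pair = would_merge(count_a=40, count_b=60, count_ab=30, vocab_size=5000)

# Lowering min_count lets the rare pair through in this made-up case.
rare_pair_low_min_count = would_merge(count_a=40, count_b=60, count_ab=6,
                                      vocab_size=5000, min_count=1)

print(rare_pair, frequent_pair, rare_pair_low_min_count)
```

If this is roughly right, then a phrase that is rare in my corpus, or whose words are individually very common, might never be joined with the default settings.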
Question: What am I doing wrong? :/
Thanks for your help, and apologies for a messy first-time post!
Alina