Question

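I want to use gensim to learn bigrams from a corpus and then print only the bigrams that were learned. I have not seen an example that does this. Any help is appreciated.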

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream)

# How can I print all the bigrams learned, and only the bigrams, including "new_york" and "human computer"?

Answer 1

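The OP's answer works if you train the model with the Phrases class as above and print the bigrams without persisting the model. It does not work once you save the model and load it again later. To load a saved model you need the Phraser class, as follows: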

from gensim.models.phrases import Phraser

Then load the model:

bigram_model = Phraser.load('../../whatever_bigram_model')

Now, if you try the approach from the OP's answer on the loaded model, that is:

OP's answer

import operator
sorted(
    {k:v for k,v in bigram_model.vocab.items() if b'_' in k if v>=bigram_model.min_count}.items(),
    key=operator.itemgetter(1),
    reverse=True)

you will get an error:

AttributeError: 'Phraser' object has no attribute 'vocab'

Solution

The way around this is:

for bigram in bigram_model.phrasegrams.keys():
    print(bigram)

Output:

(b'word1', b'word2')
(b'word3', b'word4')
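
The keys above are printed as tuples of bytes. Depending on the installed gensim version, keys may instead be plain byte strings or `str` (an assumption worth checking against your own install). A minimal sketch of a helper that turns any of these shapes into readable text; `phrase_to_text` is a hypothetical name, not part of gensim:

```python
def phrase_to_text(key, delimiter="_"):
    """Return a phrase key as a plain string such as 'new_york'."""
    # A tuple key such as (b'new', b'york') holds one entry per word.
    parts = key if isinstance(key, tuple) else (key,)
    # Decode bytes parts; leave str parts untouched.
    words = [p.decode("utf-8") if isinstance(p, bytes) else p for p in parts]
    return delimiter.join(words)


# Hypothetical keys in the three shapes discussed here.
print(phrase_to_text((b"new", b"york")))   # new_york
print(phrase_to_text(b"human_computer"))   # human_computer
print(phrase_to_text("machine_learning"))  # machine_learning
```

With a loaded model you could then write `print(phrase_to_text(bigram))` inside the loop over `bigram_model.phrasegrams`.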

This solution works in both cases, for persisted and non-persisted models. For the example given by the OP, a modified version of my solution is:

for ngrams, _ in bigram.vocab.items():
    unicode_ngrams = ngrams.decode('utf-8')
    if '_' in unicode_ngrams:
        print(unicode_ngrams)

This gives:

the_mayor
mayor_of
of_new
new_york
york_was
was_there
human_computer
computer_interaction
interaction_and
and_machine
machine_learning
learning_has
has_now
now_become

The output contains more entries, but I have truncated it to keep the answer short.
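
Note that the listing above includes pairs such as of_new and york_was, so filtering the vocabulary on "_" appears to list every adjacent pair that was counted, not only the pairs the model would actually join. A rough, gensim-free sketch of the difference follows. The corpus, `min_count`, `threshold` and the scoring rule are simplified assumptions for illustration, not gensim's exact implementation:

```python
from collections import Counter

# Tiny hypothetical corpus, tokenized on whitespace.
documents = [
    "new york mayor was present",
    "the mayor of new york was there",
    "new york is large",
]
sentences = [doc.split() for doc in documents]

# Count single words and adjacent word pairs.
unigrams = Counter(w for s in sentences for w in s)
pairs = Counter(pair for s in sentences for pair in zip(s, s[1:]))


def score(a, b, min_count):
    # Assumed rule: (count(a, b) - min_count) * number of distinct words
    # divided by count(a) * count(b).
    return (pairs[(a, b)] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])


min_count, threshold = 1, 2.0  # hypothetical settings for this tiny corpus
learned = sorted("_".join(p) for p in pairs if score(*p, min_count) > threshold)

print(len(pairs), "adjacent pairs counted")
print(learned)
```

Here every adjacent pair is counted, but only new_york scores above the threshold. If you want only the pairs that pass the threshold, check whether your gensim version offers a way to export the scored phrases rather than reading the raw counts.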

I hope my answer makes things clearer.

Answer 2

import operator
sorted(
    {k:v for k,v in bigram.vocab.items() if b'_' in k if v>=bigram.min_count}.items(),
    key=operator.itemgetter(1),
    reverse=True)
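
For readers unsure what the expression above does: it keeps the vocabulary entries whose key contains the delimiter and whose count is at least `min_count`, then sorts them by count, highest first. A small sketch with a plain dict standing in for `bigram.vocab`; the counts and the `min_count` value are made up for illustration:

```python
import operator

# Hypothetical counts standing in for bigram.vocab; real values come from training.
vocab = {b"new": 3, b"york": 3, b"new_york": 3, b"york_was": 1, b"human_computer": 5}
min_count = 2  # stand-in for bigram.min_count

ranked = sorted(
    # Keep only keys containing the delimiter (word pairs) with enough occurrences.
    {k: v for k, v in vocab.items() if b"_" in k and v >= min_count}.items(),
    key=operator.itemgetter(1),
    reverse=True,
)
print(ranked)  # [(b'human_computer', 5), (b'new_york', 3)]
```

One caveat: a single token that already contains an underscore would also pass the `b"_" in k` test, so this filter is only as reliable as the tokenization.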
