How to build a gensim word2vec model from a Wikipedia summary in Python

Time: 2016-11-10 12:11:38

Tags: python wikipedia gensim word2vec

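I want to extract the text of the Wikipedia summary for "machine learning" and then use that text to build a word2vec model with the gensim library.

So, first I get the Wikipedia summary for "machine learning" (using the Wikipedia API for Python):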

sentences = wikipedia.summary("machine learning")

Then I create the model:

model = gensim.models.Word2Vec(sentences, min_count=2, size=50, window=4)

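The problem is that when I print the vocabulary keys, I get a list of characters instead of a list of words. Here is the code I use to print the vocabulary keys: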

print list(model.vocab.keys())

Where am I going wrong?

I have pasted the complete code here:

import wikipedia, gensim.models
sentences = wikipedia.summary("machine learning")
model = gensim.models.Word2Vec(sentences, min_count=2, size=50, window=4)
print list(model.vocab.keys())

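1 Answer:

Answer 0: (score: 2)

You are missing the following two things:

  1. Converting the unicode summary to UTF-8
  2. Using gensim.models.word2vec.LineSentence to build the gensim input, so that Word2Vec receives an iterable of tokenized sentences rather than a raw string (which it iterates character by character)

Here is the complete Python script: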

    # libraries
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence
    import wikipedia
    
    # word2vec model parameters
    min_count = 2
    size = 50
    window = 4
    
    # getting "machine learning" summary from wikipedia
    summary = wikipedia.summary("machine learning")
    
    # Encoding the unicode summary as UTF-8 and writing it to a text file
    # (opened in binary mode because encode() returns bytes)
    text = summary.encode("UTF-8")
    with open("machine_learning.txt", "wb") as filewriter:
        filewriter.write(text)
    
    # reading machine_learning.txt file by using LineSentence
    sentences = LineSentence("machine_learning.txt")
    
    # making gensim model and training it on sentences
    model = Word2Vec(sentences, min_count = min_count, size = size, window = window)
    
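    # printing the model's vocabulary
    # (later gensim releases expose this as model.wv.vocab, and 4.x as model.wv.key_to_index)
    print(model.vocab.keys())
    
    # printing the vector for the word 'learning'
    # (later gensim releases: model.wv["learning"])
    print(model["learning"])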
    

Hope it helps!
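
To see why the original code produced characters, it may help to look at how a plain string behaves when it is iterated as if it were a list of sentences. The snippet below is only a minimal sketch using the standard library: the sample text is made up for illustration (it is not the real Wikipedia summary), and `tokenize` is a hypothetical helper with a deliberately naive regex-based sentence and word split, not something provided by gensim.

```python
import re

# Made-up stand-in for the string returned by wikipedia.summary(...)
summary = ("Machine learning is a field of study. "
           "It gives computers the ability to learn.")

# Treating the string as "an iterable of sentences": every "sentence"
# is a single character, so every "token" is a character as well.
chars = [token for sentence in summary[:7] for token in sentence]
print(chars)  # ['M', 'a', 'c', 'h', 'i', 'n', 'e']


def tokenize(text):
    """Hypothetical helper: naive split into sentences, then lowercase words."""
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in raw_sentences if s]


# A list of token lists: the shape of input that yields a word vocabulary.
sentences = tokenize(summary)
print(sentences[0])  # ['machine', 'learning', 'is', 'a', 'field', 'of', 'study']
```

A list shaped like `sentences` can be passed to `Word2Vec` directly in place of the raw string, which avoids the intermediate text file. Note that with `min_count=2`, only words that occur at least twice in the text end up in the vocabulary.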