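I want to extract data from the Wikipedia summary of the "machine learning" page and then use that data to build a word2vec model with the gensim library.
So first I fetch the Wikipedia summary for "machine learning" (using the Wikipedia API for Python):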
sentences = wikipedia.summary("machine learning")
Then I create the model:
model = gensim.models.Word2Vec(sentences, min_count=2, size=50, window=4)
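The problem is that when I print the vocabulary keys, I get a list of characters instead of a list of words. Here is the code I use to print the vocabulary keys: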
print list(model.vocab.keys())
What am I doing wrong?
Here is the complete code:
import wikipedia, gensim.models
sentences = wikipedia.summary("machine learning")
model = gensim.models.Word2Vec(sentences, min_count=2, size=50, window=4)
print list(model.vocab.keys())
Answer 0 (score: 2)
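You missed two things, both of which the script in this answer takes care of: the summary has to be converted from unicode to UTF-8 and written to a text file, and that file has to be read with LineSentence so that Word2Vec receives sentences made of words rather than one raw string.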
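The character-level vocabulary comes from how the input is consumed: Word2Vec iterates over the corpus to get sentences, and over each sentence to get tokens. A plain string therefore looks like a sequence of one-character sentences. Here is a rough pure-Python sketch of that nested iteration. The sample string and the `collect_tokens` helper are made up for illustration and are not part of gensim:

```python
# Hypothetical stand-in for the string returned by wikipedia.summary(...)
summary = "machine learning is fun"

def collect_tokens(corpus):
    """Mimic the two nested loops: corpus -> sentences -> tokens."""
    tokens = set()
    for sentence in corpus:
        for token in sentence:
            tokens.add(token)
    return tokens

# Raw string: each "sentence" is a single character, so tokens are characters
char_vocab = collect_tokens(summary)

# List of token lists: each token is a whole word
word_vocab = collect_tokens([summary.split()])

print(sorted(char_vocab))
print(sorted(word_vocab))
```

The second call is the shape Word2Vec expects: an iterable in which every item is a list of word tokens.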
Here is the complete Python script:
# libraries
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import wikipedia
# word2vec model parameters
min_count = 2
size = 50
window = 4
# getting "machine learning" summary from wikipedia
summary = wikipedia.summary("machine learning")
# Encoding the unicode summary as UTF-8 bytes and writing them to a text file
# (binary mode, so the same code works on Python 2 and Python 3)
text = summary.encode("UTF-8")
with open("machine_learning.txt", "wb") as filewriter:
    filewriter.write(text)
# reading machine_learning.txt file by using LineSentence
sentences = LineSentence("machine_learning.txt")
# making gensim model and training it on sentences
model = Word2Vec(sentences, min_count = min_count, size = size, window = window)
# printing model's vocabulary
print(model.vocab.keys())
# printing vector for 'learning' word
print(model["learning"])
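If you would rather not write a temporary file, you can also tokenize the summary in memory. The sketch below is only one way to do it: `to_sentences` is a hypothetical helper, and the regular-expression sentence split is a naive assumption, not a real sentence tokenizer.

```python
import re

def to_sentences(text):
    """Split text into sentences, then each sentence into lowercase word tokens."""
    sentences = []
    # Naive split: break after '.', '!' or '?' when followed by whitespace
    for raw in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[a-z0-9']+", raw.lower())
        if tokens:  # skip fragments that contain no word characters
            sentences.append(tokens)
    return sentences

# Made-up sample text standing in for the Wikipedia summary
sample = "Machine learning is a field of study. It builds models from data!"
sentences = to_sentences(sample)
print(sentences)
```

The resulting list of token lists should be usable in place of the LineSentence object, for example Word2Vec(to_sentences(summary), min_count = min_count, size = size, window = window). Note that LineSentence treats each line of the file as one sentence, so the two approaches may split the text differently.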
Hope it helps!