Why can't I reproduce word2vec results using gensim?

Time: 2015-07-20 19:08:05

Tags: python gensim word2vec

I cannot reproduce the word2vec results with Gensim, and some of the results make no sense. Gensim is an open-source toolkit designed to process large text collections with efficient online algorithms, and it includes a python implementation of Google's word2vec algorithm.

I am following the online tutorial but cannot reproduce its results. For (positive=['woman', 'king'], negative=['man']), the most similar words should be 'wenceslaus' and 'queen'. Instead I get 'eleonore' and 'iv'. The word most similar to 'fast' is 'slow', yet with topn=1 the single most similar word to 'fast' is 'mitsumi'.

Any insights? My code and results are below:

>>> from gensim.models import word2vec
>>> import logging
>>> logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)
>>> sentences = word2vec.Text8Corpus('/tmp/text8')
>>> model = word2vec.Word2Vec(sentences, size=200)
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)
Out[63]: [(u'eleonore', 0.5138808...), (u'iv', 0.510519325...)]
>>> model.most_similar(positive=['fast'])
Out[64]: [(u'slow', 0.48932...), (u'paced', 0.46925...), ...]
>>> model.most_similar(positive=['fast'], topn=1)
Out[65]: [(u'mitsumi', 0.48545...)]

1 Answer:

Answer (score: 2):

Your results do make sense.

word2vec has several sources of randomness (random vector initialization, multithreading, and so on), so it is not surprising that you do not get exactly the same results as the tutorial.
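If you want runs that are closer to reproducible, one option (a minimal sketch, assuming the same gensim version and the sentences corpus built above) is to fix the random seed and train with a single worker thread; exact bit-for-bit reproducibility may additionally depend on factors such as the Python hash seed:

>>> # Sketch: a fixed seed plus a single worker thread reduces
>>> # run-to-run variation (it may still not be perfectly identical).
>>> model = word2vec.Word2Vec(sentences, size=200, seed=42, workers=1)
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)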

Also, 'eleonore' is the name of a princess and 'iv' is a Roman numeral; both terms are related to the expected 'queen'. If you are skeptical of the results, try inspecting the text itself:

>>> import nltk
>>> with open('/tmp/text8', 'r') as f:
>>>     text = nltk.Text(f.read().split())
>>> text.concordance('eleonore')

Displaying 6 of 6 matches:
en the one eight year old princess eleonore of portugal whose dowry helped him
nglish historian one six five five eleonore gonzaga wife of ferdinand ii holy 
riage in one six zero three was to eleonore of hohenzollern born one five eigh
frederick duke of prussia and mary eleonore of kleve children of joachim frede
ive child of joachim frederick and eleonore of hohenzollern marie eleonore bor
and eleonore of hohenzollern marie eleonore born two two march one six zero se

However, if you are still not happy with the results, here are a few options:

  1. Try running it several times; every run produces different vectors (a quick sketch follows below). Not a very principled approach, though.
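  A minimal sketch of what that looks like (model_a and model_b are just illustrative names):

    >>> # Train twice on the same corpus; random initialization means the
    >>> # nearest-neighbour lists of model_a and model_b will usually differ.
    >>> model_a = word2vec.Word2Vec(sentences, size=200)
    >>> model_b = word2vec.Word2Vec(sentences, size=200)
    >>> model_a.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)
    >>> model_b.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)
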
  2. Try a larger topn and look beyond the top one or two similar terms. 'eleonore' or 'iv' may have close competitors such as 'queen'.

    >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=20)
    [('iii', 0.51035475730896), ('vii', 0.5096821188926697), ('frederick', 0.5058648586273193), ('son', 0.5021922588348389), ('wenceslaus', 0.500456690788269), ('eleonore', 0.49771684408187866), ('iv', 0.4948177933692932), ('henry', 0.49309787154197693), ('viii', 0.4924878478050232), ('sigismund', 0.49033164978027344), ('letsie', 0.4879177212715149), ('wladislaus', 0.4867924451828003), ('boleslaus', 0.47995251417160034), ('dagobert', 0.4767090082168579), ('corvinus', 0.476703941822052), ('abdicates', 0.47494029998779297), ('jadwiga', 0.4712049961090088), ('eldest', 0.4683353900909424), ('anjou', 0.46781229972839355), ('queen', 0.46647682785987854)]
    
  3. Try adjusting min_count, which drops infrequent, seemingly "noisy" words. (The default min_count is 5.)

    >>> model = word2vec.Word2Vec(sentences, size=200, min_count=30)
    >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=20)
    [('queen', 0.5332179665565491), ('son', 0.5205873250961304), ('daughter', 0.49179190397262573), ('henry', 0.4898293614387512), ('antipope', 0.4872135818004608), ('eldest', 0.48199930787086487), ('viii', 0.47991085052490234), ('matilda', 0.4746955633163452), ('iii', 0.4663817882537842), ('duke', 0.46338942646980286), ('jadwiga', 0.4630076289176941), ('vii', 0.45885157585144043), ('aquitaine', 0.45757925510406494), ('vasa', 0.45703941583633423), ('pretender', 0.4559580683708191), ('reigned', 0.4528595805168152), ('marries', 0.4490123391151428), ('philip', 0.44660788774490356), ('anne', 0.4405106008052826), ('princess', 0.43850386142730713)]