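Different models with gensim Word2Vec in Python

Asked: 2016-06-10 09:55:13

Tags: python nlp gensim word2vec

I am trying to apply the word2vec model implemented in the gensim library in Python. I have a list of sentences (each sentence is a list of words).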

For example, let us have:

sentences=[['first','second','third','fourth']]*n

and I build two identical models:

model = gensim.models.Word2Vec(sentences, min_count=1,size=2)
model2=gensim.models.Word2Vec(sentences, min_count=1,size=2)

I noticed that the models are sometimes the same and sometimes different, depending on the value of n.

For example, with n = 100 I get

print(model['first']==model2['first'])
True

whereas for n = 1000:

print(model['first']==model2['first'])
False

How is this possible?

Thank you very much!

1 Answer:

Answer 0: (score: 3)

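According to the gensim documentation, there is some randomization when you run Word2Vec:

seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling.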

So, if you want reproducible results, you need to set the seed:

In [1]: import gensim

In [2]: sentences=[['first','second','third','fourth']]*1000

In [3]: model1 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)

In [4]: model2 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)

In [5]: print(all(model1['first']==model2['first']))
False

In [6]: model3 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)

In [7]: model4 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)

In [11]: print(all(model3['first']==model4['first']))
True