How to manually change the vector dimensions of a word in Gensim Word2Vec

Time: 2017-10-09 13:41:43

Tags: python vector gensim word2vec vector-space

I have a Word2Vec model with a lot of word vectors. I can access a word vector like this.

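word_vectors = gensim.models.Word2Vec.load(wordspace_path)
print(word_vectors['boy'])

Output

[ -5.48055351e-01   1.08748421e-01  -3.50534245e-02  -9.02988110e-03...]

Now I have a proper vector representation that I want to replace word_vectors['boy'] with.

word_vectors['boy'] = [ -7.48055351e-01   3.08748421e-01  -2.50534245e-02  -10.02988110e-03...]

But this assignment throws an error.

Is there any way or workaround to do this? That is, to manipulate word vectors manually once the model is trained? Is it possible on platforms other than Gensim?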

1 Answer:

Answer 0: (score: 9)

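Since word2vec vectors are typically only created by the iterative training process and then accessed, the gensim Word2Vec object does not support direct assignment of new values by word index.

However, as it is written in Python, all of its internal structures are fully viewable/tamperable by you, and as it is open source, you can see exactly how it implements all of its existing functionality and use that as a model for how to do new things.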

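Specifically, the raw word vectors are (in recent versions of gensim) stored in a property of the Word2Vec object called wv, and this wv property is an instance of KeyedVectors. If you examine its source code, you can see that accesses of word vectors by string key (e.g. 'boy'), including those made through the []-indexing implemented by the __getitem__() method, go through its word_vec() method. You can view the source of that method either in your local installation or on GitHub: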

https://github.com/RaRe-Technologies/gensim/blob/c2201664d5ae03af8d90fb5ff514ffa48a6f305a/gensim/models/keyedvectors.py#L265

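There you'll see that the word is actually converted to an integer index (via self.vocab[word].index), which is then used to access an internal syn0 or syn0norm array (depending on whether the user is accessing the raw or the unit-normalized vector). If you look at the other places where these are set up, or simply examine them in your own console/code (for example via word_vectors.wv.syn0), you'll see that these are numpy arrays, which do support direct assignment by index.

So, you can tamper with their values directly by integer index, like so:

word_vectors.wv.syn0[word_vectors.wv.vocab['boy'].index] = [ -7.48055351e-01 3.08748421e-01 -2.50534245e-02 -10.02988110e-03...]

After that, future accesses of word_vectors.wv['boy'] will return your updated values.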
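To make the word -> integer index -> array row path concrete, here is a minimal NumPy-only sketch. It does not use gensim: the vocab dict, the syn0 array, the word_vec() helper, and all numbers are hypothetical stand-ins that only mirror the layout described in this answer. In particular, plain integers are used where gensim's vocab entries are objects with an .index attribute.

```python
import numpy as np

# Hypothetical stand-ins (NOT gensim objects): `vocab` maps each word to a
# row index, and `syn0` holds one raw vector per row.
vocab = {"boy": 0, "girl": 1, "dog": 2}
syn0 = np.array(
    [
        [0.5, 0.1, -0.3, 0.2],
        [0.4, 0.2, -0.1, 0.3],
        [-0.2, 0.7, 0.1, 0.0],
    ],
    dtype=np.float32,
)


def word_vec(word):
    """Look a vector up as described above: word -> integer index -> row."""
    return syn0[vocab[word]]


# Made-up replacement values; the length must match the existing vector size.
new_vector = np.array([-0.7, 0.3, -0.2, -0.1], dtype=np.float32)

# NumPy arrays support item assignment, so the row can be overwritten in place.
syn0[vocab["boy"]] = new_vector

# Later lookups go through the same array and therefore see the new values.
print(word_vec("boy"))
```

Only the row for 'boy' changes; the other rows, and the word-to-index mapping, are left untouched.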

Notes:

• If you want syn0norm to be updated as well, so that it holds the proper unit-normalized vectors (as used by most_similar() and other operations), it is probably best to modify syn0 first, then discard and recompute syn0norm.

• Adding new words would require more involved object tampering, because it means growing syn0 (replacing it with a larger array) and updating the vocab dict.
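Both notes can be illustrated with the same kind of NumPy-only stand-in. This is a sketch under stated assumptions, not gensim's implementation: unit_normalize() and add_word() are hypothetical helpers, vocab is simplified to a plain word -> index dict, and the vectors are made up. In a real model, the vocab entries carry more than an index, so adding a word would involve more bookkeeping than shown here.

```python
import numpy as np

# Hypothetical stand-ins (NOT gensim objects), as in the layout described above.
vocab = {"boy": 0, "girl": 1}
syn0 = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)


def unit_normalize(matrix):
    """Return a copy of `matrix` with every row scaled to unit L2 length.

    Assumes no row is all zeros (that would divide by zero).
    """
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


# Note 1: change the raw vector first, then rebuild the normalized copy
# from scratch instead of patching a single normalized row.
syn0[vocab["boy"]] = [0.0, 2.0]
syn0norm = unit_normalize(syn0)


def add_word(word, vector, vocab, syn0):
    """Return (new_vocab, new_syn0) with `word` appended as the last row."""
    if word in vocab:
        raise ValueError(f"{word!r} is already in the vocabulary")
    row = np.asarray(vector, dtype=syn0.dtype).reshape(1, -1)
    if row.shape[1] != syn0.shape[1]:
        raise ValueError("new vector must match the existing vector size")
    new_syn0 = np.vstack([syn0, row])  # arrays cannot grow in place
    new_vocab = dict(vocab)
    new_vocab[word] = len(syn0)        # index of the freshly appended row
    return new_vocab, new_syn0


# Note 2: grow the array, update the mapping, then recompute the norms.
vocab, syn0 = add_word("dog", [6.0, 8.0], vocab, syn0)
syn0norm = unit_normalize(syn0)

print(syn0norm[vocab["dog"]])
```

Recomputing the whole normalized array after any change keeps syn0 and syn0norm consistent, at the cost of one pass over all rows.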