I have a Word2Vec model with a lot of word vectors. I can access a single word's vector like this:
import gensim

word_vectors = gensim.models.Word2Vec.load(wordspace_path)
print(word_vectors['boy'])

Output:

[ -5.48055351e-01 1.08748421e-01 -3.50534245e-02 -9.02988110e-03...]

Now I have a more suitable vector representation for this word, and I want to replace word_vectors['boy'] with it:

word_vectors['boy'] = [ -7.48055351e-01 3.08748421e-01 -2.50534245e-02 -10.02988110e-03...]

But this assignment throws an error.

Is there a fix or workaround, i.e. a way to manually manipulate the word vectors once the model is trained? Is this possible on platforms other than gensim?
Answer (score: 9)
Because word2vec word vectors are normally only created by an iterative training process and then read back, the gensim Word2Vec object does not support assigning new values directly through its word lookup.

However, since it is all Python, every internal structure is fully visible and tamperable, and because the project is open source you can see exactly how the existing functionality is implemented and use that as a model for doing new things.
Specifically, the raw word vectors are stored (in recent versions of gensim) in a property of the Word2Vec object called wv, and this wv property is an instance of KeyedVectors. If you check its source code, you can see that access to word vectors by their string keys (such as 'boy'), including the lookups done through [] indexing, is implemented by the __getitem__() method via its word_vec() method. You can view the source of that method in your local installation or on GitHub.

There you will see that the word is first converted into an integer index (via self.vocab[word].index), which is then used to access the internal syn0 or syn0norm arrays (depending on whether the raw or unit-normalized vectors are being accessed). If you look at the other places where these are set, or simply examine them in your own console/code (for example word_vectors.wv.syn0), you will see that they are numpy arrays, and numpy arrays do support direct assignment by index.
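To make that lookup path concrete, here is a minimal sketch of poking at those internals yourself. It assumes a pre-4.0 gensim install where wv.vocab and wv.syn0 still exist; the model path is a hypothetical stand-in for your own file, mirroring the question's code.

import gensim

wordspace_path = 'my_word2vec.model'  # hypothetical path to your trained model
word_vectors = gensim.models.Word2Vec.load(wordspace_path)

# wv is the KeyedVectors instance that actually holds the vectors.
print(type(word_vectors.wv))

# String key -> integer index, then index -> raw vector row in syn0.
idx = word_vectors.wv.vocab['boy'].index
print(idx)
print(word_vectors.wv.syn0[idx][:5])  # the same values word_vectors.wv['boy'] returns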
So you can tamper with the values directly via the integer index, like this:

word_vectors.wv.syn0[word_vectors.wv.vocab['boy'].index] = [ -7.48055351e-01 3.08748421e-01 -2.50534245e-02 -10.02988110e-03...]

After that, future lookups of word_vectors.wv['boy'] will return your updated value.
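Continuing from the sketch above, here is a hedged end-to-end version of that assignment, using a random vector as a stand-in for whatever replacement vector you actually have (same pre-4.0 attribute names assumed):

import numpy as np

# Build a replacement vector with the right dimensionality and dtype.
new_vec = np.random.rand(word_vectors.wv.syn0.shape[1]).astype(np.float32)

# Overwrite the stored row for 'boy' in place.
word_vectors.wv.syn0[word_vectors.wv.vocab['boy'].index] = new_vec

# Subsequent key lookups now see the new values.
print(np.allclose(word_vectors.wv['boy'], new_vec))  # True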
Notes:

• If you also want syn0norm updated, so that the proper unit-normalized vectors are used by most_similar() and other operations, it is best to modify syn0 first and then discard and recalculate syn0norm (see the sketch after these notes).

• Adding new words requires more involved object tampering, since it means growing syn0 (replacing it with a larger array) and updating the vocab dict.
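A sketch of the first note, again assuming the pre-4.0 gensim API described above: init_sims() is the method that builds syn0norm, and clearing the cached copy first makes it recompute from the edited syn0. Attribute and method names differ in newer gensim releases, so treat this as illustrative rather than definitive.

# Drop the stale unit-normalised cache, then let gensim rebuild it from syn0.
word_vectors.wv.syn0norm = None
word_vectors.wv.init_sims()

# most_similar() now reflects the edited vector for 'boy'.
print(word_vectors.wv.most_similar('boy', topn=3))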