Unable to map a particular word to its vector

Time: 2019-07-04 16:40:22

Tags: nlp word-embedding glove

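I'm having trouble mapping words to vectors with GloVe. My code seems to run fine, but there is one odd problem: trying to map one particular word, "the", to its vector representation raises an error. I don't know why this happens.

This is my code for reading the GloVe file: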

import numpy as np

def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8', errors='ignore') as f:
        words = set()
        word_to_vec_map = {}
        # Each line holds a word followed by the components of its vector.
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

        # Assign 1-based indices to the words in sorted order.
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

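As you can see, the function above returns the variable 'word_to_vec_map', which should map the words in the training set to their GloVe representations.

Here is an excerpt from the training set: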

I am proud of your achievements,2,,
Miss you so much,0,, [0]
food is life,4,,
I love you mum,0,,
Stop saying bullshit,3,,
congratulations on your acceptance,2,,
The assignment is too long ,3,,
I want to go play,1,, [3]

I seem to be able to map words using word_to_vec_map:


print(word_to_vec_map['proud'])

[-0.5918    0.27671  -0.46971  -0.54743   1.3504   -0.63907  -0.6819
  0.54207  -0.40552   0.11271   0.1564    0.21604  -0.035073 -0.30228
  0.15753  -0.10437   0.64561   1.0843    0.28788  -0.24031  -1.2893
  0.82949  -0.44547   0.11085   1.1249   -1.5474   -1.3967    0.1393
  0.23133  -0.46974   1.5829    0.87095   0.13645   0.047461 -0.37914
 -0.45608   0.033173  0.39443  -0.67186  -0.92765  -0.19048  -0.59441
 -0.046391  0.14051   0.032863  0.42813  -1.3888   -0.20055  -0.26487
  0.57981 ]

print(word_to_vec_map['much'])

[ 0.36999    0.082841   0.16883   -0.50223    0.37935    0.13343
 -0.32527   -0.17964   -0.40393    0.58149   -0.14505    0.1399
 -0.1566    -0.60951    0.62075    0.5596     0.35677    0.25654
 -0.33583   -0.82497   -0.11897    0.21829    0.27755   -0.38194
  0.54374   -1.7705    -0.74366    0.40402    0.88709   -0.021368
  3.7891     0.39953    0.51627   -0.48584   -0.052367  -0.28135
 -0.60422    0.46096    0.11491   -0.49699   -0.34498    0.38645
  0.14052    0.43843   -0.33583    0.13546   -0.12158    0.0053184
 -0.50853    0.24986  ]

print(word_to_vec_map['miss'])

[-3.2273e-01  5.6182e-01 -6.6363e-01  3.8883e-01 -4.6558e-02  2.2328e-01
 -7.5691e-01  7.0853e-01  5.5714e-01 -5.9996e-02  3.1235e-01  1.6741e-01
 -5.4568e-01 -3.8765e-01  1.2309e+00  3.4766e-01 -5.0017e-02 -4.9804e-02
 -6.6282e-01  2.2854e-01 -7.8443e-01  6.5823e-01  5.6099e-01  3.3218e-01
  5.3049e-01 -1.3611e+00 -4.9452e-01  2.7711e-01 -2.2982e-01 -1.1492e+00
  1.5028e+00  1.0916e+00 -9.8464e-02  3.9349e-04  2.5753e-01 -1.5470e-01
  2.7595e-01  6.4750e-01 -5.6537e-02 -1.3046e+00 -5.8200e-01  1.2838e-01
 -1.1416e-01 -8.0836e-01 -8.3921e-01  2.5609e-01  1.5629e-01 -9.7299e-01
  1.1130e-01  4.4500e-01]

But then:

print(word_to_vec_map['the'])

KeyError                                  Traceback (most recent call last)
<ipython-input-24-ebc9756c0cc8> in <module>
----> 1 print(word_to_vec_map['the'])

KeyError: 'the'

Does anyone know why this happens? Why can't I map this particular word?

2 answers:

Answer 0 (score: 0)

I tried the same code, and it maps "the" to a vector without any problem. Please check again.

def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8', errors='ignore') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

word_to_index, index_to_words, word_to_vec_map = read_glove_vecs(GLOVE_EMBEDDING_PATH)
word_to_vec_map['the']

array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01])
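
Since the same code works with a different copy of the file, one way to "check again" is to look for keys that are almost, but not exactly, "the". The sketch below is only an illustration: `find_key_variants` is a hypothetical helper, and `toy_map` is a small stand-in for `word_to_vec_map`. It assumes one possible cause, a byte-order mark or stray whitespace attached to a key, which may or may not be what happened in the question.

```python
def find_key_variants(mapping, word):
    """Return keys equal to `word` once a BOM, outer whitespace, and case are ignored."""
    def normalize(key):
        return key.replace("\ufeff", "").strip().lower()

    target = normalize(word)
    return [k for k in mapping if normalize(k) == target]


# Toy stand-in for word_to_vec_map; the first key carries a byte-order mark.
toy_map = {"\ufeffthe": [0.418, 0.24968], "proud": [-0.5918, 0.27671]}

print("the" in toy_map)                    # exact lookup fails for the toy data
print(find_key_variants(toy_map, "the"))   # near-matches, shown with repr escapes
```

If the helper returns an empty list for the real dictionary, the word is simply absent from the file that was loaded, which points to the file itself and not to the reading code.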

Answer 1 (score: 0)

I solved this by downloading a different file of pre-trained GloVe vectors from here: https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation#glove.6B.50d.txt
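
Before swapping files, it can help to confirm that a given copy is incomplete or damaged. The following is a rough sketch, not part of the original answer: `summarize_glove` and `probe_words` are hypothetical names, and the in-memory `sample` stands in for a real file. It counts the distinct words, collects the vector lengths it sees, and reports which very common words are missing.

```python
import io


def summarize_glove(lines, probe_words=("the", "of", "and")):
    """Count entries, collect vector lengths, and list probe words that are absent."""
    vocab = set()
    dims = set()
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        vocab.add(parts[0])
        dims.add(len(parts) - 1)
    missing = [w for w in probe_words if w not in vocab]
    return len(vocab), dims, missing


# Tiny in-memory sample; the last entry is deliberately truncated.
sample = io.StringIO("the 0.418 0.24968\nproud -0.5918 0.27671\n\nmuch 0.36999\n")
count, dims, missing = summarize_glove(sample)
print(count, dims, missing)
```

For a real file, the same function can be called on an open file handle, for example `with open(path, encoding='utf-8') as f: summarize_glove(f)`, where `path` is whatever location the file was saved to. More than one vector length, a missing common word, or a word count far below the 400K vocabulary advertised for the glove.6B files would each suggest a truncated or corrupted download.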