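I'm having trouble mapping words to vectors with GloVe. My code seems to run fine, but there is one strange problem: looking up one specific word, "the", raises an error. I don't know why this happens.
Here is my code for reading the GloVe file: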
import numpy as np

def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8', errors='ignore') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            # Each line holds a word followed by the components of its vector.
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

        # Assign 1-based indices to the words in sorted order.
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map
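As you can see, the function above returns the variable 'word_to_vec_map', which should map the words in the training set to their GloVe representations.
Here is an excerpt from the training set: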
I am proud of your achievements,2,,
Miss you so much,0,, [0]
food is life,4,,
I love you mum,0,,
Stop saying bullshit,3,,
congratulations on your acceptance,2,,
The assignment is too long ,3,,
I want to go play,1,, [3]
It seems that I can look up words with word_to_vec_map:
print(word_to_vec_map['proud'])
[-0.5918 0.27671 -0.46971 -0.54743 1.3504 -0.63907 -0.6819
0.54207 -0.40552 0.11271 0.1564 0.21604 -0.035073 -0.30228
0.15753 -0.10437 0.64561 1.0843 0.28788 -0.24031 -1.2893
0.82949 -0.44547 0.11085 1.1249 -1.5474 -1.3967 0.1393
0.23133 -0.46974 1.5829 0.87095 0.13645 0.047461 -0.37914
-0.45608 0.033173 0.39443 -0.67186 -0.92765 -0.19048 -0.59441
-0.046391 0.14051 0.032863 0.42813 -1.3888 -0.20055 -0.26487
0.57981 ]
print(word_to_vec_map['much'])
[ 0.36999 0.082841 0.16883 -0.50223 0.37935 0.13343
-0.32527 -0.17964 -0.40393 0.58149 -0.14505 0.1399
-0.1566 -0.60951 0.62075 0.5596 0.35677 0.25654
-0.33583 -0.82497 -0.11897 0.21829 0.27755 -0.38194
0.54374 -1.7705 -0.74366 0.40402 0.88709 -0.021368
3.7891 0.39953 0.51627 -0.48584 -0.052367 -0.28135
-0.60422 0.46096 0.11491 -0.49699 -0.34498 0.38645
0.14052 0.43843 -0.33583 0.13546 -0.12158 0.0053184
-0.50853 0.24986 ]
print(word_to_vec_map['miss'])
[-3.2273e-01 5.6182e-01 -6.6363e-01 3.8883e-01 -4.6558e-02 2.2328e-01
-7.5691e-01 7.0853e-01 5.5714e-01 -5.9996e-02 3.1235e-01 1.6741e-01
-5.4568e-01 -3.8765e-01 1.2309e+00 3.4766e-01 -5.0017e-02 -4.9804e-02
-6.6282e-01 2.2854e-01 -7.8443e-01 6.5823e-01 5.6099e-01 3.3218e-01
5.3049e-01 -1.3611e+00 -4.9452e-01 2.7711e-01 -2.2982e-01 -1.1492e+00
1.5028e+00 1.0916e+00 -9.8464e-02 3.9349e-04 2.5753e-01 -1.5470e-01
2.7595e-01 6.4750e-01 -5.6537e-02 -1.3046e+00 -5.8200e-01 1.2838e-01
-1.1416e-01 -8.0836e-01 -8.3921e-01 2.5609e-01 1.5629e-01 -9.7299e-01
1.1130e-01 4.4500e-01]
But then:
print(word_to_vec_map['the'])
KeyError Traceback (most recent call last)
<ipython-input-24-ebc9756c0cc8> in <module>
----> 1 print(word_to_vec_map['the'])
KeyError: 'the'
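Does anyone know why this happens? Why can't I map this particular word?
Answer 0: (score: 0)
I tried the same code, and it maps "the" to a vector without any problem. Please check again.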
def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8', errors='ignore') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map
word_to_index, index_to_words, word_to_vec_map = read_glove_vecs(GLOVE_EMBEDDING_PATH)
word_to_vec_map['the']
array([ 4.1800e-01, 2.4968e-01, -4.1242e-01, 1.2170e-01, 3.4527e-01,
-4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
2.7843e-01, -1.4767e-01, -5.5677e-01, 1.4658e-01, -9.5095e-03,
1.1658e-02, 1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
-1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
-1.8823e+00, -7.6746e-01, 9.9051e-02, -4.2125e-01, -1.9526e-01,
4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01, 5.9213e-04,
7.4449e-03, 1.7778e-01, -1.5897e-01, 1.2041e-02, -5.4223e-02,
-2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
1.8785e-01, 2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01])
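If the lookup still fails on your side, it may help to check whether the word is stored under a slightly different key. The sketch below is only a diagnostic idea: `find_near_matches` is a hypothetical helper, not part of your code or of any library. It looks for keys that become equal to the target once a byte order mark (BOM) is removed and case is ignored. One possible cause, which I can't confirm from the question, is a BOM at the start of the file: in the standard glove.6B files "the" is the first entry, so a BOM would end up attached to exactly that word.

```python
def find_near_matches(vocab, target):
    """Return keys that differ from `target` only by a BOM, whitespace, or case.

    `vocab` is any mapping or iterable of string keys, for example the
    word_to_vec_map returned by read_glove_vecs. Hypothetical helper.
    """
    cleaned_target = target.strip().lower()
    matches = []
    for key in vocab:
        # str.strip() does not remove U+FEFF, so drop it explicitly.
        cleaned = key.replace('\ufeff', '').strip().lower()
        if cleaned == cleaned_target and key != target:
            matches.append(key)
    return matches


# Toy dictionary standing in for a map whose first key carries a BOM.
toy_map = {'\ufeffthe': [0.418, 0.24968], 'proud': [-0.5918, 0.27671]}
print([repr(k) for k in find_near_matches(toy_map, 'the')])
```

If this reports a key such as '\ufeffthe', opening the file with encoding='utf-8-sig' instead of 'utf-8' strips the BOM while reading. If it reports nothing, the word is probably missing from the file itself.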
Answer 1: (score: 0)
I solved this by downloading a different file of pretrained GloVe vectors from here: https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation#glove.6B.50d.txt
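For anyone who runs into the same thing, a quick sanity check on the downloaded file can confirm that it is complete before loading it. This is a minimal sketch, not something I ran as part of the fix: `check_glove_file` is a hypothetical helper, and it assumes the file uses the same whitespace-separated "word v1 v2 ..." layout that the reader function in the question expects.

```python
def check_glove_file(path, required_words=("the",)):
    """Return (dims_seen, missing_words) for a GloVe-style text file.

    dims_seen is the set of vector lengths found in the file; a healthy
    file should give exactly one value. missing_words lists the entries
    of required_words that never appear as the first token of a line.
    """
    dims_seen = set()
    found = set()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            dims_seen.add(len(parts) - 1)
            if parts[0] in required_words:
                found.add(parts[0])
    missing = [w for w in required_words if w not in found]
    return dims_seen, missing
```

Calling check_glove_file(path, required_words=("the", "proud", "miss")) on the new file should return a single dimension and an empty list of missing words. More than one dimension or a non-empty list suggests a truncated or altered download.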