Loading pre-trained GloVe vectors in Python

Asked: 2016-06-13 15:01:18

Tags: python-2.7 vector nlp

I downloaded a pre-trained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word-vector binary file with gensim, but I don't know what to do when the file is in text format.

Thanks in advance.

10 Answers:

Answer 0: (score: 51)

The GloVe model file is in a word-vector format. You can open the text file to verify this. Here is a small snippet of code you can use to load a pre-trained GloVe file:

import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    with open(gloveFile, 'r') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]                                           # first token is the word
            embedding = np.array([float(val) for val in splitLine[1:]])   # remaining tokens are the vector
            model[word] = embedding
    print("Done.", len(model), " words loaded!")
    return model

Then you can simply use the model variable to access the word vectors.

print(model['hello'])
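
For example, with the dictionary returned above you can compute the cosine similarity between two word vectors yourself (the words and file path below are only illustrative):

import numpy as np

model = loadGloveModel("glove.6B.50d.txt")   # path is an example
a, b = model['king'], model['queen']
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)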

Answer 1: (score: 37)

This can be done much faster using pandas:

import csv
import numpy as np
import pandas as pd

# quoting=csv.QUOTE_NONE keeps tokens that contain quote characters intact
words = pd.read_table(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

Then to get a word's vector:

def vec(w):
    return words.loc[w].to_numpy()   # .as_matrix() from the original answer was removed in newer pandas

And to find the word closest to a given vector:

words_matrix = words.to_numpy()

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)   # squared Euclidean distance to every word
    i = np.argmin(delta)
    return words.iloc[i].name
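
A quick usage sketch (the example words and the analogy arithmetic are not part of the original answer; numpy is imported above):

print(vec('king')[:5])                                             # first few components of a vector
print(find_closest_word(vec('king') - vec('man') + vec('woman')))  # nearest word to an analogy vector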

Answer 2: (score: 23)

I suggest using gensim for everything. You can read the file with it, and you will also benefit from the many methods already implemented in this excellent package.

Suppose you generated the GloVe vectors with the C++ program and that your "-save-file" parameter is "vectors". The GloVe executable will then produce two files, "vectors.bin" and "vectors.txt".

Use glove2word2vec to convert the GloVe vectors in text format into the word2vec text format:

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="vectors.txt", word2vec_output_file="gensim_glove_vectors.txt")

Finally, read the word2vec txt into a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

Now you can use the gensim word2vec methods (for example, similarity) as you normally would.
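
For example (these are standard gensim KeyedVectors methods; the words are only illustrative):

print(glove_model.most_similar("frog", topn=5))   # nearest neighbours
print(glove_model.similarity("woman", "man"))     # cosine similarity
print(glove_model["hello"])                       # raw vector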

Answer 3: (score: 4)

If all you want is the embedding matrix, here is a one-liner:

np.loadtxt(path, usecols=range(1, dim+1), comments=None)

where path is the path of the downloaded GloVe file and dim is the dimensionality of the word embeddings.

If you want both the words and the corresponding vectors, you can do

glove = np.loadtxt(path, dtype='str', comments=None)

and then separate the words and the vectors as follows:

words = glove[:, 0]
vectors = glove[:, 1:].astype('float')
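
If you then want dictionary-style lookups, a minimal sketch (not part of the original answer) is:

glove_dict = dict(zip(words, vectors))
print(glove_dict['the'].shape)   # e.g. (50,) for glove.6B.50d.txt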

Answer 4: (score: 1)

I found this approach to be faster.

import pandas as pd

df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in df.T.items()}

To save the dictionary:

import pickle
with open('glove.840B.300d.pkl', 'wb') as fp:
    pickle.dump(glove, fp)
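
To load the pickled dictionary back later (a small addition, assuming the same file name):

import pickle
with open('glove.840B.300d.pkl', 'rb') as fp:
    glove = pickle.load(fp)
print(glove['hello'][:5])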

Answer 5: (score: 1)

A Python 3 version that also handles bigrams and trigrams:

import numpy as np


def load_glove_model(glove_file):
    print("Loading Glove Model")
    model = {}
    vector_size = 300   # must match the dimensionality of the file you downloaded
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            # everything except the last `vector_size` tokens forms the (possibly multi-word) key
            word = " ".join(split_line[0:len(split_line) - vector_size])
            embedding = np.array([float(val) for val in split_line[-vector_size:]])
            model[word] = embedding
    print("Done.\n" + str(len(model)) + " words loaded!")
    return model
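
Example usage, assuming a 300-dimensional file (the path and the multi-word key are only illustrative; multi-word keys exist only if the file actually contains n-gram entries):

model = load_glove_model("glove.840B.300d.txt")
print(model["hello"])
print(model.get("new york"))   # multi-word key, present only in n-gram files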

Answer 6: (score: 1)

Loading the word embeddings from a text file (the glove.42B.300d embeddings in my case) takes quite a while (147.2s on my machine).

It helps to first convert the text file into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file that contains the embedding vectors as a numpy structure (e.g. embeddings.npy).

After the conversion, loading the same embeddings into memory takes only 4.96s. This approach ends up with exactly the same dictionary as loading from the text file: access is just as efficient and no additional framework is required, but the loading time is much faster.

With this code you can convert the embedding text file into the two new files:

import codecs
import numpy as np

def convert_to_binary(embedding_path):
    wv = []
    with codecs.open(embedding_path + ".txt", 'r', encoding='utf-8') as f, \
         codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
        for line in f:
            splitlines = line.split()
            vocab_write.write(splitlines[0].strip())                 # write the word to the .vocab file
            vocab_write.write("\n")
            wv.append([float(val) for val in splitlines[1:]])        # collect the vector

    np.save(embedding_path + ".npy", np.array(wv))

With this approach you can then load the embeddings into memory efficiently:

def load_word_emb_binary(embedding_file_name_w_o_suffix):
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))

    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]

    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]

    return word_embedding_map
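
A usage sketch (the path prefix is illustrative): run the conversion once, then load the binary version on subsequent runs.

convert_to_binary("glove.42B.300d")              # expects glove.42B.300d.txt to exist
glove = load_word_emb_binary("glove.42B.300d")   # reads the .vocab and .npy files
print(len(glove), glove["hello"].shape)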

Disclaimer: this code was shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it may help to answer the question.

Answer 7: (score: 0)

import os
import numpy as np

EMBEDDING_DIM = 100   # set to the dimensionality of the file you downloaded (50, 100, 200 or 300)

# store all the pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
# enter the path where you unzipped the glove file;
# it is just a space-separated text file in the format:
#   word vec[0] vec[1] vec[2] ...
with open(os.path.join('glove/glove.6B.%sd.txt' % EMBEDDING_DIM)) as f:
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))

Answer 8: (score: 0)

This code takes a while to store the GloVe embeddings in a shelf, but loading them afterwards is much faster than with the other approaches.

import os
import numpy as np
from contextlib import closing
import shelve

def store_glove_to_shelf(glove_file_path,shelf):
    print('Loading Glove')
    with open(os.path.join(glove_file_path)) as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            shelf[word] = vec

shelf_file_name = "glove_embeddings"
glove_file_path = "glove/glove.840B.300d.txt"

# Storing glove embeddings to shelf for faster load
with closing(shelve.open(shelf_file_name + '.shelf', 'c')) as shelf:
    store_glove_to_shelf(glove_file_path,shelf)
    print("Stored glove embeddings from {} to {}".format(glove_file_path,shelf_file_name+'.shelf'))

# To reuse the glove embeddings stored in shelf
with closing(shelve.open(shelf_file_name + '.shelf')) as embeddings_index:
    # USE embeddings_index here , which is a dictionary
    print("Loaded glove embeddings from {}".format(shelf_file_name+'.shelf'))
    print("Found glove embeddings with {} words".format(len(embeddings_index)))

Answer 9: (score: -1)

import numpy as np

EMBEDDING_FILE = 'path/to/your/glove.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))

all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# word_index, max_features and embed_size are assumed to come from your Keras tokenizer setup
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

# initialize unknown words with the same mean/std as the pre-trained vectors
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
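
A minimal sketch of feeding the resulting matrix into a Keras Embedding layer (this assumes the tensorflow.keras API and is not part of the original answer):

from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

embedding_layer = Embedding(nb_words, embed_size,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)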