Question

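I am trying to use Gensim to load the pre-trained GoogleNews model for some English words (the 15 sample words here are just stored in a txt file, one per line, with no further context as a corpus). I could then use "model.most_similar()" to get similar words/phrases for them. But the file loaded via the Python pickle method cannot be used directly with Gensim's built-in model.load() and model.most_similar() functions.

Since I cannot train, save, and load a model from scratch, how should I cluster the 15 English words (there will be more later)?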

import gensim
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

GOOGLE_WORD2VEC_MODEL = '../GoogleNews-vectors-negative300.bin'

GOOGLE_ENGLISH_WORD_PATH = '../testwords.txt'

GOOGLE_WORD_FEATURE = '../word.google.vector'

model = gensim.models.KeyedVectors.load_word2vec_format(GOOGLE_WORD2VEC_MODEL, binary=True) 

word_vectors = {}

#load 15 words as a test to word_vectors

with open(GOOGLE_ENGLISH_WORD_PATH) as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip('\n')
        if line:                
            word = line
            print(line)
            word_vectors[word]=None
try:
    import cPickle
except :
    import _pickle as cPickle

def save_model(clf,modelpath): 
    with open(modelpath, 'wb') as f: 
        cPickle.dump(clf, f) 

def load_model(modelpath): 
    try: 
        with open(modelpath, 'rb') as f: 
            rf = cPickle.load(f) 
            return rf 
    except Exception as e:        
        return None 

for word in word_vectors:
    try:
        v= model[word]
        word_vectors[word] = v
    except:
        pass

save_model(word_vectors,GOOGLE_WORD_FEATURE)

words_set = load_model(GOOGLE_WORD_FEATURE)

words_set.most_similar("knit", topn=3)

---------------error message--------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-86c15e366696> in <module>
----> 1 words_set.most_similar("knit", topn=3)

AttributeError: 'dict' object has no attribute 'most_similar'
---------------error message--------

Answer 1

You have defined word_vectors as a Python dict:

word_vectors = {}

Then your save_model() function just saves that raw dict, and your load_model() loads the same raw dict.

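Such dict objects do not implement the most_similar() method, which is specific to gensim's KeyedVectors interface (and related classes).

So you will have to keep the data inside a KeyedVectors-like object to be able to use most_similar().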
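
To make concrete what is missing from a plain dict, here is a minimal sketch of a cosine-similarity ranking over a dict of vectors, which is roughly the kind of lookup most_similar() performs. The helper names cosine and most_similar_in_dict and the toy 3-dimensional vectors are hypothetical, made up for illustration; this is not Gensim's implementation, and it assumes no vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_in_dict(vectors, word, topn=3):
    """Rank the other words in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word and vec is not None  # skip words that have no vector
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for this example (real GoogleNews vectors have 300 dimensions).
toy_vectors = {
    "knit": [1.0, 0.0, 0.0],
    "sew": [0.9, 0.1, 0.0],
    "car": [0.0, 0.0, 1.0],
    "missing": None,  # mimics a word that was not found in the model
}

# A dict has no such method, which is exactly the AttributeError in the question.
print(hasattr(toy_vectors, "most_similar"))

ranked = most_similar_in_dict(toy_vectors, "knit", topn=2)
print(ranked)
```

A loop like this only searches the words stored in the dict, and it will be slow on large vocabularies, so keeping the vectors in a KeyedVectors object is usually the better choice.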

Fortunately, you have a few options.

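If you happen to need just the first 15 words in the GoogleNews file (or the first 15,000, etc.), you can use the optional limit parameter to read only that many vectors:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(GOOGLE_WORD2VEC_MODEL, limit=15, binary=True)

Alternatively, if you really need to select an arbitrary subset of the words and assemble them into a new KeyedVectors instance, you can reuse one of the classes in gensim instead of a plain dict, then add your vectors in a slightly different way:

# instead of a {} dict
word_vectors = KeyedVectors(model.vector_size)  # re-use size from loaded model

...then, inside the loop over each word you want to add...

# instead of `word_vectors[word] = _SOMETHING_`
word_vectors.add(word, model[word])

Then you will have a word_vectors that is an actual KeyedVectors object. While you could save it via plain Python pickle, at that point you might as well use the KeyedVectors built-in save() and load(), which may be more efficient on large vector sets (by saving large sets of raw vectors as a separate file, which should be kept alongside the main file). For example:

word_vectors.save(GOOGLE_WORD_FEATURE)

...

words_set = KeyedVectors.load(GOOGLE_WORD_FEATURE)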

Gensim built-in model.load function and Python Pickle.load file
