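I train a word2vec model on a corpus and then query the model.
This works fine, but I am running an experiment in which I need to call the model under different conditions, save the model for each condition, query each condition's model, and then save the query output to, say, a csv file for further analysis across all conditions.
I have studied the gensim documentation and searched around, but I can't figure out how to do this.
I asked the gensim folks, and they said that because what most_similar returns is a Python object, I can save it with pickle or write it out as txt, csv, or whatever format I want.
That sounds great, but I don't know how to get started. Here is my code. Could you help me "fill in the blanks", even with something simple that I can research further and extend myself?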
#train the model
trained_model = gensim.models.Word2Vec(some hyperparameters)
#save the model in the format that is appropriate for querying by writing it to disk and call it stored_model
trained_model.save(some_filename)
#read in the stored model from disk and call it retrieved_model
retrieved_model = gensim.models.Word2Vec.load(some_filename)
#query the retrieved model
#each of these queries produces a list of 10 (word, cosine similarity) tuples
retrieved_model.wv.most_similar(positive=['smartthings', 'amazon'], negative=['samsung'])
retrieved_model.wv.most_similar(positive=['light', 'nest'], negative=['hue'])
retrieved_model.wv.most_similar(positive=['shopping', 'new_york_times'], negative=['ebay'])
.
.
.
#store the results of all these queries in a csv so they can be analyzed.
?
Answer 0 (score: 1)
As mentioned in my comment, you can save and load a model object like this:
# Save model
filename = 'stored_model.wv' # Can be any arbitrary filename
trained_model.save(filename)
# Reload model
retrieved_model = gensim.models.Word2Vec.load(filename)
To run multiple queries, I suggest defining a list of queries and iterating over it to collect all the results.
# Define queries (this is the only user input required!)
my_queries = [{'positive' : ['smartthings','amazon'],
               'negative' : ['samsung']},
              {'positive' : ['light','nest'],
               'negative' : ['hue']},
              #<and so forth...>
              ]
# Initialize empty result list
query_results = []
# Collect query results
for query in my_queries:
    result = retrieved_model.wv.most_similar(**query)
    query_results.append(result)
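Since the experiment involves several conditions, the same loop can be wrapped around one model per condition. The following is only a minimal sketch under assumptions: `collect_results`, `vectors_by_condition`, and the condition names are hypothetical, and `FakeKeyedVectors` is a made-up stand-in for a loaded model's `.wv` so the example runs without gensim. It only mimics the `positive`, `negative`, and `topn` arguments of `most_similar`.

```python
class FakeKeyedVectors:
    """Hypothetical stand-in for a loaded model's .wv (not part of gensim)."""

    def __init__(self, tag):
        self.tag = tag

    def most_similar(self, positive=None, negative=None, topn=10):
        # Returns made-up (word, similarity) pairs shaped like gensim's output.
        return [(f"{self.tag}_word{i}", round(1.0 - 0.1 * i, 2)) for i in range(topn)]


def collect_results(vectors_by_condition, queries, topn=10):
    """Run every query against every condition.

    Returns a list of (condition, query, results) triples.
    """
    collected = []
    for condition, keyed_vectors in vectors_by_condition.items():
        for query in queries:
            results = keyed_vectors.most_similar(topn=topn, **query)
            collected.append((condition, query, results))
    return collected


my_queries = [
    {"positive": ["smartthings", "amazon"], "negative": ["samsung"]},
    {"positive": ["light", "nest"], "negative": ["hue"]},
]

# With real models, each value would instead be the .wv attribute of a model
# loaded via gensim.models.Word2Vec.load(...) for that condition.
vectors_by_condition = {
    "cond_a": FakeKeyedVectors("a"),
    "cond_b": FakeKeyedVectors("b"),
}

collected = collect_results(vectors_by_condition, my_queries, topn=3)
for condition, query, results in collected:
    print(condition, query["positive"], results)
```

Keeping the condition and the query next to each result means nothing has to be matched up by position later when the output is written to a file.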
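Finally, you can use the result list to write a csv file in whatever layout you like. The file's header can be constructed to represent the queries.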
# Open the file
with open("my_results.csv", "w") as outfile:
    # Construct the header
    header = []
    for query in my_queries:
        head = 'pos:'+'+'.join(query['positive'])+'__neg:'+'+'.join(query['negative'])
        # First resulting head: 'pos:smartthings+amazon__neg:samsung'
        header.append(head)
    # Write the header
    # Note the additional placeholder fields (,_,) because each head needs two columns
    outfile.write(",_,".join(header)+",_\n")
    # Write the second row to label the columns
    outfile.write(",".join(["word,cos_sim" for i in range(len(header))])+'\n')
    # Write the data: row i holds the i-th best match of every query
    for i in range(len(query_results[0])):
        row_results = [r[i][0]+','+str(r[i][1]) for r in query_results]
        outfile.write(",".join(row_results)+'\n')
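Note that this only works if every query retrieves the same number of items (which is the case by default, but can be changed with the topn keyword argument of most_similar).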
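If the queries may return different numbers of items, one way around that restriction is a "long" layout with one row per (query, rank, word) instead of two columns per query. This is a sketch of that alternative, not the layout used above: `write_long_csv` is a hypothetical helper, and the demo words and similarity values are made up to mimic the shape of `most_similar` output.

```python
import csv
import io


def write_long_csv(queries, results_per_query, outfile):
    """Write one row per (query, rank, word, cos_sim) to an open text file."""
    writer = csv.writer(outfile)
    writer.writerow(["query", "rank", "word", "cos_sim"])
    for query, results in zip(queries, results_per_query):
        label = ("pos:" + "+".join(query.get("positive", []))
                 + "__neg:" + "+".join(query.get("negative", [])))
        for rank, (word, sim) in enumerate(results, start=1):
            writer.writerow([label, rank, word, sim])


# Made-up data shaped like most_similar output; note the unequal lengths.
demo_queries = [
    {"positive": ["smartthings", "amazon"], "negative": ["samsung"]},
    {"positive": ["light", "nest"], "negative": ["hue"]},
]
demo_results = [
    [("alexa", 0.81), ("echo", 0.78)],
    [("thermostat", 0.74)],
]

# An in-memory buffer keeps the example self-contained. For a real file, use
# open("my_results.csv", "w", newline="") as recommended by the csv module.
buffer = io.StringIO()
write_long_csv(demo_queries, demo_results, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The long layout is also easier to filter or group later, because every row carries its own query label.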
Answer 1 (score: 0)
A simple way to do this is as follows:
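import numpy as np
import pandas as pd

# Note: model.wv.vocab exists in gensim < 4.0; gensim 4.x replaced it with model.wv.key_to_index.
vocab, vectors = model.wv.vocab, model.wv.vectors
# Get each word and the row index of its embedding vector.
name_index = np.array([(word, entry.index) for word, entry in vocab.items()])
# Build a DataFrame from the embedding vectors and use the words as its index.
df = pd.DataFrame(vectors[name_index[:, 1].astype(int)])
df.index = name_index[:, 0]
df.to_csv("embedding.csv")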
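For gensim 4.x, where `model.wv.vocab` is no longer available, the same export can be sketched with `model.wv.index_to_key`, whose order matches the rows of `model.wv.vectors`. The helper below is hypothetical and takes plain sequences so it runs without gensim, numpy, or pandas. The toy words and vectors are made up.

```python
import csv
import io


def write_embeddings_csv(words, vectors, outfile):
    """Write one row per word: the word followed by its vector components.

    Assumes words[i] belongs to vectors[i], as is the case for
    model.wv.index_to_key and model.wv.vectors in gensim 4.x.
    """
    writer = csv.writer(outfile)
    dim = len(vectors[0]) if len(vectors) else 0
    writer.writerow(["word"] + [f"dim_{i}" for i in range(dim)])
    for word, vector in zip(words, vectors):
        writer.writerow([word] + [float(x) for x in vector])


# Toy stand-ins for model.wv.index_to_key and model.wv.vectors.
toy_words = ["light", "nest"]
toy_vectors = [[0.5, -0.25], [1.0, 0.75]]

buffer = io.StringIO()
write_embeddings_csv(toy_words, toy_vectors, buffer)
embedding_csv = buffer.getvalue()
print(embedding_csv)
```

With a real gensim 4.x model the call would look like `write_embeddings_csv(model.wv.index_to_key, model.wv.vectors, outfile)`, where `outfile` is a file opened with `newline=""`.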