I am using gensim to build a dictionary from a set of documents. Each document is a list of tokens. This is my code:
from gensim import corpora, models, similarities

def constructModel(self, docTokens):
    """Given document tokens, construct the tf-idf and similarity models."""
    # Construct the dictionary for the BOW (vector-space) model.
    # Dictionary = a mapping between words and their integer ids = collection of (word_index, word_string) pairs
    # print("dictionary")
    self.dictionary = corpora.Dictionary(docTokens)
    # Prune the dictionary: remove words that appear too infrequently or too frequently
    print("dictionary size before filter_extremes:", self.dictionary)  # len(self.dictionary.values())
    # self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
    # self.dictionary.compactify()
    print("dictionary size after filter_extremes:", self.dictionary)
    # Construct the corpus BOW vectors; BOW vector = collection of (word_id, word_frequency) pairs
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]
    # Construct the tf-idf model
    self.model = models.TfidfModel(corpus_bow, normalize=True)
    # First transform each raw BOW vector in the corpus to the tf-idf model's vector space
    corpus_tfidf = self.model[corpus_bow]
    # Construct the term-document index
    self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf)
My question is how to add a new document (tokens) to this dictionary and update it. I searched the gensim documentation but did not find a solution.
Answer 0 (score: 6)
The gensim documentation describes how to do this. The approach is to create another dictionary from the new documents and then merge the two.
from gensim import corpora
dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)
dict1.merge_with(dict2)
According to the documentation, this will map "the same tokens to the same ids and new tokens to new ids".
Answer 1 (score: 0)
You can use the add_documents method:
from gensim import corpora
text = [["aaa", "aaa"]]
dictionary = corpora.Dictionary(text)
dictionary.add_documents([['bbb','bbb']])
print(dictionary)
After running the code above, you will get this:
Dictionary(2 unique tokens: ['aaa', 'bbb'])
Read the documentation for more details.
Answer 2 (score: 0)
You can simply use the keyed vectors from gensim.models.keyedvectors. They are very easy to use.
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
w2v = WordEmbeddingsKeyedVectors(50) # 50 = vec length
w2v.add(new_words, their_new_vecs)
And if you have already built a model with gensim.models.Word2Vec, you can do the following. Suppose I want to add the token <UNK> with a random vector:
model.wv["<UNK>"] = np.random.rand(100) # 100 is the vectors length
A complete example looks like this:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8") # load dataset as iterable
model = Word2Vec(dataset)
model.wv["<UNK>"] = np.random.rand(100)