How to completely remove a word from a Word2Vec model in gensim?

Asked: 2018-02-23 05:26:07

Tags: python dictionary word2vec gensim del

Given a model, e.g.

from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [d.lower().split() for d in documents]

w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

it's possible to delete a word from the w2v vocabulary, e.g.

# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433  0.08862179  0.08601206  0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"

But when we run a similarity query on other words after deleting graph, we see the word graph popping back up, e.g.

>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]

How to completely remove a word from a Word2Vec model in gensim?

Update

In response to @vumaasha's comment:


Could you give some details on why you want to delete these words?

  • Say my universe of words is all the words in my corpus, from which I want to learn the dense relations between all words.

  • But when I want to generate the most similar words, they should come only from a subset of domain-specific words.

  • It's possible to generate more than enough results from .most_similar() and then filter the words, but say the domain-specific space is small; I might be looking for a word that ranks 1000th in similarity, which is inefficient (see the sketch after this list).

  • It would be better if the words were removed from the word vectors entirely; then .most_similar() would never return words outside the domain-specific set.
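
For illustration, here is a minimal sketch of the over-generate-then-filter workaround from the third bullet, using the w2v_model from the question; the domain_words set is a made-up example:

# Hypothetical domain-specific subset (for illustration only).
domain_words = {"graph", "trees", "binary", "unordered"}

# Over-generate candidates, then keep only the in-domain words.
candidates = w2v_model.most_similar("binary", topn=1000)
in_domain = [(w, score) for w, score in candidates if w in domain_words]

This works, but wastes effort whenever the domain-specific words rank low in the overall similarity list.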

4 Answers:

Answer 0 (score: 6)

I wrote a function that removes from a KeyedVectors all the words that are not in a predefined word list.

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    # vectors_norm is only populated by init_sims() (most_similar()
    # calls it implicitly), so make sure it exists before indexing it.
    w2v.init_sims()

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            # Re-index the kept word to its new position.
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    # Overwrite all word-related attributes in place.
    w2v.vocab = new_vocab
    w2v.vectors = new_vectors
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = new_vectors_norm

It rewrites all of the word-related attributes of the Word2VecKeyedVectors in place.

Usage:

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
 ('lager', 0.7733745574951172),
 ('Beer', 0.71753990650177),
 ('drinks', 0.668931245803833),
 ('lagers', 0.6570086479187012),
 ('Yuengling_Lager', 0.655455470085144),
 ('microbrew', 0.6534324884414673),
 ('Brooklyn_Lager', 0.6501551866531372),
 ('suds', 0.6497018337249756),
 ('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
 ('wine', 0.6217695474624634),
 ('bash', 0.20583480596542358),
 ('computer', 0.06677375733852386),
 ('python', 0.005948573350906372)]

Answer 1 (score: 2)

There is no direct way to do what you want. However, you are not completely lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors (check the link). You can take a look at this method and modify it to suit your needs.

The lines shown below perform the actual logic of computing the similar words; you need to replace the variable limited with vectors corresponding to the words you are interested in. Then you are done.

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)

Update

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]

If you look at this line, it means that when restrict_vocab is used, it restricts the search to the top n words in the vocabulary, which only makes sense if you have sorted the vocabulary by frequency (gensim's Word2Vec does this by default). If you don't pass restrict_vocab, self.vectors_norm itself becomes limited.
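
For example (a sketch, assuming a large pretrained model loaded as w2v whose vocabulary is sorted by descending frequency, as gensim does by default), the following call only searches the 10,000 most frequent words:

# Only the 10,000 most frequent vocabulary words are candidates.
w2v.most_similar('beer', restrict_vocab=10000)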

The method most_similar calls another method, init_sims. This initializes the value of self.vectors_norm, as shown below:

self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

So, you can pick up the words you are interested in, prepare their norms, and use that in place of limited. That should work.
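
A minimal sketch of that idea, assuming a gensim 3.x KeyedVectors object w2v; the function name most_similar_restricted and the domain_words argument are made up for illustration:

import numpy as np
from gensim import matutils

def most_similar_restricted(w2v, word, domain_words, topn=10):
    w2v.init_sims()  # make sure vectors_norm is populated

    # Build `limited` from the domain-specific words only.
    indices = [w2v.vocab[w].index for w in domain_words if w in w2v.vocab]
    limited = w2v.vectors_norm[indices]

    # Cosine similarities against the (unit-length) query vector.
    mean = w2v.word_vec(word, use_norm=True)
    dists = np.dot(limited, mean)

    best = matutils.argsort(dists, topn=topn + 1, reverse=True)
    # Drop the query word itself if it happens to be in the domain set.
    return [(w2v.index2word[indices[i]], float(dists[i]))
            for i in best if w2v.index2word[indices[i]] != word][:topn]

Unlike deleting words from the model, this leaves the vectors untouched and simply narrows the search space at query time.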

Answer 2 (score: 1)

Note that this doesn't trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.

Suppose you only want to keep the top 5000 words in your model.

import numpy as np

wv = w2v_model.wv
words_to_trim = wv.index2word[5000:]
# In op's case 
# words_to_trim = ['graph'] 
ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

for w in words_to_trim:
    del wv.vocab[w]

wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
wv.init_sims(replace=True)

for i in sorted(ids_to_trim, reverse=True):
    del(wv.index2word[i])

This accomplishes the task because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.

The added benefit is that if you write out the KeyedVectors using a method like save_word2vec_format(), the file is much smaller.
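
For instance, a quick sketch of that round trip (the file name trimmed_vectors.bin is just an example):

from gensim.models import KeyedVectors

# Persist the trimmed vectors; the file now only contains the kept words.
wv.save_word2vec_format("trimmed_vectors.bin", binary=True)
wv_small = KeyedVectors.load_word2vec_format("trimmed_vectors.bin", binary=True)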

Answer 3 (score: 0)

Having tried a few things, I think the most straightforward way is as follows:

  1. Get the Word2Vec embeddings in text file format.
  2. Identify the lines corresponding to the word vectors that you would like to keep.
  3. Write a new text file Word2Vec embedding model.
  4. Load the model and enjoy (save it to binary if you wish, etc.)...

My sample code is as follows:

import re

# isLatin(), txtWrite() and txtAppend() are the author's own helper
# functions (not shown); file_entVecs_txt / file_entVecs_SHORT_txt are
# the input and output embedding file paths.
line_no = 0 # line0 = header
numEntities=0
targetLines = []

with open(file_entVecs_txt,'r') as fp:
    header = fp.readline() # header

    while True:
        line = fp.readline()
        if line == '': #EOF
            break
        line_no += 1

        isLatinFlag = True
        for i_l, char in enumerate(line):
            if not isLatin(char): # Care about entity that is Latin-only
                isLatinFlag = False
                break
            if char==' ': # reached separator
                ent = line[:i_l]
                break

        if not isLatinFlag:
            continue

        # Check for numbers in entity
        if re.search(r'\d', ent):
            continue

        # Check for entities with subheadings '#' (e.g. 'ENTITY/Stereotactic_surgery#History')
        if re.match(r'^ENTITY/.*#', ent):
            continue

        targetLines.append(line_no)
        numEntities += 1

# Update header with new metadata
header_new = re.sub(r'^\d+', str(numEntities), header, count=1)

# Generate the file
txtWrite('',file_entVecs_SHORT_txt)
txtAppend(header_new,file_entVecs_SHORT_txt)

line_no = 0
ptr = 0
with open(file_entVecs_txt,'r') as fp:
    while ptr < len(targetLines):
        target_line_no = targetLines[ptr]

        while (line_no != target_line_no):
            fp.readline()
            line_no+=1

        line = fp.readline()
        line_no+=1
        ptr+=1
        txtAppend(line,file_entVecs_SHORT_txt)

FYI, a failed attempt: I tried @zsozso's method (with the np.array modifications suggested by @Taegyung), left it running overnight for at least 12 hours, and it was still stuck at building the new vocab from the restricted set. Perhaps that's because I have a lot of entities... but my text-file method works within an hour.

The failed code:

# [FAILED] Stuck at "Building new vocab..."
import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    print('Building new vocab..')

    for i in range(len(w2v.vocab)):

        if (i%int(1e6)==0) and (i!=0):
            print(f'working on {i}')

        word = w2v.index2entity[i]
        vec = np.array(w2v.vectors[i])
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    print('Assigning new vocab')
    w2v.vocab = new_vocab
    print('Assigning new vectors')
    w2v.vectors = np.array(new_vectors)
    print('Assigning new index2entity, index2word')
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    print('Assigning new vectors_norm')
    w2v.vectors_norm = np.array(new_vectors_norm)