Replacing words in a text corpus using a word2vec similarity dictionary in a pandas DataFrame

Asked: 2018-01-03 08:28:06

Tags: python pandas string-matching word2vec gensim

I created a word2vec dictionary with Gensim. Is there a way to replace the words in my text corpus with their root words?

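E.g. "building" is my root word, and the dictionary holds the words similar to it. I want to replace every such similar word in my original text corpus with "building", based on its similarity score.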

Sample data from the DataFrame column

canara bank aon china bldng queens rd centeal central
des voeux rd west hk unit f kwan yick bldng phase central western
formula growth asia limited suite chinachem tower connaught rd central
bangkok bank public company limited central district branch des voeux rd central cenrta

Similarities

model.most_similar("building")
[('bu', 0.762892484664917),
 ('bldg', 0.7351159453392029),
 ('bl', 0.7237456440925598),
 ('building.', 0.7153196334838867),
 ('buliding', 0.6988817453384399),
 ('bld', 0.6966143846511841),
 ('bldng', 0.663501501083374),
 ('bdg', 0.6504702568054199),
 ('bd', 0.6480772495269775),
 ('blog', 0.6432161331176758)]

model.most_similar("ltd")
[('limited', 0.7886955142021179),
 ('limi', 0.6512018442153931),
 ('limite', 0.6031635999679565),
 ('wilford', 0.5938706994056702),
 ('lt', 0.583463728427887),
 ('lighttech', 0.5828145146369934),
 ('rmc', 0.5821658372879028),
 ('tomoike', 0.5752800703048706),
 ('jd', 0.5751883387565613),
 ('nxp', 0.5725069046020508)]

Dictionary

import gensim
from gensim import corpora,similarities,models
class AccCorpus(object):
    """Restartable iterable yielding one lowercased token list per address row."""

    def __init__(self):
        self.path = ''


    def __iter__(self):
        # data is the DataFrame that holds the raw address strings
        for sentence in data["Adj_Addr"]:
            yield [word.lower() for word in sentence.split()]


def build_corpus():
    # sg takes 0 (CBOW) or 1 (skip-gram); the original sg=2 is not a documented value
    model = gensim.models.word2vec.Word2Vec(alpha=0.025, min_alpha=0.0001, window=2, sg=1)
    sentences = AccCorpus()
    model.build_vocab(sentences)
    # a single train() call lets gensim decay the learning rate from alpha to min_alpha;
    # older gensim releases name the epochs attribute model.iter
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

    model_name = "word2vec_model"
    model.save(model_name)
    return model
model=build_corpus()

1 Answer:

Answer 0 (score: 0)

As far as I know, Gensim does not provide a function for this.

My answer relies on a few important assumptions, but here goes.

Assumptions:

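  1. You have a list of predefined root words in a variable called root_words. Given the example above, this would look like root_words = ["building", "ltd", ...]
  2. Your text data is formatted as lists of tokens in the DataFrame data, e.g. one row of data["Adj_Addr"] might look like ['bldg','wilford','bld']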
  3. /
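The sample data in the question holds raw strings rather than token lists, so assumption 2 may need a preprocessing step. A minimal sketch, assuming whitespace tokenization is good enough for these addresses; the helper name tokenize is hypothetical and not part of the question or of Gensim:

```python
def tokenize(sentence):
    """Lowercase a raw address string and split it on whitespace."""
    return sentence.lower().split()


# Sample row copied from the question
row = "canara bank aon china bldng queens rd centeal central"
tokens = tokenize(row)
print(tokens)
```

With the DataFrame from the question, this could be applied as data["Adj_Addr"] = data["Adj_Addr"].apply(tokenize) before running the replacement code.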

    threshold = 0.6  # minimum similarity for a word to count as a variant
    similar_words = {}  # maps each root word to {similar_word: similarity}
    for word in root_words:
        # Query the neighbours of each root word (most_similar returns the
        # top 10 by default) and keep those above the threshold.
        # Newer gensim releases expose this method as model.wv.most_similar.
        temp_dict = dict(model.wv.most_similar(word))
        similar_words[word] = {k: v for k, v in temp_dict.items() if v > threshold}


    def replace_words(text):
        """
        1. Loop over every word in text (text is one row of the data,
           i.e. a list of tokens).
        2. If the word is in root_words, append it to temp_text unchanged.
        3. Otherwise, look for the first root word whose sub-dictionary in
           similar_words contains the word, and append that root word.
        4. If no root word matches, append the original word so that
           nothing is dropped.
        5. Return temp_text.
        """
        temp_text = []
        for word in text:
            if word in root_words:
                temp_text.append(word)
                continue

            for root_word in similar_words:
                if word in similar_words[root_word]:
                    temp_text.append(root_word)
                    break  # stop at the first match to avoid appending twice
            else:
                # the loop ended without a match: keep the word as it is
                temp_text.append(word)

        return temp_text

    # apply the function above to your text data and create a new column
    data['replaced_words'] = data["Adj_Addr"].apply(replace_words)
    

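Admittedly, this approach is a bit convoluted, relies on the assumptions above, and is heavy on for loops (inefficient). But it should work.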
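One way to reduce the looping in replace_words is to invert similar_words once into a flat variant-to-root mapping, so that each token needs only a single dictionary lookup. The sketch below is an illustration, not part of the original answer: build_lookup and replace_words_fast are hypothetical names, and the small similar_words dict is a hand-written stand-in that reuses a few (rounded) scores from the question.

```python
def build_lookup(similar_words):
    """Invert {root: {variant: score}} into {variant: root}.

    If a variant appears under several roots, keep the root with the
    highest similarity score.
    """
    best = {}  # variant -> (score, root)
    for root, variants in similar_words.items():
        for variant, score in variants.items():
            if variant not in best or score > best[variant][0]:
                best[variant] = (score, root)
    lookup = {variant: root for variant, (score, root) in best.items()}
    # Root words always map to themselves
    for root in similar_words:
        lookup[root] = root
    return lookup


def replace_words_fast(tokens, lookup):
    """Replace each token by its root word, leaving unknown tokens as-is."""
    return [lookup.get(token, token) for token in tokens]


# Hand-written stand-in for similar_words (scores rounded from the question)
similar_words = {
    "building": {"bldg": 0.7351, "bldng": 0.6635},
    "ltd": {"limited": 0.7887},
}
lookup = build_lookup(similar_words)
print(replace_words_fast(["china", "bldng", "limited"], lookup))
# prints ['china', 'building', 'ltd']
```

On the DataFrame this could be used as data["Adj_Addr"].apply(lambda tokens: replace_words_fast(tokens, lookup)). Unlike the first-match rule in replace_words, a variant shared by several roots is resolved here by the highest score, which is a design choice rather than something the question specifies.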