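I created a word2vec dictionary using Gensim. I want to replace the words in my text corpus with their root words. Is there a way to do this?
E.g. "building" is my root word, and the dictionary contains the words similar to it. I would like to replace every similar word in my original text corpus with "building", based on the similarity scores.
Sample data in the dataframe column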
canara bank aon china bldng queens rd centeal central
des voeux rd west hk unit f kwan yick bldng phase central western
formula growth asia limited suite chinachem tower connaught rd central
bangkok bank public company limited central district branch des voeux rd central cenrta
Similarities
model.most_similar("building")
[('bu', 0.762892484664917),
('bldg', 0.7351159453392029),
('bl', 0.7237456440925598),
('building.', 0.7153196334838867),
('buliding', 0.6988817453384399),
('bld', 0.6966143846511841),
('bldng', 0.663501501083374),
('bdg', 0.6504702568054199),
('bd', 0.6480772495269775),
('blog', 0.6432161331176758)]
model.most_similar("ltd")
[('limited', 0.7886955142021179),
('limi', 0.6512018442153931),
('limite', 0.6031635999679565),
('wilford', 0.5938706994056702),
('lt', 0.583463728427887),
('lighttech', 0.5828145146369934),
('rmc', 0.5821658372879028),
('tomoike', 0.5752800703048706),
('jd', 0.5751883387565613),
('nxp', 0.5725069046020508)]
Dictionary
import gensim
from gensim import corpora,similarities,models

class AccCorpus(object):
    """Stream the address column as lists of lower-cased tokens."""
    def __init__(self):
        self.path = ''
    def __iter__(self):
        # `data` is the DataFrame that holds the addresses
        for sentence in data["Adj_Addr"]:
            yield [word.lower() for word in sentence.split()]

def build_corpus():
    # sg=1 selects skip-gram (the documented values are 0 and 1)
    model = gensim.models.word2vec.Word2Vec(alpha=0.025, min_alpha=0.025, window=2, sg=1)
    sentences = AccCorpus()
    model.build_vocab(sentences)
    for epoch in range(1):
        # model.iter is the pre-4.0 attribute name; Gensim 4.x calls it model.epochs
        model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay
    model_name = "word2vec_model"
    model.save(model_name)
    return model

model = build_corpus()
Answer 0 (score: 0)
As far as I know, Gensim does not provide a function that does this.
My answer relies on a few important assumptions, but here goes.
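Assumptions:
You have a list of predefined root words in a variable called root_words. Given the example above, this would look like root_words = ["building", "ltd", ...]
Your text data in data is a list of tokens, e.g. one row of data["Adj_Addr"] might look like ['bldg','wilford','bld']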
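The sample rows in the question are plain strings rather than token lists, so the token-list assumption needs a small preprocessing step. A minimal sketch, assuming whitespace tokenization is good enough for these addresses (the helper name tokenize is hypothetical and not part of the original code):

```python
def tokenize(row):
    """Lower-case a raw address string and split it on whitespace."""
    return row.lower().split()

# Sample rows copied from the question
rows = [
    "canara bank aon china bldng queens rd centeal central",
    "des voeux rd west hk unit f kwan yick bldng phase central western",
]

tokenized_rows = [tokenize(row) for row in rows]
print(tokenized_rows[0])
```

With a DataFrame, the same helper could be applied with data["Adj_Addr"].apply(tokenize) before running the replacement step.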
threshold = 0.6  # define a threshold
similar_words = {word: {} for word in root_words}  # create a dictionary from root_words
for word in root_words:
    # Loop over all words in root_words and create a temp dict with
    # values above the threshold using the model.most_similar() method
    # (model.wv.most_similar() in Gensim 4.x)
    temp_dict = dict(model.most_similar(word))
    temp_dict = {k: v for k, v in temp_dict.items() if v > threshold}
    # store the temp dict in the similar_words dict
    similar_words[word] = temp_dict
def replace_words(text):
    """
    1. Loop over every word in the text (text is one row in the data).
    2. If the word is in root_words, simply append it to temp_text.
    3. If not, loop over all words in the similar_words dict and check
       if the current word is in one of the sub-dictionaries. If so,
       append the root_word to temp_text.
    4. Use the flags so we don't miss out on any words (e.g. there
       may be words that are not in the root_words list or in the
       similar_words sub-dictionaries).
    5. Return temp_text.
    """
    temp_text = []
    for word in text:
        in_root_words_flag = False
        found_root_flag = False
        if word in root_words:
            temp_text.append(word)
            in_root_words_flag = True
        else:
            for root_word in similar_words:
                if word in similar_words[root_word]:
                    temp_text.append(root_word)
                    found_root_flag = True
                    break  # stop at the first match so the word is replaced only once
        if not in_root_words_flag and not found_root_flag:
            temp_text.append(word)
    return temp_text
# apply the function above to your text data and create a new column
data['replaced_words'] = data["Adj_Addr"].apply(replace_words)
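The choice of threshold matters more than it may seem. As a rough illustration, the sketch below applies the 0.6 cutoff to the similarity scores printed in the question, without needing the trained model. The helper name filter_by_threshold is hypothetical.

```python
# Similarity scores copied from the question's model.most_similar() output
building_neighbours = [
    ("bu", 0.762892484664917), ("bldg", 0.7351159453392029),
    ("bl", 0.7237456440925598), ("building.", 0.7153196334838867),
    ("buliding", 0.6988817453384399), ("bld", 0.6966143846511841),
    ("bldng", 0.663501501083374), ("bdg", 0.6504702568054199),
    ("bd", 0.6480772495269775), ("blog", 0.6432161331176758),
]
ltd_neighbours = [
    ("limited", 0.7886955142021179), ("limi", 0.6512018442153931),
    ("limite", 0.6031635999679565), ("wilford", 0.5938706994056702),
    ("lt", 0.583463728427887), ("lighttech", 0.5828145146369934),
    ("rmc", 0.5821658372879028), ("tomoike", 0.5752800703048706),
    ("jd", 0.5751883387565613), ("nxp", 0.5725069046020508),
]

def filter_by_threshold(neighbours, threshold):
    """Keep only the neighbours whose score is strictly above the threshold."""
    return {word: score for word, score in neighbours if score > threshold}

building_kept = filter_by_threshold(building_neighbours, 0.6)
ltd_kept = filter_by_threshold(ltd_neighbours, 0.6)
print(sorted(building_kept))
print(sorted(ltd_kept))
```

At 0.6, "blog" still passes for "building" while "lt" is dropped for "ltd", so it is probably worth inspecting the kept words for each root by hand rather than trusting a single cutoff.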
Admittedly, this approach is a bit convoluted, relies on assumptions, and is for-loop intensive (inefficient). But it should work.
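One possible way to cut down on the nested loops is to invert the similarity dictionary once, so that each token needs only a single dictionary lookup. This is a sketch rather than a drop-in replacement: the names build_reverse_lookup and replace_words_fast are hypothetical, and the small similar_words dict below is hand-written from (rounded) scores in the question instead of being produced by the model.

```python
def build_reverse_lookup(similar_words):
    """Map each variant to a root word; on conflicts keep the higher-scoring root."""
    lookup = {}
    best_score = {}
    for root, variants in similar_words.items():
        for variant, score in variants.items():
            if variant in similar_words:
                continue  # never rewrite a word that is itself a root word
            if score > best_score.get(variant, float("-inf")):
                lookup[variant] = root
                best_score[variant] = score
    return lookup

def replace_words_fast(tokens, lookup):
    """Replace each token with its root word, leaving unknown tokens unchanged."""
    return [lookup.get(token, token) for token in tokens]

# Hand-written stand-in for the similar_words dict (scores rounded from the question)
similar_words = {
    "building": {"bldg": 0.7351, "bld": 0.6966, "bldng": 0.6635},
    "ltd": {"limited": 0.7887, "limi": 0.6512},
}

lookup = build_reverse_lookup(similar_words)
tokens = "canara bank aon china bldng queens rd centeal central".split()
print(replace_words_fast(tokens, lookup))
```

Building the lookup costs one pass over all variants, after which every row is processed in time proportional to its length. Root words never appear as keys in the lookup, so they pass through unchanged, matching the behaviour of replace_words.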