Question

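So, I want to use word embeddings to get some handy cosine similarity values. After creating the model and checking the similarity of the word "not" (which is in the data I gave the model), it tells me that the word is not in the vocabulary.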

Why can't it find the similarity for the word "not"?

The description data looks as follows:
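[['not', 'only', 'do', 'angles', 'make', 'joints', 'stronger', 'they', 'also', 'provide', 'more', 'consistent',
  'straight', 'corners', 'simpson', 'strongtie', 'offers', 'a', 'wide', 'variety', 'of', 'angles', 'in',
  'various', 'sizes', 'and', 'thicknesses', 'to', 'handle', 'lightduty', 'jobs', 'or', 'projects', 'where', 'a',
  'structural', 'connection', 'is', 'needed', 'some', 'can', 'be', 'bent', 'skewed', 'to', 'match', 'the',
  'project', 'for', 'outdoor', 'projects', 'or', 'those', 'where', 'moisture', 'is', 'present', 'use', 'our',
  'zmax', 'zinccoated', 'connectors', 'which', 'provide', 'extra', 'resistance', 'against', 'corrosion', 'look',
  'for', 'a', 'z', 'at', 'the', 'end', 'of', 'the', 'model', 'numberversatile', 'connector', 'for', 'various',
  'connections', 'and', 'home', 'repair', 'projectsstronger', 'than', 'angled', 'nailing', 'or', 'screw',
  'fastening', 'alonehelp', 'ensure', 'joints', 'are', 'consistently', 'straight', 'and', 'strongdimensions',
  'in', 'x', 'in', 'x', 'inmade', 'from', 'gauge', 'steelgalvanized', 'for', 'extra', 'corrosion',
  'resistanceinstall', 'with', 'd', 'common', 'nails', 'or', 'x', 'in', 'strongdrive', 'sd', 'screws']]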

Note that I have already tried giving the data as separate sentences instead of separate words.

def word_vec_sim_sum(row):
    description = row.product_description.split()
    description_embedding = gensim.models.Word2Vec([description], size=150,
        window=10,
        min_count=2,
        workers=10,
        iter=10)       
    print(description_embedding.wv.most_similar(positive="not"))

Answer 1

You need to lower min_count.

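From the documentation: min_count (int, optional) – Ignores all words with total frequency lower than this. In the data you provided, "not" appears only once, so with min_count=2 it is dropped from the vocabulary. Setting min_count to 1 makes it work.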

import gensim

data = [['not', 'only', 'do', 'angles', 'make', 'joints', 'stronger', 'they', 'also', 'provide', 'more', 'consistent',
         'straight', 'corners', 'simpson', 'strongtie', 'offers', 'a', 'wide', 'variety', 'of', 'angles', 'in',
         'various', 'sizes', 'and', 'thicknesses', 'to', 'handle', 'lightduty', 'jobs', 'or', 'projects', 'where', 'a',
         'structural', 'connection', 'is', 'needed', 'some', 'can', 'be', 'bent', 'skewed', 'to', 'match', 'the',
         'project', 'for', 'outdoor', 'projects', 'or', 'those', 'where', 'moisture', 'is', 'present', 'use', 'our',
         'zmax', 'zinccoated', 'connectors', 'which', 'provide', 'extra', 'resistance', 'against', 'corrosion', 'look',
         'for', 'a', 'z', 'at', 'the', 'end', 'of', 'the', 'model', 'numberversatile', 'connector', 'for', 'various',
         'connections', 'and', 'home', 'repair', 'projectsstronger', 'than', 'angled', 'nailing', 'or', 'screw',
         'fastening', 'alonehelp', 'ensure', 'joints', 'are', 'consistently', 'straight', 'and', 'strongdimensions',
         'in', 'x', 'in', 'x', 'inmade', 'from', 'gauge', 'steelgalvanized', 'for', 'extra', 'corrosion',
         'resistanceinstall', 'with', 'd', 'common', 'nails', 'or', 'x', 'in', 'strongdrive', 'sd', 'screws']]


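def word_vec_sim_sum(row):
    # row is already a list of tokens here, so no split() is needed
    description = row
    # size= and iter= are the gensim 3.x names; gensim 4.x renamed them to vector_size= and epochs=
    description_embedding = gensim.models.Word2Vec([description], size=150,
                                                   window=10,
                                                   min_count=1,  # keep words that occur only once, such as "not"
                                                   workers=10,
                                                   iter=10)
    print(description_embedding.wv.most_similar(positive="not"))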


word_vec_sim_sum(data[0])

Output:

[('do', 0.21456070244312286), ('our', 0.1713767945766449), ('can', 0.1561305820941925), ('repair', 0.14236785471439362), ('screw', 0.1322808712720871), ('offers', 0.13223429024219513), ('project', 0.11764446645975113), ('against', 0.08542445302009583), ('various', 0.08226475119590759), ('use', 0.08193354308605194)]
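
If it helps, here is a minimal sketch for checking which words would survive a given min_count before training anything. It only counts token frequencies with the standard library and does not call gensim. The helper name surviving_vocab is made up for illustration, and the sentence is just the first 22 tokens of the description above.

```python
from collections import Counter


def surviving_vocab(sentences, min_count):
    """Return the words whose total frequency across all sentences is >= min_count.

    Hypothetical helper: it mirrors the documented min_count rule
    (words with total frequency lower than min_count are ignored).
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# First 22 tokens of the tokenized description from the question.
sentences = [['not', 'only', 'do', 'angles', 'make', 'joints', 'stronger', 'they',
              'also', 'provide', 'more', 'consistent', 'straight', 'corners',
              'simpson', 'strongtie', 'offers', 'a', 'wide', 'variety', 'of', 'angles']]

print('not' in surviving_vocab(sentences, min_count=2))     # False: "not" occurs once
print('angles' in surviving_vocab(sentences, min_count=2))  # True: "angles" occurs twice
print('not' in surviving_vocab(sentences, min_count=1))     # True
```

Running the same check on the full description should show "not" surviving only with min_count=1, which matches the error you saw with min_count=2.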
