Global goal: I am building an LDA model of product reviews in Python using NLTK and Gensim. I want to run it on different n-grams.
Problem: Everything is fine with unigrams, but when I run with bigrams I start getting topics with repeated information. For example, topic 1 might contain ['good product', 'good value'], and topic 4 might contain ['great product', 'great value']. To a human these obviously convey the same information, but obviously 'good product' and 'great product' are distinct bigrams. How do I algorithmically determine that 'good product' and 'great product' are similar enough that I can convert every occurrence of one into the other (presumably the one that occurs more often in the corpus)?
What I have tried: I played with WordNet's synset tree with little luck. It turns out that good is an 'adjective' but great is an 'adjective satellite', so path similarity returns None. My thought process is as follows:
Ideally, I would like an algorithm that can determine that good and great are similar within my corpus (perhaps in a co-occurrence sense), so that it extends to words that are not part of regular English but appear in my corpus, and so that it extends to n-grams (maybe Oracle and {{1}} are synonyms in my corpus, or terrible and feature engineering are similar).
Any suggestions on an algorithm, or suggestions for getting WordNet synsets to behave?
Answer 0 (score: 1)
If you intend to use WordNet, then you have
Problem 1: Word-sense disambiguation (WSD), i.e. how do you automatically determine which synset to use?
>>> for i in wn.synsets('good','a'):
... print i.name, i.definition
...
good.a.01 having desirable or positive qualities especially those suitable for a thing specified
full.s.06 having the normally expected amount
good.a.03 morally admirable
estimable.s.02 deserving of esteem and respect
beneficial.s.01 promoting or enhancing well-being
good.s.06 agreeable or pleasing
good.s.07 of moral excellence
adept.s.01 having or showing knowledge and skill and aptitude
good.s.09 thorough
dear.s.02 with or in a close or intimate relationship
dependable.s.04 financially sound
good.s.12 most suitable or right for a particular purpose
good.s.13 resulting favorably
effective.s.04 exerting force or influence
good.s.15 capable of pleasing
good.s.16 appealing to the mind
good.s.17 in excellent physical condition
good.s.18 tending to promote physical well-being; beneficial to health
good.s.19 not forged
good.s.20 not left to spoil
good.s.21 generally admired
>>> for i in wn.synsets('great','a'):
... print i.name, i.definition
...
great.s.01 relatively large in size or number or extent; larger than others of its kind
great.s.02 of major significance or importance
great.s.03 remarkable or out of the ordinary in degree or magnitude or effect
bang-up.s.01 very good
capital.s.03 uppercase
big.s.13 in an advanced stage of pregnancy
Let's say you somehow got the correct sense, perhaps with something like this (https://github.com/alvations/pywsd), and let's say you get the POS and synsets:
good.a.01: having desirable or positive qualities especially those suitable for a thing specified
great.s.01: relatively large in size or number or extent; larger than others of its kind
Problem 2: How do you compare the 2 synsets?
Let's try the similarity functions, but you realize that they don't give you any score:
>>> good = wn.synsets('good','a')[0]
>>> great = wn.synsets('great','a')[0]
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None
>>> print max(wn.wup_similarity(good,great), wn.wup_similarity(great, good))
>>> print max(wn.res_similarity(good,great,semcor_ic), wn.res_similarity(great, good,semcor_ic))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1312, in res_similarity
return synset1.res_similarity(synset2, ic, verbose)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 738, in res_similarity
ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
(synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.jcn_similarity(good,great,semcor_ic), wn.jcn_similarity(great, good,semcor_ic))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1316, in jcn_similarity
return synset1.jcn_similarity(synset2, ic, verbose)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 759, in jcn_similarity
ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
(synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
return synset1.lin_similarity(synset2, ic, verbose)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
(synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lch_similarity(good,great), wn.lch_similarity(great, good))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1304, in lch_similarity
return synset1.lch_similarity(synset2, verbose, simulate_root)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 638, in lch_similarity
(self, other))
nltk.corpus.reader.wordnet.WordNetError: Computing the lch similarity requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
Let's try a different pair of synsets: since good carries both satellite-adjective and adjective senses while great has only satellite senses, let's go with the lowest common denominator:
good.s.13 resulting favorably
great.s.01 relatively large in size or number or extent; larger than others of its kind
You then realize that the information-content measures have no data for the satellite-adjective part of speech:
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
return synset1.lin_similarity(synset2, ic, verbose)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1645, in _lcs_ic
ic1 = information_content(synset1, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1666, in information_content
raise WordNetError(msg % synset.pos)
nltk.corpus.reader.wordnet.WordNetError: Information content file has no entries for part-of-speech: s
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None
By now it looks like WordNet is creating more problems than it is solving here, so let's try another approach: word clustering, see http://en.wikipedia.org/wiki/Word-sense_induction
This is also where I give up on answering the broad and open-ended question the OP posted, because a lot has been done in clustering that is automagic to mere mortals like me =)
Answer 1 (score: 0)
You said (emphasis added):
Ideally, I would like an algorithm that can determine that good and great are similar within my corpus (perhaps in a co-occurrence sense)
You can measure word similarity by measuring how often these words appear in the same sentence as other words (i.e. co-occurrence). To capture more semantic relatedness, you can also capture collocations, that is, how often words appear within a window of words in the neighborhood of the target word.
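A minimal pure-Python sketch of that window-based co-occurrence idea, using made-up two-word sentences: good and great never co-occur with each other, but they share the same neighbours, which is exactly the signal of distributional similarity:

```python
# Count word co-occurrences within a +/-k token window (toy example).
from collections import Counter

def window_cooccurrences(sentences, k=2):
    """Return Counter mapping (word, neighbour) pairs to window co-occurrence counts."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

sents = [["good", "product"], ["great", "product"],
         ["good", "value"], ["great", "value"]]
counts = window_cooccurrences(sents)

# 'good' and 'great' share the neighbours 'product' and 'value':
good_nbrs = {b for (a, b) in counts if a == "good"}
great_nbrs = {b for (a, b) in counts if a == "great"}
print(good_nbrs & great_nbrs)  # {'product', 'value'}
```

Comparing words by the overlap (or cosine similarity) of their neighbour profiles is the simplest version of the similarity measure described above.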
This paper deals with Word Sense Disambiguation (WSD) and uses collocations and surrounding words (co-occurrences) as part of its feature space. Its results are quite good, so I guess you can use the same features for your problem.
In Python you can use sklearn; in particular, you may want to look at SVMs (the documentation includes example code) to get you started.
The general idea would be along these lines:
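A hedged sketch of that idea: represent each occurrence of an ambiguous word by the surrounding words of its context (a bag-of-words stand-in for the paper's collocation features) and train an SVM to predict the sense. The contexts, sense labels, and test sentence below are all invented for illustration:

```python
# Sense classification from context words with sklearn (illustrative data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Contexts in which 'good'/'great' appear, with hand-assigned sense labels.
contexts = [
    "good product works well",     # positive-quality sense
    "great product works well",    # positive-quality sense
    "good value for money",        # positive-quality sense
    "a good three miles away",     # considerable-amount sense
    "waited a good while longer",  # considerable-amount sense
]
labels = ["quality", "quality", "quality", "amount", "amount"]

vec = CountVectorizer()             # surrounding words as features
X = vec.fit_transform(contexts)
clf = LinearSVC().fit(X, labels)

pred = clf.predict(vec.transform(["great value for money"]))
print(pred[0])
```

With real data you would extract the window around each target word rather than whole sentences, and you would need far more labeled examples than this toy set.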