Question

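Suppose I have a pool (list) of well-known phrases, for example: {"I love you", "Your mother is a ...", "I think I am pregnant", ...}. Let's say there are about 1,000 of them. Now I want users to enter free text into a text box, have some kind of NLP engine digest that text, and find the 10 phrases from the pool that are most relevant to it.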

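I think the simplest implementation would work word by word: pick one word at a time and look for similarities in some way. I'm not sure which way, though.
What scares me most is the size of the vocabulary I would have to support. I'm the sole developer of a demo of sorts, and I don't like the idea of filling in words in a table...
I'm looking for a free NLP engine. I'm agnostic about the language it's written in, but it must be free - not some online service that charges per API call.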

Answer 1

TextBlob and ConceptNet seem like more than adequate solutions to this problem!

Answer 2

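TextBlob is an easy-to-use NLP library for Python that is free and open source (licensed under the permissive MIT License). It provides a nice wrapper around the excellent NLTK and pattern libraries.

One simple approach to your problem would be to extract noun phrases from the given text.

Here's an example from the TextBlob docs.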

from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
print(blob.noun_phrases)
# => ['titular threat', 'blob', 'ultimate movie monster', ...]

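This could be a starting point. From there you could experiment with other methods, such as the similarity methods mentioned in the comments, or TF-IDF. TextBlob also makes it easy to swap models for noun phrase extraction.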
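As a rough illustration of the TF-IDF idea, here is a minimal sketch that ranks a phrase pool against free text using only the Python standard library. It is not part of TextBlob; the function names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `top_phrases`), the smoothed IDF formula, and the three-phrase pool are all assumptions made for the example, and "Your mother is a hamster" is a made-up stand-in for the elided phrase in the question.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes (a deliberately crude tokenizer).
    return re.findall(r"[a-z']+", text.lower())


def build_idf(docs_tokens):
    # Smoothed inverse document frequency over the phrase pool (assumed formula).
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unseen in the pool are ignored.
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_phrases(query, phrases, k=10):
    # Return up to k (phrase, score) pairs, best first, dropping zero-score phrases.
    docs = [tokenize(p) for p in phrases]
    idf = build_idf(docs)
    vectors = [tfidf_vector(d, idf) for d in docs]
    q = tfidf_vector(tokenize(query), idf)
    scored = sorted(
        ((cosine(q, v), p) for v, p in zip(vectors, phrases)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(p, s) for s, p in scored[:k] if s > 0.0]


# Hypothetical three-phrase pool standing in for the ~1000 phrases in the question.
phrases = ["I love you", "Your mother is a hamster", "I think I'm pregnant"]
results = top_phrases("honestly I think I might be pregnant", phrases, k=2)
for phrase, score in results:
    print(f"{score:.2f}  {phrase}")
```

With about 1,000 short phrases this brute-force comparison should be fast enough for a demo. Because it only matches exact word overlap, synonyms and inflected forms will be missed unless the tokenizer is extended with stemming or lemmatization.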

Full disclosure: I am the author of TextBlob.

Best open source / free NLP engine for the job
