I'm working on a simple NLP project: given a text and a word, I want to find the most likely sense of that word in the text.
Is there any implementation of WSD algorithms in Python? It's not clear to me whether there's something in NLTK that can help. I'd be happy even with a naive implementation like the Lesk algorithm.
I've read similar questions like Word sense disambiguation in NLTK Python, but they only give a reference to an NLTK book, which isn't really about the WSD problem.
Answer 0 (Score: 9)
In short: https://github.com/alvations/pywsd
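For a quick taste of what that package offers, here is a minimal sketch based on its README; the exact function names and signatures (e.g. simple_lesk and its pos keyword) are assumptions and may differ between pywsd versions:

from pywsd.lesk import simple_lesk

# Disambiguate 'bank' using the whole sentence as context
# (pos='n' restricts the candidates to noun senses).
answer = simple_lesk('I went to the bank to deposit my money', 'bank', pos='n')
print(answer, ':', answer.definition())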
In long: there is an endless range of techniques used for WSD, from mind-blowing machine-learning techniques that require lots of GPU power, to simply using the information in WordNet, or even just using sense frequencies; see the survey linked at the end of this answer.
Let's start with a simple Lesk algorithm that allows optional stemming, see http://en.wikipedia.org/wiki/Lesk_algorithm:
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']
plant_sents = ['The workers at the industrial plant were overworked',
               'The plant was no longer bearing flowers']

ps = PorterStemmer()

def lesk(context_sentence, ambiguous_word, pos=None, stem=True, hyperhypo=True):
    max_overlaps = 0
    lesk_sense = None
    context_sentence = context_sentence.split()
    if stem:  # Matching exact words causes sparsity, so let's match stems.
        context_sentence = [ps.stem(i) for i in context_sentence]
    for ss in wn.synsets(ambiguous_word):
        # If a POS is specified, skip senses with a different POS.
        if pos and ss.pos() != pos:
            continue
        lesk_dictionary = []
        # Include the definition (gloss).
        lesk_dictionary += ss.definition().split()
        # Include the lemma names.
        lesk_dictionary += ss.lemma_names()
        # Optional: include the lemma names of hypernyms and hyponyms.
        if hyperhypo:
            lesk_dictionary += list(chain(*[i.lemma_names() for i in ss.hypernyms() + ss.hyponyms()]))
        if stem:
            lesk_dictionary = [ps.stem(i) for i in lesk_dictionary]
        overlaps = set(lesk_dictionary).intersection(context_sentence)
        if len(overlaps) > max_overlaps:
            lesk_sense = ss
            max_overlaps = len(overlaps)
    return lesk_sense
print "Context:", bank_sents[0]
answer = lesk(bank_sents[0],'bank')
print "Sense:", answer
print "Definition:",answer.definition
print
print "Context:", bank_sents[1]
answer = lesk(bank_sents[1],'bank','n')
print "Sense:", answer
print "Definition:",answer.definition
print
print "Context:", plant_sents[0]
answer = lesk(plant_sents[0],'plant','n', True)
print "Sense:", answer
print "Definition:",answer.definition
print
Besides Lesk-like algorithms, people have tried various similarity measures; here is a good, somewhat dated, but still useful survey: http://dl.acm.org/citation.cfm?id=1459355
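To illustrate the similarity-based family, here is a minimal, hypothetical sketch (max_similarity_wsd is an illustrative name, not an NLTK function) that picks the sense with the highest total path similarity to the other context words:

from nltk.corpus import wordnet as wn

def max_similarity_wsd(context_sentence, ambiguous_word, pos=None):
    best_sense, best_score = None, 0.0
    context = [w for w in context_sentence.split() if w != ambiguous_word]
    for ss in wn.synsets(ambiguous_word, pos=pos):
        score = 0.0
        for w in context:
            # Best path similarity between this sense and any sense of w;
            # path_similarity can return None across POS, so treat that as 0.
            sims = [ss.path_similarity(ws) or 0.0 for ws in wn.synsets(w)]
            if sims:
                score += max(sims)
        if score > best_score:
            best_sense, best_score = ss, score
    return best_sense

print(max_similarity_wsd('The river bank was full of dead fishes', 'bank', 'n'))

Other WordNet measures such as wup_similarity can be swapped in at the same place.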
Answer 1 (Score: 2)
You can try getting the first sense of each word using WordNet, which is included in NLTK, with this short piece of code:
from nltk.corpus import wordnet as wn

def get_first_sense(word, pos=None):
    if pos:
        synsets = wn.synsets(word, pos)
    else:
        synsets = wn.synsets(word)
    return synsets[0]

best_synset = get_first_sense('bank')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'n')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'v')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
Which will print:
bank.n.01: sloping land (especially the slope beside a body of water)
set.n.01: a group of things of the same kind that belong together and are so used
put.v.01: put into a certain place or abstract location
Surprisingly enough, this approach works quite well, since the first sense usually dominates the others.
Answer 2 (Score: 0)
For WSD in Python, you can try the WordNet bindings in NLTK or the Gensim library. The building blocks are there, but developing the complete algorithm is probably up to you.
For example, using WordNet you can implement the Simplified Lesk algorithm, as described in the Wikipedia entry on the Lesk algorithm.
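As a starting point, here is a minimal sketch of Simplified Lesk on top of NLTK's WordNet bindings (simplified_lesk is an illustrative name; a sense's signature is taken to be its gloss plus its example sentences, falling back to the most frequent sense when there is no overlap):

from nltk.corpus import wordnet as wn

def simplified_lesk(context_sentence, word, pos=None):
    context = set(context_sentence.lower().split())
    senses = wn.synsets(word, pos=pos)
    if not senses:
        return None
    best_sense = senses[0]   # default: the most frequent sense
    best_overlap = 0
    for ss in senses:
        # Signature = words in the gloss plus words in the usage examples.
        signature = set(ss.definition().lower().split())
        for example in ss.examples():
            signature |= set(example.lower().split())
        overlap = len(context & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = ss, overlap
    return best_sense

print(simplified_lesk('The river bank was full of dead fishes', 'bank', 'n'))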