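I am trying to get the base English word for an English word that has been modified from its base form. This question has been asked here before, but I did not see a proper answer, so I am asking it this way. I tried two stemmers and one lemmatizer from the NLTK package: the Porter stemmer, the Snowball stemmer, and the WordNet lemmatizer.
I tried this code: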
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

words = ['arrival', 'conclusion', 'ate']

# Create each stemmer/lemmatizer once instead of on every iteration.
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

for word in words:
    print("\n\nOriginal Word =>", word)
    print("porter stemmer=>", porter_stemmer.stem(word))
    print("snowball stemmer=>", snowball_stemmer.stem(word))
    print("WordNet Lemmatizer=>", lemmatizer.lemmatize(word))
This is the output I get:
Original Word => arrival
porter stemmer=> arriv
snowball stemmer=> arriv
WordNet Lemmatizer=> arrival
Original Word => conclusion
porter stemmer=> conclus
snowball stemmer=> conclus
WordNet Lemmatizer=> conclusion
Original Word => ate
porter stemmer=> ate
snowball stemmer=> ate
WordNet Lemmatizer=> ate
But I want this output:
Input : arrival
Output: arrive
Input : conclusion
Output: conclude
Input : ate
Output: eat
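How can I achieve this? Are there any tools available for it? I know this is called morphological analysis, but there must be some tool that already implements it. Any help is appreciated :)
First edit
I tried this code: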
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

query = "The Indian economy is the worlds tenth largest by nominal GDP and third largest by purchasing power parity"


def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']


def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']


def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']


def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']


def penn_to_wn(tag):
    # Map a Penn Treebank POS tag to the matching WordNet POS constant.
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    # Fall back to noun for any other tag.
    return wn.NOUN


lemmatizer = WordNetLemmatizer()
tags = nltk.pos_tag(word_tokenize(query))
for token, tag in tags:
    wn_tag = penn_to_wn(tag)
    print(token + "---> " + lemmatizer.lemmatize(token, wn_tag))
Here I tried giving the WordNet lemmatizer the appropriate part-of-speech tag for each word. This is the output:
The---> The
Indian---> Indian
economy---> economy
is---> be
the---> the
worlds---> world
tenth---> tenth
largest---> large
by---> by
nominal---> nominal
GDP---> GDP
and---> and
third---> third
largest---> large
by---> by
purchasing---> purchase
power---> power
parity---> parity
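Still, words like "arrival" and "conclusion" are not handled by this approach. Is there any solution for this?
Answer 0 (score: 2)
OK, so... for the word "ate", I think you are looking for NodeBox::Linguistics.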
import en  # NodeBox Linguistics module
print(en.verb.present("gave"))
# Output: give
I don't fully understand why you want the verb for "arrival" but not for "conclusion".
Answer 1 (score: 0)
Try the word_stemmer package: clone it from here and run pip install -e word_forms.
from word_forms.word_forms import get_word_forms
get_word_forms('conclusion')
# gives:
{'a': {'conclusive'},
'n': {'conclusion', 'conclusions', 'conclusivenesses', 'conclusiveness'},
'r': {'conclusively'},
'v': {'concludes', 'concluded', 'concluding', 'conclude'}}
In your case, you want to get the verb form from the noun word form.
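As a rough sketch of that last step, the snippet below picks a single base verb out of a word-forms dictionary. The dictionary is copied from the get_word_forms('conclusion') output shown above, so the snippet runs without the package installed. The helper pick_base_verb is hypothetical and not part of word_forms, and the "shortest verb form is the base form" rule is only an assumed heuristic.

```python
# Copied from the get_word_forms('conclusion') output shown above;
# in real use this dictionary would come from that call.
forms = {
    'a': {'conclusive'},
    'n': {'conclusion', 'conclusions', 'conclusivenesses', 'conclusiveness'},
    'r': {'conclusively'},
    'v': {'concludes', 'concluded', 'concluding', 'conclude'},
}


def pick_base_verb(word_forms):
    """Hypothetical helper: guess the base verb from a word-forms dict.

    Heuristic (an assumption, not part of word_forms): the base form is
    usually the shortest verb form; ties are broken alphabetically.
    Returns None when the dict has no verb forms.
    """
    verbs = word_forms.get('v', set())
    if not verbs:
        return None
    return min(verbs, key=lambda form: (len(form), form))


print(pick_base_verb(forms))  # conclude
```

This heuristic can pick the wrong form for irregular verbs whose inflected forms are as short as the base form (for example "ate" and "eat" have the same length), so treat the result as a guess and check it against a lemmatizer where accuracy matters.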