NLTK Lemmatizer, Extract meaningful words

Time: 2018-09-18 19:45:00

Tags: python-3.x nlp nltk lemmatization

I am building a machine-learning program that automatically maps items to categories.

Before that, I want to do some natural language processing.

I have a list of words:

sent = ('The laughs you two heard were triggered '
        'by memories of his own high j-flying '
        'moist moisture moisturize moisturizing').lower().split()

I wrote the following code, based on this question: NLTK: lemmatizer and pos_tag

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


def lemmatize_all(sentence):
    """Yield lemmas of nouns, verbs and adjectives; all other words are dropped."""
    wnl = WordNetLemmatizer()
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith('NN'):
            yield wnl.lemmatize(word, pos='n')
        elif tag.startswith('VB'):
            yield wnl.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):
            yield wnl.lemmatize(word, pos='a')


words = ' '.join(lemmatize_all(' '.join(sent)))

The resulting values are shown below.

laugh heard be trigger memory own high j-flying moist moisture moisturize moisturizing

I am satisfied with the following results.

laughs -> laugh 
were -> be
triggered -> trigger 
memories -> memory 
moist -> moist 

However, I am not satisfied with the following results.

heard -> heard 
j-flying -> j-flying 
moisture -> moisture 
moisturize -> moisturize 
moisturizing -> moisturizing 

Although this is better than the raw input, I would like the following results instead.

heard -> hear
j-flying -> fly
moisture -> moist
moisturize -> moist
moisturizing -> moist

If you know a better way to extract meaningful words, please let me know. Thank you.

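2 Answers:

Answer 0 (score: 3)

TL;DR

This is an XY problem: the lemmatizer fails to meet your expectations because the lemmatizer you are using is designed to solve a different problem.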


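In Long

Q: What is a lemma?

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. - Wikipedia

Q: What is the "dictionary form"?

NLTK uses the morphy algorithm, which uses WordNet as the basis for its "dictionary forms".

See also How does spacy lemmatizer works?. Note that SpaCy adds extra hacks to handle more irregular words.

Q: Why moisture -> moisture and moisturizing -> moisturizing?

Because there are synsets (a kind of "dictionary form") for "moisture" and "moisturizing":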

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets('moisture')
[Synset('moisture.n.01')]
>>> wn.synsets('moisture')[0].definition()
'wetness caused by water'

>>> wn.synsets('moisturizing')
[Synset('humidify.v.01')]
>>> wn.synsets('moisturizing')[0].definition()
'make (more) humid'

Q: How can I get moisture -> moist?

Not in a really useful way. But you could try a stemmer (just don't expect too much from it):

>>> from nltk.stem import PorterStemmer

>>> porter = PorterStemmer()
>>> porter.stem("moisture")
'moistur'

>>> porter.stem("moisturizing")
'moistur'
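
To see why a stemmer returns fragments such as 'moistur' rather than the word 'moist', here is a minimal sketch of suffix stripping. It is not the Porter algorithm; the suffix list and the `crude_stem` helper are hypothetical and chosen only for these four words.

```python
# Toy suffix stripper: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("izing", "ize", "ing", "e", "s")  # hypothetical, hand-picked list


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ("moist", "moisture", "moisturize", "moisturizing"):
    print(w, "->", crude_stem(w))
```

The three derived words collapse to the same stem, but that stem is a string prefix, not a dictionary word, so it still does not match 'moist'.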

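Q: Then how do I get moisturizing/moisture -> moist?!

There is no well-founded way to do that. But before even trying, ask what the eventual purpose of mapping moisturizing/moisture -> moist is.

Is it really necessary to do that?

If you really want to, you can try word vectors and look for the most similar words, but word vectors come with a whole other set of caveats.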
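
As a rough idea of what "find the most similar word" means with word vectors, the sketch below ranks words by cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings come from a trained model, and the names `TOY_VECTORS`, `cosine` and `most_similar` are hypothetical.

```python
import math

# Made-up 3-dimensional vectors for illustration only; real embeddings
# come from a trained model and usually have hundreds of dimensions.
TOY_VECTORS = {
    "moist": [0.9, 0.1, 0.0],
    "moisturize": [0.8, 0.2, 0.1],
    "fly": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    target = vectors[word]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(most_similar("moisturize", TOY_VECTORS))
```

With real embeddings the nearest neighbour is not guaranteed to be a morphological relative; it may just as well be a synonym or a word that merely co-occurs often, which is one of the caveats mentioned above.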

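Q: Wait a minute, but heard -> heard is ridiculous!!

Yes, the POS tagger is not tagging heard correctly. Most probably this is because the input is not a proper sentence, so the POS tags for its words are wrong: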

>>> from nltk import word_tokenize, pos_tag
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'

>>> pos_tag(word_tokenize(sent))
[('The', 'DT'), ('laughs', 'NNS'), ('you', 'PRP'), ('two', 'CD'), ('heard', 'NNS'), ('were', 'VBD'), ('triggered', 'VBN'), ('by', 'IN'), ('memories', 'NNS'), ('of', 'IN'), ('his', 'PRP$'), ('own', 'JJ'), ('high', 'JJ'), ('j-flying', 'NN'), ('moist', 'NN'), ('moisture', 'NN'), ('moisturize', 'VB'), ('moisturizing', 'NN'), ('.', '.')]

We see that heard is tagged as NNS (a noun). If we lemmatize it as a verb instead:

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('heard', pos='v')
'hear'
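
When the tagger cannot be trusted, one possible workaround is to try several parts of speech and keep the first lemma that differs from the input. The sketch below assumes a `lemmatize(word, pos)` callable such as `WordNetLemmatizer().lemmatize`; the helper `lemmatize_any_pos` and the toy lookup table are hypothetical and exist only so the example runs without NLTK data.

```python
def lemmatize_any_pos(word, lemmatize, pos_order=("v", "n", "a", "r")):
    """Return the first lemma that differs from the word, trying each POS in turn."""
    for pos in pos_order:
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    return word


# Stand-in for WordNetLemmatizer().lemmatize so the sketch runs without NLTK data.
# The table is hypothetical and covers only the words used here.
TOY_LEMMAS = {("heard", "v"): "hear", ("memories", "n"): "memory"}


def toy_lemmatize(word, pos):
    return TOY_LEMMAS.get((word, pos), word)


print(lemmatize_any_pos("heard", toy_lemmatize))
print(lemmatize_any_pos("moist", toy_lemmatize))
```

With NLTK installed, passing `WordNetLemmatizer().lemmatize` instead of `toy_lemmatize` should work the same way. The trade-off is that ignoring the tag can over-lemmatize ambiguous words, so this is a fallback rather than a replacement for good POS tags.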

Q: Then how do I get the correct POS tags?!

Probably with SpaCy, where you get ('heard', 'VERB'):

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> doc = nlp(sent)
>>> [(word.text, word.pos_) for word in doc]
[('The', 'DET'), ('laughs', 'VERB'), ('you', 'PRON'), ('two', 'NUM'), ('heard', 'VERB'), ('were', 'VERB'), ('triggered', 'VERB'), ('by', 'ADP'), ('memories', 'NOUN'), ('of', 'ADP'), ('his', 'ADJ'), ('own', 'ADJ'), ('high', 'ADJ'), ('j', 'NOUN'), ('-', 'PUNCT'), ('flying', 'VERB'), ('moist', 'NOUN'), ('moisture', 'NOUN'), ('moisturize', 'NOUN'), ('moisturizing', 'NOUN'), ('.', 'PUNCT')]

But note that in this case SpaCy got ('moisturize', 'NOUN') while NLTK got ('moisturize', 'VB').

Q: But can't I get moisturize -> moist with SpaCy?

Let's not go back to the start, where we defined what a lemma is. In short:
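
If you want to feed coarse tags like the ones above into the WordNet lemmatizer, they first have to be mapped to WordNet's single-letter POS codes ('n', 'v', 'a', 'r'). A minimal sketch follows; the names `UNIVERSAL_TO_WORDNET` and `wordnet_pos` are hypothetical, and tags without a WordNet counterpart map to `None` by assumption.

```python
# Coarse tags (as in the output above) mapped to WordNet POS codes.
UNIVERSAL_TO_WORDNET = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def wordnet_pos(universal_tag, default=None):
    """Return the WordNet POS code for a coarse tag, or default if there is none."""
    return UNIVERSAL_TO_WORDNET.get(universal_tag, default)


tagged = [("heard", "VERB"), ("memories", "NOUN"), ("by", "ADP")]
print([(word, wordnet_pos(tag)) for word, tag in tagged])
```

Words whose tag maps to `None` (prepositions, determiners, punctuation) can then be skipped or passed through unchanged, depending on what the downstream task needs.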

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> doc = nlp(sent)
>>> [word.lemma_ for word in doc]
['the', 'laugh', '-PRON-', 'two', 'hear', 'be', 'trigger', 'by', 'memory', 'of', '-PRON-', 'own', 'high', 'j', '-', 'fly', 'moist', 'moisture', 'moisturize', 'moisturizing', '.']
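
The lemma list above still contains the '-PRON-' placeholder and punctuation tokens. If only word-like lemmas are wanted, a small filter such as the following sketch can drop them; `drop_placeholders` is a hypothetical helper, and the list is copied from the output above.

```python
# Lemmas copied from the spaCy output above.
lemmas = ['the', 'laugh', '-PRON-', 'two', 'hear', 'be', 'trigger', 'by',
          'memory', 'of', '-PRON-', 'own', 'high', 'j', '-', 'fly', 'moist',
          'moisture', 'moisturize', 'moisturizing', '.']


def drop_placeholders(lemmas, placeholder='-PRON-'):
    """Keep lemmas that are not the pronoun placeholder and contain a letter or digit."""
    return [lemma for lemma in lemmas
            if lemma != placeholder and any(ch.isalnum() for ch in lemma)]


print(drop_placeholders(lemmas))
```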

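See also How does spacy lemmatizer works?

Q: Okay, fine. I can't get moisturize -> moist... and the POS tags are not perfect for heard -> hear. But why can't I get j-flying -> fly?

Going back to the question of why you need to convert j-flying -> fly: there are counterexamples showing why you would not want to split something that looks like a compound.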

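For example:

  • Should Classical-sounding go to sound?
  • Should X-fitting go to fit?
  • Should crash-landing go to landing?

Depending on the ultimate purpose of your application, converting a token to your desired form may or may not be necessary.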
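
If you decide that splitting is right for your application, a minimal sketch is to keep only the part after the last hyphen and lemmatize that part afterwards. The helper `last_component` is hypothetical, and the counterexamples above show where this rule loses information.

```python
def last_component(token):
    """Return the part after the final hyphen, or the token itself if there is none."""
    return token.rsplit("-", 1)[-1]


for tok in ("j-flying", "Classical-sounding", "X-fitting", "crash-landing", "moist"):
    print(tok, "->", last_component(tok))
```

Note that "j-flying" becomes "flying", which still needs a verb lemmatization step to reach "fly", and "crash-landing" becomes "landing", which drops the "crash" that gives the compound its meaning.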

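Q: Then what is a good way to extract meaningful words?

I sound like a broken record, but it depends on what your ultimate goal is.

If your goal is really to understand the meaning of words, then you have to ask yourself the question, "What is the meaning of meaning?"

Does an individual word have a meaning outside of its context? Or does it have the sum of the meanings from all the contexts it could possibly occur in?

Currently, the state of the art basically treats every meaning as an array of floats, and comparisons between those arrays of floats are what give meaning its meaning. But is that really the meaning, or just a means to an end? (Pun intended.)

Q: Why am I getting more questions than answers?

Welcome to the world of computational linguistics, which has its roots in philosophy (like computer science). Natural language processing is commonly described as the application of computational linguistics.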


Food for thought

Q: Is a lemmatizer better than a stemmer?

A: There is no definite answer. (See Stemmers vs Lemmatizers.)

Answer 1 (score: 0)

Lemmatization is not an easy task, and you should not expect perfect results. You can, however, check whether you like the results of other lemmatization libraries better.

SpaCy is an obvious Python option to evaluate. Stanford CoreNLP is another (JVM-based and GPL-licensed).

There are other options, but none will be perfect.