Why are there different Lemmatizers in the NLTK library?

Time: 2016-11-09 18:21:03

Tags: python nlp nltk lemmatization

>>> from nltk.stem import WordNetLemmatizer as lm1
>>> from nltk import WordNetLemmatizer as lm2
>>> from nltk.stem.wordnet import WordNetLemmatizer as lm3

All three work the same way for me, but just to confirm: do they provide anything different?

2 answers:

Answer 0 (score: 5)

No, they are not different; they are all the same.

>>> from nltk.stem import WordNetLemmatizer as lm1
>>> from nltk import WordNetLemmatizer as lm2
>>> from nltk.stem.wordnet import WordNetLemmatizer as lm3
>>> lm1 == lm2
True
>>> lm2 == lm3
True
>>> lm1 == lm3
True

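As erip pointed out in a correction, this happens because:

The class (WordNetLemmatizer) is defined in nltk.stem.wordnet, so you can do from nltk.stem.wordnet import WordNetLemmatizer as lm3.

It is also imported in the nltk __init__.py file, so you can do from nltk import WordNetLemmatizer as lm2.

And it is imported in the __init__.py of the nltk.stem module as well, so you can do from nltk.stem import WordNetLemmatizer as lm1.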
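The re-export pattern described above can be checked with an identity test rather than by eye. The following is a minimal sketch, not part of the original answer: the helper names resolve and all_same_object are hypothetical, and the standard-library json package is used as a stand-in because json/__init__.py re-exports JSONDecoder from json.decoder in the same way.

```python
import importlib


def resolve(module_path, name):
    """Import module_path and return the attribute called name."""
    return getattr(importlib.import_module(module_path), name)


def all_same_object(name, *module_paths):
    """Return True if every import path yields the very same object."""
    objects = [resolve(path, name) for path in module_paths]
    return all(obj is objects[0] for obj in objects)


# json re-exports JSONDecoder from json.decoder, so both paths
# lead to one class object.
print(all_same_object("JSONDecoder", "json", "json.decoder"))

# __module__ records where the class was actually defined.
print(resolve("json", "JSONDecoder").__module__)
```

With NLTK installed, calling all_same_object("WordNetLemmatizer", "nltk", "nltk.stem", "nltk.stem.wordnet") should likewise return True, in line with the answer above.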

Answer 1 (score: 1)

Answer: they are all the same.

inspect is a useful tool for checking whether objects are the same:

>>> import inspect
>>> from nltk.stem import WordNetLemmatizer as wnl1
>>> from nltk.stem.wordnet import WordNetLemmatizer as wnl2
>>> inspect.getfile(wnl1)
'/Library/Python/2.7/site-packages/nltk/stem/wordnet.pyc'
# They come from the same file:
>>> inspect.getfile(wnl1) == inspect.getfile(wnl2)
True
>>> print(inspect.getdoc(wnl1))
WordNet Lemmatizer

Lemmatize using WordNet's built-in morphy function.
Returns the input word unchanged if it cannot be found in WordNet.

    >>> from nltk.stem import WordNetLemmatizer
    >>> wnl = WordNetLemmatizer()
    >>> print(wnl.lemmatize('dogs'))
    dog
    >>> print(wnl.lemmatize('churches'))
    church
    >>> print(wnl.lemmatize('aardwolves'))
    aardwolf
    >>> print(wnl.lemmatize('abaci'))
    abacus
    >>> print(wnl.lemmatize('hardrock'))
    hardrock

You can also view the source code:

>>> print(inspect.getsource(wnl1))
class WordNetLemmatizer(object):
    """
    WordNet Lemmatizer

    Lemmatize using WordNet's built-in morphy function.
    Returns the input word unchanged if it cannot be found in WordNet.

        >>> from nltk.stem import WordNetLemmatizer
        >>> wnl = WordNetLemmatizer()
        >>> print(wnl.lemmatize('dogs'))
        dog
        >>> print(wnl.lemmatize('churches'))
        church
        >>> print(wnl.lemmatize('aardwolves'))
        aardwolf
        >>> print(wnl.lemmatize('abaci'))
        abacus
        >>> print(wnl.lemmatize('hardrock'))
        hardrock
    """

    def __init__(self):
        pass

    def lemmatize(self, word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word

    def __repr__(self):
        return '<WordNetLemmatizer>'

# They have the same source code too:
>>> print(inspect.getsource(wnl1) == inspect.getsource(wnl2))
True
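The lemmatize method in the source listing above picks the shortest candidate that wordnet._morphy returns and falls back to the input word when there are none. Here is a minimal sketch of that selection step alone. The function name choose_lemma is hypothetical, and the candidate lists are made-up stand-ins, not real _morphy output.

```python
def choose_lemma(word, candidates):
    """Pick the shortest candidate, or fall back to the word itself."""
    return min(candidates, key=len) if candidates else word


# Hypothetical candidate lists standing in for what _morphy might return.
print(choose_lemma("dogs", ["dog"]))      # one candidate: it is returned
print(choose_lemma("hardrock", []))       # no candidates: word unchanged
print(choose_lemma("xs", ["xyz", "xy"]))  # several: the shortest wins
```

This is why a word that WordNet does not know comes back unchanged.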

The NLTK import structure for WordNetLemmatizer looks like this:

\nltk
    __init__.py
    \stem
        __init__.py
        wordnet.py     # This is where WordNetLemmatizer code resides.

We start at the lowest level, where WordNetLemmatizer is defined, in nltk/stem/wordnet.py (https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L15), so you can do:

from nltk.stem.wordnet import WordNetLemmatizer

In nltk/stem/__init__.py, at https://github.com/nltk/nltk/blob/develop/nltk/stem/__init__.py#L30, we see the import above, which gives nltk.stem access to WordNetLemmatizer, so you can do:

from nltk.stem import WordNetLemmatizer

In nltk/__init__.py we see:

from nltk.stem import *

This gives the top-level nltk package access to everything that nltk.stem has access to. So at the top level of nltk, we can do:

from nltk import WordNetLemmatizer
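The three-level chain described above can be reproduced in isolation. The sketch below builds a throwaway package in a temporary directory whose layout mirrors nltk/stem/wordnet.py; the names demo_pkg and DemoLemmatizer are hypothetical and the package is only an illustration of the import mechanics, not NLTK's actual code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root):
    """Create a tiny package laid out like nltk/stem/wordnet.py."""
    stem = root / "demo_pkg" / "stem"
    stem.mkdir(parents=True)
    # Lowest level: the module that actually defines the class.
    (stem / "wordnet.py").write_text("class DemoLemmatizer:\n    pass\n")
    # Middle level: the subpackage re-exports the class.
    (stem / "__init__.py").write_text(
        "from demo_pkg.stem.wordnet import DemoLemmatizer\n"
    )
    # Top level: the package pulls in everything the subpackage exposes.
    (root / "demo_pkg" / "__init__.py").write_text(
        "from demo_pkg.stem import *\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    try:
        top = importlib.import_module("demo_pkg")
        mid = importlib.import_module("demo_pkg.stem")
        low = importlib.import_module("demo_pkg.stem.wordnet")
    finally:
        sys.path.remove(tmp)

# All three import paths lead to one class object.
same = top.DemoLemmatizer is mid.DemoLemmatizer is low.DemoLemmatizer
# __module__ still names the file where the class was defined.
origin = top.DemoLemmatizer.__module__
print(same, origin)
```

The star import only copies public names (or those listed in __all__, if the subpackage defines it), which is enough here because DemoLemmatizer has no leading underscore.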

One thing to note: it is NOT always the case that objects/modules with the same name refer to the same object in NLTK, for example:

>>> from nltk.corpus import wordnet as wn1
>>> from nltk.corpus.reader import wordnet as wn2
>>> wn1 == wn2
False

>>> wn1.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

>>> wn2.synsets('dog')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'synsets'

The first wordnet, wn1, is a LazyCorpusLoader object that opens the wordnet files in nltk_data and gives you access to the synsets: https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L246

The second, wn2, is the wordnet.py file itself, which resides in nltk/corpus/reader/wordnet.py: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py

It gets even trickier in the following case:

>>> from nltk.corpus import wordnet as wn1
>>> from nltk.corpus.reader import wordnet as wn2
>>> from nltk.stem import wordnet as wn3
>>> wn3 == wn1
False
>>> wn3 == wn2
False

As for wn3, it refers to the file nltk/stem/wordnet.py, which contains WordNetLemmatizer; it has nothing to do with the wordnet corpus object or the corpus reader for wordnet.
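When one name can be bound to a module in one place and to some other object elsewhere, it helps to ask Python what the name actually refers to. A minimal sketch follows, assuming a hypothetical helper named describe; it uses the standard-library datetime module and datetime class as a stand-in for the wordnet name clash.

```python
import inspect
import datetime as dt_module
from datetime import datetime as dt_class


def describe(obj):
    """Return a short label saying what kind of object a name is bound to."""
    if inspect.ismodule(obj):
        return "module " + obj.__name__
    if inspect.isclass(obj):
        return "class " + obj.__module__ + "." + obj.__qualname__
    kind = type(obj)
    return "instance of " + kind.__module__ + "." + kind.__qualname__


# Same word, three different things: a module, a class, an instance.
print(describe(dt_module))
print(describe(dt_class))
print(describe(dt_class(2016, 11, 9)))
```

With NLTK installed, passing wn1, wn2 and wn3 from the session above to describe should make the difference visible; the exact labels may vary with the NLTK version.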