为什么我不能在请求同义词时将wn.ADJ_SAT作为pos传递

时间:2014-11-13 03:41:49

标签: python nlp nltk wordnet

我知道wordnet有一个"adverb synset" type。我知道这是nltk中的synset类型枚举

from nltk.corpus import wordnet as wn
wn.ADJ_SAT
u's'

为什么我不能将它作为同义词的关键字传递给我?

>>> wn.synsets('dog', wn.ADJ_SAT)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1413, in synsets
    for form in self._morphy(lemma, p)
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1627, in _morphy
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]
KeyError: u's'

2 个答案:

答案 0 :(得分:1)

自:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('able')
[Synset('able.a.01'), Synset('able.s.02'), Synset('able.s.03'), Synset('able.s.04')]
>>> wn.synsets('able', pos=wn.ADJ)
[Synset('able.a.01'), Synset('able.s.02'), Synset('able.s.03'), Synset('able.s.04')]
>>> wn.synsets('able', pos=wn.ADJ_SAT)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1413, in synsets
    for form in self._morphy(lemma, p)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1627, in _morphy
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]
KeyError: u's'

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1397,我们看到当您尝试从NLTK wordnet API检索同义词集时,POS限制会出现在调用self._morphy(lemma, p)函数的返回列表解析中:< / p>

def synsets(self, lemma, pos=None, lang='en'):
    """Load all synsets with a given lemma and part of speech tag.
    If no pos is specified, all synsets for all parts of speech
    will be loaded. 
    If lang is specified, all the synsets associated with the lemma name
    of that language will be returned.
    """
    lemma = lemma.lower()

    if lang == 'en':
        get_synset = self._synset_from_pos_and_offset
        index = self._lemma_pos_offset_map
        if pos is None:
            pos = POS_LIST
        return [get_synset(p, offset)
                for p in pos
                for form in self._morphy(lemma, p)
                for offset in index[form].get(p, [])]

如果我们从https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1573查看_morphy()函数。

 def _morphy(self, form, pos):
        # from jordanbg:
        # Given an original string x
        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        # 2. Return all that are in the database
        # 3. If there are no matches, keep applying rules until you either
        #    find a match or you can't go any further

        exceptions = self._exception_map[pos]
        substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

        def apply_rules(forms):
            return [form[:-len(old)] + new
                    for form in forms
                    for old, new in substitutions
                    if form.endswith(old)]

        def filter_forms(forms):
            result = []
            seen = set()
            for form in forms:
                if form in self._lemma_pos_offset_map:
                    if pos in self._lemma_pos_offset_map[form]:
                        if form not in seen:
                            result.append(form)
                            seen.add(form)
            return result

        # 0. Check the exception lists
        if form in exceptions:
            return filter_forms([form] + exceptions[form])

        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        forms = apply_rules([form])

        # 2. Return all that are in the database (and check the original too)
        results = filter_forms([form] + forms)
        if results:
            return results

        # 3. If there are no matches, keep applying rules until we find a match
        while forms:
            forms = apply_rules(forms)
            results = filter_forms(forms)
            if results:
                return results

        # Return an empty list if we can't find anything
        return []

我们看到它从substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]检索一些替换规则,以便在检索以“基础”/“根”形式存储的Synsets之前执行某些形态缩减。 E.g。

>>> from nltk.corpus import wordnet as wn
>>> wn._morphy('dogs', 'n')
[u'dog']

如果我们查看MORPHOLOGICAL_SUBSTITUTIONS,我们会看到ADJ_SAT缺失,请参阅https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1609

MORPHOLOGICAL_SUBSTITUTIONS = {
    NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
           ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
           ('men', 'man'), ('ies', 'y')],
    VERB: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
           ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
    ADJ: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
    ADV: []}

因此,要防止这种情况发生,请在https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1609的第1609行之后添加此行:

MORPHOLOGICAL_SUBSTITUTIONS[ADJ_SAT] = MORPHOLOGICAL_SUBSTITUTIONS[ADJ]

为了概念验证:

>>> MORPHOLOGICAL_SUBSTITUTIONS = {
...     1: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
...            ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
...            ('men', 'man'), ('ies', 'y')],
...     2: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
...            ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
...     3: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
...     4: []}
>>> 
>>> MORPHOLOGICAL_SUBSTITUTIONS[5] = MORPHOLOGICAL_SUBSTITUTIONS[3]
>>> MORPHOLOGICAL_SUBSTITUTIONS
{1: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 2: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')], 3: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 4: [], 5: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]}

答案 1 :(得分:1)