Losing the "instance" relation when iterating over synsets in WordNet 3.0 with NLTK in Python

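Time: 2015-01-25 20:51:37

Tags: python nlp nltk wordnet

For certain reasons, I need to iterate over all noun synsets in WordNet 3.0 and build a tree structure from them in my program.

But when I try this with the code listed below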

from nltk.corpus import wordnet as wn
stack = []
duplicate_check = []
def iterate_all():
    while(stack):
        current_node = stack.pop()
        print current_node,"on top"
        for hypo in current_node.hyponyms():
            stack.append(hypo)
            duplicate_check.append(hypo)
if __name__ == "__main__":
    root = wn.synset("entity.n.01")
    stack.append(root)
    duplicate_check.append(root)
    iterate_all()
    correct_list = list(wn.all_synsets('n'))
#    print list( set(correct_list) - set(duplicate_check) )
    print len(correct_list)
    print len(duplicate_check)

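I get 96,308 records in duplicate_check and 82,115 records in correct_list. The latter, correct_list, contains the correct number of synsets, but duplicate_check does not.

After converting both lists to sets and comparing their elements, I found that the code above loses the "instance" relations among nouns. So can anyone tell me:

(1) In WordNet 3.0, is the "hyponyms" relation the same as "instance of"?

(2) Is there a mistake in my code that prevents the synsets linked by the "instance" relation from being added to duplicate_check?

Thank you very much for your time.

Environment: Ubuntu 14.04 + Python 2.7 + latest NLTK + WordNet 3.0

2 answers:

Answer 0: (score: 0)

First, there is no need to iterate top-down from entity.n.01 to get its hyponyms; you can simply check the root_hypernyms of every synset bottom-up: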

>>> from nltk.corpus import wordnet as wn
>>> len(set(wn.all_synsets('n')))
82115
>>> entity = wn.synset('entity.n.01')
>>> len([i for i in wn.all_synsets('n') if entity in i.root_hypernyms()])
82115

Here is the working code of Synset.root_hypernyms(), from https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L439

def root_hypernyms(self):
    """Get the topmost hypernyms of this synset in WordNet."""

    result = []
    seen = set()
    todo = [self]
    while todo:
        next_synset = todo.pop()
        if next_synset not in seen:
            seen.add(next_synset)
            next_hypernyms = next_synset.hypernyms() + \
                next_synset.instance_hypernyms()
            if not next_hypernyms:
                result.append(next_synset)
            else:
                todo.extend(next_hypernyms)
    return result
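
Going top-down works too, provided the traversal follows instance_hyponyms() as well as hyponyms() and skips synsets it has already visited. The following is a minimal sketch rather than NLTK's own code: collect_descendants and noun_synsets_under_entity are hypothetical helper names, and the relation is passed in as a function so the traversal can be tried on a toy graph before touching WordNet.

```python
def collect_descendants(root, children_of):
    """Return the set of nodes reachable from root, including root itself."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            # A node with several parents is reached more than once; skip repeats.
            continue
        seen.add(node)
        stack.extend(children_of(node))
    return seen


def noun_synsets_under_entity():
    """Apply the traversal to WordNet (requires NLTK and its WordNet data)."""
    from nltk.corpus import wordnet as wn
    entity = wn.synset('entity.n.01')
    # Follow both the ordinary and the instance hyponym links.
    return collect_descendants(
        entity, lambda s: s.hyponyms() + s.instance_hyponyms())


# Toy graph: 'd' has two parents, so a traversal without `seen` would visit it twice.
toy = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
print(sorted(collect_descendants('a', toy.__getitem__)))  # prints ['a', 'b', 'c', 'd']
```

If the root_hypernyms() check above holds, len(noun_synsets_under_entity()) should match len(list(wn.all_synsets('n'))). Repeated visits to synsets with more than one parent are also a plausible reason why a list filled without a seen check ends up longer than the number of distinct synsets.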

There is another way to walk the hypernyms/hyponyms, but it does not seem to be as complete as NLTK's root_hypernyms(); see How to get all the hyponyms of a word/synset in python nltk and wordnet?

>>> len(set([s for s in entity.closure(lambda s:s.hyponyms())]))
74373

To iterate over them one by one:

>>> for s in entity.closure(lambda s:s.hyponyms()):
...     print s

Let's try it bottom-up:

>>> from nltk.corpus import wordnet as wn
>>> 
>>> synsets_with_entity_root = 0
>>> entity = wn.synset('entity.n.01')
>>> 
>>> for i in wn.all_synsets('n'):
...     # Get root hypernym the hard way.
...     x = set([s for s in i.closure(lambda s:s.hypernyms())])
...     if entity in x:
...             synsets_with_entity_root +=1
... 

>>> print synsets_with_entity_root
74373

It seems that we are missing ~8000 synsets when traversing the hypernym/hyponym tree this way, whether bottom-up or top-down, so let's check:

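from nltk.corpus import wordnet as wn

synsets_with_entity_root = 0
entity = wn.synset('entity.n.01')

for i in wn.all_synsets('n'):
    # Get root hypernym the hard way.
    x = set([s for s in i.closure(lambda s:s.hypernyms())])
    if entity in x:
        synsets_with_entity_root +=1
    else:
        # Not reached through hypernyms() alone; show what root_hypernyms() finds.
        print i, i.root_hypernyms()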

You will get a list of the ~8000 missing synsets; here are the first few you will see:

Synset('entity.n.01') [Synset('entity.n.01')]
Synset('hegira.n.01') [Synset('entity.n.01')]
Synset('underground_railroad.n.01') [Synset('entity.n.01')]
Synset('babylonian_captivity.n.01') [Synset('entity.n.01')]
Synset('creation.n.05') [Synset('entity.n.01')]
Synset('berlin_airlift.n.01') [Synset('entity.n.01')]
Synset('secession.n.02') [Synset('entity.n.01')]
Synset('human_genome_project.n.01') [Synset('entity.n.01')]
Synset('manhattan_project.n.02') [Synset('entity.n.01')]
Synset('peasant's_revolt.n.01') [Synset('entity.n.01')]
Synset('first_crusade.n.01') [Synset('entity.n.01')]
Synset('second_crusade.n.01') [Synset('entity.n.01')]
Synset('third_crusade.n.01') [Synset('entity.n.01')]
Synset('fourth_crusade.n.01') [Synset('entity.n.01')]
Synset('fifth_crusade.n.01') [Synset('entity.n.01')]
Synset('sixth_crusade.n.01') [Synset('entity.n.01')]
Synset('seventh_crusade.n.01') [Synset('entity.n.01')]

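So closure() over hypernyms() or hyponyms() alone is somewhat lossy: unlike root_hypernyms(), it does not follow the instance relations, so the missing synsets are the ones that reach entity.n.01 only through an instance link (plus entity.n.01 itself, which is not part of its own closure). If you do not care about exact numbers, it is still an elegant approach.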
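
The loss can probably be avoided by handing closure() a relation that returns both kinds of links. This is a sketch under that assumption and not something the original answer ran: hyponyms_with_instances and count_entity_closure are hypothetical names, and the SimpleNamespace object is only a stand-in used to try the relation without WordNet installed.

```python
from types import SimpleNamespace


def hyponyms_with_instances(synset):
    """Relation for closure(): ordinary hyponyms plus instance hyponyms."""
    return synset.hyponyms() + synset.instance_hyponyms()


def count_entity_closure():
    """Count noun synsets reachable from entity.n.01 (requires NLTK's WordNet data)."""
    from nltk.corpus import wordnet as wn
    entity = wn.synset('entity.n.01')
    reached = set(entity.closure(hyponyms_with_instances))
    reached.add(entity)  # closure() does not yield the starting synset itself
    return len(reached)


# Stand-in object (not an NLTK class) exposing the two methods the relation calls.
fake = SimpleNamespace(hyponyms=lambda: ['class_child'],
                       instance_hyponyms=lambda: ['instance_child'])
print(hyponyms_with_instances(fake))  # prints ['class_child', 'instance_child']
```

The result of count_entity_closure() can then be compared with len(list(wn.all_synsets('n'))) to see whether any synsets are still unreached.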

Answer 1: (score: 0)

This code prevents the error from occurring:

from nltk.corpus import wordnet

word = 'rock'
# Walk each noun sense of `word` up towards entity.n.01 via its first hypernym.
for sense in wordnet.synsets(word, pos='n'):
    syn = sense.name()
    print("---------", syn, "--------", sense.definition())
    while syn != 'entity.n.01':
        hyper = wordnet.synset(syn).hypernyms()
        if len(hyper) > 0:
            name = hyper[0].name().split('.')[0]
            print(name)
        else:
            # hypernyms() is empty for a synset linked only by instance_hypernyms().
            print("**** INSTANCE_OF sense, without any hypernyms ****")
            break
        syn = hyper[0].name()
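
Instead of stopping at an instance-only sense, the walk could fall back to instance_hypernyms() and keep climbing. The following is a rough sketch of that idea, not part of the original answer: path_to_root, wordnet_parents and print_paths are hypothetical names, and the parent lookup is passed in as a function so the climb can be checked on a toy mapping.

```python
def path_to_root(node, parents_of):
    """Follow the first parent link repeatedly until a node has no parents."""
    path = [node]
    while True:
        parents = parents_of(path[-1])
        if not parents:
            return path
        path.append(parents[0])


def wordnet_parents(synset):
    """Ordinary hypernyms if there are any, otherwise instance hypernyms."""
    return synset.hypernyms() or synset.instance_hypernyms()


def print_paths(word):
    """Print one path to the root per noun sense (requires NLTK's WordNet data)."""
    from nltk.corpus import wordnet
    for sense in wordnet.synsets(word, pos='n'):
        print(' > '.join(s.name() for s in path_to_root(sense, wordnet_parents)))


# Toy mapping from each node to its parents, used in place of WordNet.
toy_parents = {'instance': ['class'], 'class': ['root'], 'root': []}
print(path_to_root('instance', toy_parents.__getitem__))  # prints ['instance', 'class', 'root']
```

Calling print_paths('rock') would then print a chain for every noun sense. Like the code above, it follows only the first parent at each step, so senses with several hypernyms show just one of their paths.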