Stanford typed dependencies using CoreNLP in Python

Date: 2019-06-10 13:54:55

Tags: python, parsing, nlp, stanford-nlp

In the Stanford Dependency Manual they mention "Stanford typed dependencies", in particular the type "neg" (negation modifier). It is also available when using the Stanford Enhanced++ parser through the website. For example, the sentence:

"Barack Obama was not born in Hawaii"


The parser does find the relation neg(born, not).

But when I use the stanfordnlp Python library, the only dependency parser I can get parses the sentence as follows:

('Barack', '5', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '5', 'aux:pass')
('not', '5', 'advmod')
('born', '0', 'root')
('in', '7', 'case')
('Hawaii', '5', 'obl')

And the code that generates it:

import stanfordnlp

stanfordnlp.download('en')  # download the English neural models (only needed once)
nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was not born in Hawaii")
a = doc.sentences[0]
a.print_dependencies()

Is there a way to get results similar to the Enhanced++ dependency parser, or any other Stanford parser that produces typed dependencies, so that I get the negation modifier?

5 Answers:

Answer 0 (score: 2):

It is worth noting that the Python library stanfordnlp is not just a Python wrapper for Stanford CoreNLP.

1. The difference between StanfordNLP and CoreNLP

As stated on the stanfordnlp GitHub repo:

"The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server."

stanfordnlp contains a new set of neural network models, trained on the CoNLL 2018 Shared Task. The online parser is based on the CoreNLP 3.9.2 Java library. As explained here, they are two different pipelines and two different sets of models.

Your code only accesses the neural pipeline trained on CoNLL 2018 data, which explains the differences you see compared to the online version. They are basically two different models.

What added to my confusion is that both repositories belong to the GitHub user named stanfordnlp (which is the team name). Don't mix up the Java stanfordnlp/CoreNLP and the Python stanfordnlp/stanfordnlp.

Concerning your "neg" issue, it seems that in the Python library stanfordnlp they decided to cover negation with the "advmod" annotation altogether. At least that is what I ran into for a few example sentences.
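Until that changes, one workaround is to post-process the neural pipeline's output and treat an "advmod" whose dependent is a negation word as the old "neg" relation. A minimal sketch (the heuristic and the word list below are my own assumptions, not part of the library):

import stanfordnlp

# heuristic sketch: recover 'neg' from the neural pipeline's 'advmod' labels;
# the negation word list is an assumption, not something stanfordnlp provides
NEGATION_WORDS = {"not", "n't", "never", "no"}

nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was not born in Hawaii")

for word in doc.sentences[0].words:
    if word.dependency_relation == 'advmod' and word.text.lower() in NEGATION_WORDS:
        governor = doc.sentences[0].words[int(word.governor) - 1]  # governor indices are 1-based
        print("neg({}, {})".format(governor.text, word.text))  # prints: neg(born, not)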

2. Using CoreNLP through the stanfordnlp package

However, you can still access CoreNLP through the stanfordnlp package. It requires a few more setup steps, though. Citing the GitHub repo:

There are a few initial setup steps.

  • Download Stanford CoreNLP and the models for the language you wish to use (you can download CoreNLP and the language models here).
  • Put the model jars in the distribution folder.
  • Tell the Python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05

Once that is done, you can start a client with the code from the demo:

from stanfordnlp.server import CoreNLPClient

text = "Barack Obama was not born in Hawaii."  # `text` was undefined in the demo snippet; defined here so it runs

with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    print('---')
    print('dependency parse of first sentence')
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)

    # get the tokens of the first sentence
    # note that 1 token is 1 node in the parse tree; nodes start at 1
    print('---')
    print('Tokens of first sentence')
    for token in sentence.token:
        print(token)

So if you specify the "depparse" annotator (along with the prerequisite annotators tokenize, ssplit, and pos), your sentence gets parsed. Reading the demo, it feels like we can only access basicDependencies; I have not managed to get Enhanced++ dependencies to work through stanfordnlp.
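(For what it's worth, the Sentence protobuf message does declare an enhancedPlusPlusDependencies field alongside basicDependencies; whether the "depparse" annotator populates it through this client is something I have not verified:)

# untested sketch, reusing `sentence` from the snippet above: the protobuf
# declares this field, but I have not verified 'depparse' fills it in here
enhanced_parse = sentence.enhancedPlusPlusDependencies
print(enhanced_parse)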

But the negation still shows up if you use basicDependencies!

Here is the output I obtained using stanfordnlp and your example sentence. It is a DependencyGraph object, not pretty, but unfortunately that is always the case when we go this deep into the CoreNLP tools. You will see that between nodes 4 and 5 ("not" and "born") there is an edge "neg".

node {
  sentenceIndex: 0
  index: 1
}
node {
  sentenceIndex: 0
  index: 2
}
node {
  sentenceIndex: 0
  index: 3
}
node {
  sentenceIndex: 0
  index: 4
}
node {
  sentenceIndex: 0
  index: 5
}
node {
  sentenceIndex: 0
  index: 6
}
node {
  sentenceIndex: 0
  index: 7
}
node {
  sentenceIndex: 0
  index: 8
}
edge {
  source: 2
  target: 1
  dep: "compound"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 2
  dep: "nsubjpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 3
  dep: "auxpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 4
  dep: "neg"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 7
  dep: "nmod"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 8
  dep: "punct"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 7
  target: 6
  dep: "case"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
root: 5

---
Tokens of first sentence
word: "Barack"
pos: "NNP"
value: "Barack"
before: ""
after: " "
originalText: "Barack"
beginChar: 0
endChar: 6
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false

word: "Obama"
pos: "NNP"
value: "Obama"
before: " "
after: " "
originalText: "Obama"
beginChar: 7
endChar: 12
tokenBeginIndex: 1
tokenEndIndex: 2
hasXmlContext: false
isNewline: false

word: "was"
pos: "VBD"
value: "was"
before: " "
after: " "
originalText: "was"
beginChar: 13
endChar: 16
tokenBeginIndex: 2
tokenEndIndex: 3
hasXmlContext: false
isNewline: false

word: "not"
pos: "RB"
value: "not"
before: " "
after: " "
originalText: "not"
beginChar: 17
endChar: 20
tokenBeginIndex: 3
tokenEndIndex: 4
hasXmlContext: false
isNewline: false

word: "born"
pos: "VBN"
value: "born"
before: " "
after: " "
originalText: "born"
beginChar: 21
endChar: 25
tokenBeginIndex: 4
tokenEndIndex: 5
hasXmlContext: false
isNewline: false

word: "in"
pos: "IN"
value: "in"
before: " "
after: " "
originalText: "in"
beginChar: 26
endChar: 28
tokenBeginIndex: 5
tokenEndIndex: 6
hasXmlContext: false
isNewline: false

word: "Hawaii"
pos: "NNP"
value: "Hawaii"
before: " "
after: ""
originalText: "Hawaii"
beginChar: 29
endChar: 35
tokenBeginIndex: 6
tokenEndIndex: 7
hasXmlContext: false
isNewline: false

word: "."
pos: "."
value: "."
before: ""
after: ""
originalText: "."
beginChar: 35
endChar: 36
tokenBeginIndex: 7
tokenEndIndex: 8
hasXmlContext: false
isNewline: false

3. Using CoreNLP through the NLTK package

I will not go into detail on this one, but there is also a solution to access the CoreNLP server through the NLTK library if everything else fails. It does output the negations, but requires a little more work to start the server. Details on this page.
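For reference, a minimal sketch with NLTK's CoreNLP wrapper could look like this. It assumes you have already started a CoreNLP server separately (the URL and port below are the wrapper's defaults, not something from the linked page):

from nltk.parse.corenlp import CoreNLPDependencyParser

# assumes a CoreNLP server is already running, e.g. started with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
parser = CoreNLPDependencyParser(url='http://localhost:9000')

parse, = parser.raw_parse("Barack Obama was not born in Hawaii")
for governor, dep, dependent in parse.triples():
    print(governor, dep, dependent)  # one of the triples should carry the negation relation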

Edit:

I figured I could also share with you the code to turn the DependencyGraph into a nice list of (dependency, argument1, argument2), in a shape similar to the stanfordnlp output.

from stanfordnlp.server import CoreNLPClient

text = "Barack Obama was not born in Hawaii."

# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    dependency_parse = sentence.basicDependencies

    #print(dir(sentence.token[0])) #to find all the attributes and methods of a Token object
    #print(dir(dependency_parse)) #to find all the attributes and methods of a DependencyGraph object
    #print(dir(dependency_parse.edge))

    # get a dictionary associating each token/node with its label
    token_dict = {}
    for i in range(0, len(sentence.token)):
        token_dict[sentence.token[i].tokenEndIndex] = sentence.token[i].word

    #get a list of the dependencies with the words they connect
    list_dep=[]
    for i in range(0, len(dependency_parse.edge)):

        source_node = dependency_parse.edge[i].source
        source_name = token_dict[source_node]

        target_node = dependency_parse.edge[i].target
        target_name = token_dict[target_node]

        dep = dependency_parse.edge[i].dep

        list_dep.append((dep, 
            str(source_node)+'-'+source_name, 
            str(target_node)+'-'+target_name))
    print(list_dep)

It outputs the following:

[('compound', '2-Obama', '1-Barack'), ('nsubjpass', '5-born', '2-Obama'), ('auxpass', '5-born', '3-was'), ('neg', '5-born', '4-not'), ('nmod', '5-born', '7-Hawaii'), ('punct', '5-born', '8-.'), ('case', '7-Hawaii', '6-in')]

Answer 1 (score: 1):

from stanfordnlp.server import CoreNLPClient  # missing import, added so the snippet runs

text = "Barack Obama was born in Hawaii."  # assumed example input; it matches the output shown below

# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    offset = 0 # keeps track of token offset for each sentence
    for sentence in ann.sentence:
        print('___________________')
        print('dependency parse:')
        # extract dependency parse
        dp = sentence.basicDependencies
        # build a helper dict to associate token index and label
        token_dict = {sentence.token[i].tokenEndIndex-offset : sentence.token[i].word for i in range(0, len(sentence.token))}
        offset += len(sentence.token)

        # build list of (source, target) pairs
        out_parse = [(dp.edge[i].source, dp.edge[i].target) for i in range(0, len(dp.edge))]

        for source, target in out_parse:
            print(source, token_dict[source], '->', target, token_dict[target])

        print('\nTokens \t POS \t NER')
        for token in sentence.token:
            print (token.word, '\t', token.pos, '\t', token.ner)

This outputs the following for the first sentence:

___________________
dependency parse:
2 Obama -> 1 Barack
4 born -> 2 Obama
4 born -> 3 was
4 born -> 6 Hawaii
4 born -> 7 .
6 Hawaii -> 5 in

Tokens   POS     NER
Barack   NNP     PERSON
Obama    NNP     PERSON
was      VBD     O
born     VBN     O
in       IN      O
Hawaii   NNP     STATE_OR_PROVINCE
.        .       O

Answer 2 (score: 1):

As of 2021:

Note: run this code from a terminal; it does not work in a notebook due to stdin compatibility issues.

import os
os.environ["CORENLP_HOME"] = "./stanford-corenlp-4.2.0"  # path to the unzipped CoreNLP distribution
import pandas as pd
from stanza.server import CoreNLPClient  # stanza is the renamed successor of the stanfordnlp package
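The answer stops there; a minimal continuation under the same setup might look like the following (the example sentence is an assumption, and the stanza.server client mirrors the stanfordnlp.server API used in the answers above):

# sketch continuing the setup above; stanza.server.CoreNLPClient takes the
# same arguments as the stanfordnlp.server client shown earlier
text = "Barack Obama was not born in Hawaii."

with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'depparse'],
                   timeout=60000, memory='8G') as client:
    ann = client.annotate(text)
    sentence = ann.sentence[0]
    # note: which label negation gets ('neg' vs. 'advmod') depends on the
    # dependency scheme of the CoreNLP version you downloaded
    print(sentence.basicDependencies)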

Answer 3 (score: 0):

In my opinion, there may be a discrepancy between the model used to generate the dependencies for the docs and the model that is available online, hence the difference. I would raise the issue directly with the stanfordnlp library maintainers via their GitHub issues.

Answer 4 (score: 0):

Another option is spaCy (https://spacy.io/api/dependencyparser), whose English models typically label the negation modifier with the dependency tag "neg". To install it:

  • pip install -U pip setuptools wheel
  • pip install -U spacy
  • python -m spacy download en_core_web_lg

Example code:

import spacy

nlp = spacy.load('en_core_web_lg')

def printInfo(doc):
    # print each token's attributes, including its dependency label (token.dep_)
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_,
              token.shape_, token.is_alpha, token.is_stop,
              token.ent_type_, token.dep_, token.head.text)

doc = nlp("Barack Obama was not born in Hawaii")
printInfo(doc)
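To pull out just the negation, you can filter on the dependency label. A small sketch (with the en_core_web models, "not" typically comes back with the label "neg"):

# small sketch: filter tokens by dependency label to recover neg(head, token)
doc = nlp("Barack Obama was not born in Hawaii")
for token in doc:
    if token.dep_ == 'neg':
        print('neg({}, {})'.format(token.head.text, token.text))
# with these models this should print: neg(born, not)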