How can I find the shortest dependency path between two words in Python?

Asked: 2015-09-29 03:39:17

Tags: python nltk text-parsing

I am trying to find the dependency path between two words in Python, given the dependency tree.

For the sentence

  Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

I used practnlptools (https://github.com/biplab-iitb/practNLPTools) to get the dependency parse, which looks like:

nsubj(are-5, Robots-1)
xsubj(remind-8, Robots-1)
amod(culture-4, popular-3)
prep_in(Robots-1, culture-4)
root(ROOT-0, are-5)
advmod(are-5, there-6)
aux(remind-8, to-7)
xcomp(are-5, remind-8)
dobj(remind-8, us-9)
det(awesomeness-12, the-11)
prep_of(remind-8, awesomeness-12)
amod(agency-16, unbound-14)
amod(agency-16, human-15)
prep_of(awesomeness-12, agency-16)

It can also be viewed as follows (image from https://demos.explosion.ai/displacy/):

[dependency parse visualization from displaCy]

The path length between "robots" and "are" is 1, and the path length between "robots" and "awesomeness" is 4.

My question is: given the dependency parse results above, how can I get the dependency path, or the dependency path length, between two words?

From my search results so far, would nltk's ParentedTree be of any help?

Thanks!

3 Answers:

Answer 0 (score: 11)

Your problem can easily be conceived of as a graph problem where we have to find the shortest path between two nodes.

To turn the dependency parse into a graph, we first have to deal with the fact that it comes as a string. You want to take this:

'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'

and make it look like this:

[('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]

This way you can feed the list of tuples to the graph constructor of the networkx module, which will parse the list and build the graph for you, and which also gives you a neat method that returns the length of the shortest path between two given nodes.

Necessary imports:

import re
import networkx as nx
from practnlptools.tools import Annotator

How to get your string into the desired list-of-tuples format:

annotator = Annotator()
text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']

dp_list = dep_parse.split('\n')
pattern = re.compile(r'.+?\((.+?), (.+?)\)')  # captures the two arguments of each relation
edges = []
for dep in dp_list:
    m = pattern.search(dep)
    edges.append((m.group(1), m.group(2)))

How to build the graph:

graph = nx.Graph(edges)  # Well that was easy

How to compute the shortest path length:

print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))

This script will reveal that, given this dependency parse, the shortest path is actually of length 2, since you can get from Robots-1 to awesomeness-12 by passing through remind-8:

1. xsubj(remind-8, Robots-1)
2. prep_of(remind-8, awesomeness-12)
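
To see the actual nodes on that path rather than just its length, networkx's shortest_path can be added to the script above:

# One shortest node sequence between the two words
print(nx.shortest_path(graph, source='Robots-1', target='awesomeness-12'))
# ['Robots-1', 'remind-8', 'awesomeness-12']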

If you don't like this result, you might want to consider filtering out some dependencies, in this case not allowing xsubj dependencies to be added to the graph; a sketch of that follows below.
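
For example, a minimal sketch of such filtering, reusing dp_list and pattern from the code above (the excluded set here is just an illustration):

excluded = {'xsubj'}  # relations to keep out of the graph
filtered_edges = []
for dep in dp_list:
    m = pattern.search(dep)
    rel = dep.split('(')[0]  # relation name, e.g. 'xsubj'
    if m and rel not in excluded:
        filtered_edges.append((m.group(1), m.group(2)))

filtered_graph = nx.Graph(filtered_edges)
# Without xsubj, the shortest path runs Robots-1 -> are-5 -> remind-8 -> awesomeness-12,
# so this should now print 3.
print(nx.shortest_path_length(filtered_graph, source='Robots-1', target='awesomeness-12'))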

Answer 1 (score: 8)

HugoMailhot's answer is great. I'll write something similar for spacy users who want to find the shortest dependency path between two words (whereas HugoMailhot's answer relies on practNLPTools).

The sentence:

  Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

has the following dependency tree:

[spaCy dependency tree visualization]

Here is the code to find the shortest dependency path between two words:

import networkx as nx
import spacy
nlp = spacy.load('en')

# https://spacy.io/docs/usage/processing-text
document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)

print('document: {0}'.format(document))

# Load spacy's dependency tree into a networkx graph
edges = []
for token in document:
    # FYI https://spacy.io/docs/api/token
    for child in token.children:
        edges.append(('{0}-{1}'.format(token.lower_,token.i),
                      '{0}-{1}'.format(child.lower_,child.i)))

graph = nx.Graph(edges)

# https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='agency-15'))

Output:

4
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']
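
If you also want the dependency labels along the path, a small extension (a sketch, not part of the original answer) stores each child's dep_ as an edge attribute:

# Sketch: rebuild the graph with the child's dependency label on each edge.
labeled_edges = []
for token in document:
    for child in token.children:
        labeled_edges.append(('{0}-{1}'.format(token.lower_, token.i),
                              '{0}-{1}'.format(child.lower_, child.i),
                              {'dep': child.dep_}))

labeled_graph = nx.Graph(labeled_edges)
path = nx.shortest_path(labeled_graph, source='robots-0', target='awesomeness-11')
for u, v in zip(path, path[1:]):
    print('{0} --{1}-- {2}'.format(u, labeled_graph[u][v]['dep'], v))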

To install spacy and networkx:

sudo pip install networkx 
sudo pip install spacy
sudo python -m spacy.en.download parser # will take 0.5 GB

Some benchmarks regarding spacy's dependency parsing: https://spacy.io/docs/api/


Answer 2 (score: 2)

This answer relies on Stanford CoreNLP to obtain the dependency tree of a sentence, and borrows some of the networkx code from HugoMailhot's answer.

Before running the code, you need to:

  1. sudo pip install pycorenlp (a Python interface to Stanford CoreNLP)
  2. Download Stanford CoreNLP
  3. Start a Stanford CoreNLP server as follows:

    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000
    
  4. Then you can run the following code to find the shortest dependency path between two words:

    import networkx as nx
    from pycorenlp import StanfordCoreNLP
    from pprint import pprint
    
    nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))
    def get_stanford_annotations(text, port=9000,
                                 annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
        output = nlp.annotate(text, properties={
            "timeout": "10000",
            "ssplit.newlineIsSentenceBreak": "two",
            'annotators': annotators,
            'outputFormat': 'json'
        })
        return output
    
    # The code expects the document to contain exactly one sentence.
    document = 'Robots in popular culture are there to remind us of the awesomeness of '\
               'unbound human agency.'
    print('document: {0}'.format(document))
    
    # Parse the text
    annotations = get_stanford_annotations(document, port=9000,
                                           annotators='tokenize,ssplit,pos,lemma,depparse')
    tokens = annotations['sentences'][0]['tokens']
    
    # Load Stanford CoreNLP's dependency tree into a networkx graph
    edges = []
    dependencies = {}
    for edge in annotations['sentences'][0]['basic-dependencies']:
        edges.append((edge['governor'], edge['dependent']))
        dependencies[(min(edge['governor'], edge['dependent']),
                      max(edge['governor'], edge['dependent']))] = edge
    
    graph = nx.Graph(edges)
    #pprint(dependencies)
    #print('edges: {0}'.format(edges))
    
    # Find the shortest path
    token1 = 'Robots'
    token2 = 'awesomeness'
    for token in tokens:
        if token1 == token['originalText']:
            token1_index = token['index']
        if token2 == token['originalText']:
            token2_index = token['index']
    
    path = nx.shortest_path(graph, source=token1_index, target=token2_index)
    print('path: {0}'.format(path))
    
    for token_id in path:
        token = tokens[token_id-1]
        token_text = token['originalText']
        print('Node {0}\ttoken_text: {1}'.format(token_id,token_text))
    

    The output is:

    document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
    path: [1, 5, 8, 12]
    Node 1  token_text: Robots
    Node 5  token_text: are
    Node 8  token_text: remind
    Node 12 token_text: awesomeness
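
    The dependencies dict built in the code above keeps the full edge records, so the grammatical relation on each edge of the path can also be printed (a small sketch reusing path and dependencies; the dep/governorGloss/dependentGloss fields come from CoreNLP's JSON output):

    # Print the relation on each edge along the shortest path
    # (reuses `path` and `dependencies` from the code above).
    for source_id, target_id in zip(path, path[1:]):
        edge = dependencies[(min(source_id, target_id), max(source_id, target_id))]
        print('{0} -{1}-> {2}'.format(edge['governorGloss'], edge['dep'], edge['dependentGloss']))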
    

    Note that you can test Stanford CoreNLP online: http://nlp.stanford.edu:8080/parser/index.jsp

    This answer was tested with Stanford CoreNLP 3.6.0, pycorenlp 0.3.0, and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.