Question

I'm new to regex. I have data in the format "(ENTITY A)-[:RELATION {}]->(ENTITY B)", for example (Canberra)-[:capital_of {}]->(Australia). How can I extract the two entities and the relation?

I tried the following code:

import re

path = "(Canberra)-[:capital_of {}]->(Australia)"
pattern = r'\(.*\)\-\[\:.*\]\-\>\(.*\)'
re.match(pattern,path).group()

But it matches the whole string instead of the individual parts. Any help would be appreciated.

Answer 1

If you don't need to use regex, you can use

s="(Canberra)-[:capital_of {}]->(Australia)"
entityA = s[1:].split(')-')[0]
entityB = s.split('->(')[-1][:-1]

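The input string is split on occurrences of the ')-' substring, and the first piece is taken to get the first entity. The s[1:] slice drops the leading '('.

A second split() is done on the substring '->(', and the last piece is selected to get the second entity. The [:-1] slice drops the trailing ')'.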

So

print(f'EntityA: {entityA}')
print(f'EntityB: {entityB}')

will give

EntityA: Canberra
EntityB: Australia
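
The question also asks for the relation, which the two splits above do not return. A minimal sketch of the same string-method idea that returns all three parts could look like this. The function name split_path is hypothetical, and the sketch assumes the input always has exactly the shape shown in the question, with the relation name separated from the property map by a space.

```python
def split_path(s):
    """Split '(A)-[:REL {}]->(B)' into (A, REL, B) using only str methods."""
    # Cut at the first ')-[' separator, then at the last ']->(' separator.
    left, _, rest = s.partition(")-[")
    middle, _, right = rest.rpartition("]->(")
    entity_a = left[1:]    # drop the leading '('
    entity_b = right[:-1]  # drop the trailing ')'
    # Assumption: the relation name is the text between ':' and the first space.
    relation = middle.lstrip(":").split(" ", 1)[0]
    return entity_a, relation, entity_b

print(split_path("(Canberra)-[:capital_of {}]->(Australia)"))
# ('Canberra', 'capital_of', 'Australia')
```

Like the original one-liners, this does no validation: a string in a different shape silently produces wrong pieces rather than raising an error.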

For a simple, fixed format like this one, non-regex solutions are usually faster.

Edit: timings, as requested in the comments.

import re

s="(Canberra)-[:capital_of {}]->(Australia)"
def regex_soln(s):
    pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
    rv = re.match(pattern,s).groups()
    return rv[0], rv[-1]

def non_regex_soln(s):
    return s[1:].split(')-')[0], s.split('->(')[-1][:-1]

%timeit regex_soln(s)
1.47 µs ± 60.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


%timeit non_regex_soln(s)
619 ns ± 30.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
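
%timeit only works in IPython or Jupyter. To reproduce the comparison in a plain Python script, something along these lines with the standard timeit module should work. This is a sketch, not the measurement used above: the precompiled PATTERN constant is a variation introduced here, the lambda wrapper adds a small constant overhead to both measurements, and absolute numbers will differ from machine to machine.

```python
import re
import timeit

s = "(Canberra)-[:capital_of {}]->(Australia)"

# Variation on the code above: compile the pattern once, outside the function.
PATTERN = re.compile(r'\((.*)\)-\[(:.*)\]->\((.*)\)')

def regex_soln(s):
    rv = PATTERN.match(s).groups()
    return rv[0], rv[-1]

def non_regex_soln(s):
    return s[1:].split(')-')[0], s.split('->(')[-1][:-1]

N = 100_000  # calls per timing run
for fn in (regex_soln, non_regex_soln):
    # Take the best of 3 runs to reduce noise from other processes.
    best = min(timeit.repeat(lambda: fn(s), number=N, repeat=3))
    print(f"{fn.__name__}: {best / N * 1e9:.0f} ns per call")
```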

Answer 2

You're almost there. You need to define each group you want to capture by enclosing it in ().

The code would look like

import re
path = "(Canberra)-[:capital_of {}]->(Australia)"
pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
print(re.match(pattern,path).groups())

The output will be

('Canberra', ':capital_of {}', 'Australia')
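
Note that the second group above still includes the leading ':' and the trailing ' {}', and that the greedy .* can run too far if a string ever contains more than one pair of parentheses per entity. A possible tightening, sketched under the assumptions that entity names never contain ')' and that relation names consist only of word characters, uses named groups and negated character classes. The names PATH_RE and parse_path are hypothetical.

```python
import re

PATH_RE = re.compile(
    r"\((?P<a>[^)]*)\)"           # (ENTITY A): anything up to the first ')'
    r"-\[:(?P<rel>\w+)[^\]]*\]"   # -[:RELATION {...}]: capture the name only
    r"->\((?P<b>[^)]*)\)"         # ->(ENTITY B)
)

def parse_path(path):
    """Return (entity_a, relation, entity_b), or None if the format differs."""
    m = PATH_RE.fullmatch(path)
    if m is None:
        return None
    return m.group("a"), m.group("rel"), m.group("b")

print(parse_path("(Canberra)-[:capital_of {}]->(Australia)"))
# ('Canberra', 'capital_of', 'Australia')
```

Using fullmatch means a string that does not follow the format yields None, instead of the AttributeError that re.match(...).groups() raises when there is no match.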

Answer 3

This looks like some DSL, a domain-specific language, so you might as well write a small parser. Here, we use a PEG parser called parsimonious.

You'll need a small grammar and a NodeVisitor class:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

path = "(Canberra)-[:capital_of {}]->(Australia)"

class PathVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        path    = (pair junk?)+
        pair    = lpar notpar rpar

        lpar    = ~"[(\[]+"
        rpar    = ~"[)\]]+"

        notpar  = ~"[^][()]+"
        junk    = ~"[-:>]+"
        """
    )

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        _, value, _ = visited_children
        return value.text

    def visit_path(self, node, visited_children):
        return [child[0] for child in visited_children]

pv = PathVisitor()
output = pv.parse(path)
print(output)

which would yield

['Canberra', ':capital_of {}', 'Australia']

How to extract multiple strings using regex
