How to find the shortest distance between two regex matches

Time: 2019-12-04 17:48:12

Tags: python regex

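I have a set of documents in which I can search for specific entities, and I need to find the shortest distance between two of them. Say I have a document in which I search for Trump and Ukraine, and I get a list of mentions with their start and end positions: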

import re

text = """
 Three constitutional scholars invited by Democrats to testify at Wednesday’s impeachment hearings said that President Trump’s efforts to pressure Ukraine for political gain clearly meet the historical definition of impeachable offenses, according to copies of their opening statements.
 ˜Noah Feldman, a professor at Harvard, argued that attempts by Mr. Trump to withhold a White House meeting and military assistance from Ukraine as leverage for political favors constitute impeachable conduct, as does the act of soliciting foreign assistance on a phone call with Ukraine’s leader.
"""
p1 = re.compile("Trump")
p2 = re.compile("Ukraine")
res1 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p1.finditer(text)]
res2 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p2.finditer(text)]
print(res1)
print(res2)

Output:

[{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

In this case the answer is 148 - 125 = 23. How would you suggest doing this in the most pythonic way?

3 answers:

Answer 0: (score: 2)

One solution is to capture the text between the two matches and take its length, as follows:

min([len(x) for x in re.findall(r'Trump(.*?)Ukraine', text)])

For the text above this evaluates to 23.
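The one-liner has two limits worth keeping in mind: the lazy `.*?` always starts from the first "Trump" it finds, so an earlier mention can hide a closer one, and `.` does not cross line breaks by default. A minimal sketch of one way around both, assuming the entity names are literal strings (the function name `shortest_gap` and the sample sentence are made up for illustration):

```python
import re

def shortest_gap(text, first, second):
    """Length of the shortest stretch of text between `first` and a later `second`."""
    a, b = re.escape(first), re.escape(second)
    # The tempered dot (?:(?!a).)*? refuses to step over another `first`,
    # so each capture starts at the `first` nearest to the `second` that follows.
    pattern = re.compile(f"{a}((?:(?!{a}).)*?){b}", re.DOTALL)  # DOTALL: "." also matches newlines
    gaps = [len(m.group(1)) for m in pattern.finditer(text)]
    return min(gaps, default=None)  # None when the pair never occurs in this order

sample = "Trump spoke. Later Trump called Ukraine.\nUkraine then replied to Trump."
print(shortest_gap(sample, "Trump", "Ukraine"))  # 8
print(shortest_gap(sample, "Ukraine", "Trump"))  # 17
```

Calling it once per order and taking the smaller result covers the case where "Ukraine" appears before "Trump".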

Answer 1: (score: 2)

Use itertools.product:

min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0)

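Or, on Python 3.8+, I think you can use the walrus operator so the difference is computed only once. The assignment has to be parenthesized: without the parentheses it is a syntax error in a comprehension condition, and res would otherwise bind the result of the comparison rather than the difference:

min(res for x, y in product(res2, res1) if (res := x['start'] - y['end']) > 0)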

Code:

from itertools import product

res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 = [{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

print(min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0))
# 23
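The product compares every pair, which is fine for a handful of mentions but grows with len(res1) * len(res2), and the `> 0` filter only counts a Ukraine mention that comes after a Trump mention. As a rough alternative, the two lists can be merged in text order and only neighbouring mentions from different lists compared. This sketch assumes the spans do not overlap (true for the data above); the name `min_gap` is hypothetical:

```python
def min_gap(mentions_a, mentions_b):
    """Smallest gap between a mention in one list and a mention in the other, in either order."""
    # Tag each span with the list it came from, then sort by start offset.
    tagged = sorted(
        [(m["start"], m["end"], 0) for m in mentions_a]
        + [(m["start"], m["end"], 1) for m in mentions_b]
    )
    # With non-overlapping spans, the closest cross-list pair is always adjacent
    # in text order, so one pass over neighbours is enough.
    gaps = [
        nxt[0] - prev[1]
        for prev, nxt in zip(tagged, tagged[1:])
        if prev[2] != nxt[2]
    ]
    return min(gaps, default=None)  # None if either list is empty

res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 = [{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]
print(min_gap(res1, res2))  # 23
```

The cost is dominated by the sort, and the result does not depend on which entity is passed first.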

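Answer 2: (score: 0)

Don't forget to take the absolute value of the distance between the two points, otherwise the shortest distance will come out negative, which I don't think is what you want: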

# All mentions of both entities, with their character offsets.
mentions = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}, {'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

shortest = float('inf')
start = -1
end = -1

for i in range(len(mentions)):
    for j in range(len(mentions)):
        # Only compare mentions of different entities; this also skips i == j.
        if mentions[i]['name'] != mentions[j]['name']:
            dist = abs(mentions[i]['start'] - mentions[j]['end'])
            if dist < shortest:
                shortest = dist
                start = i
                end = j

print("Start: {}, end: {}, distance: {}\n".format(mentions[start]['name'], mentions[end]['name'], shortest))
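The order-independent distance can also be written as a small helper instead of relying on abs over both loop orders. The following is only a sketch: `gap` is a hypothetical name, and treating touching or overlapping spans as distance 0 is an assumption, not something the question specifies.

```python
from itertools import combinations

def gap(a, b):
    """Characters between two spans, whichever comes first; 0 if they touch or overlap."""
    return max(a["start"] - b["end"], b["start"] - a["end"], 0)

mentions = [
    {"name": "Trump", "start": 120, "end": 125},
    {"name": "Trump", "start": 356, "end": 361},
    {"name": "Ukraine", "start": 148, "end": 155},
    {"name": "Ukraine", "start": 425, "end": 432},
    {"name": "Ukraine", "start": 568, "end": 575},
]

# Look at each unordered pair once, skipping pairs that share a name.
closest = min(
    (pair for pair in combinations(mentions, 2) if pair[0]["name"] != pair[1]["name"]),
    key=lambda pair: gap(*pair),
)
print(closest[0]["name"], closest[1]["name"], gap(*closest))  # Trump Ukraine 23
```

Because `combinations` yields each pair once, there is no need for an explicit i != j check.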