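How can I combine these regular expressions to match the variations in this text?

Time: 2016-02-28 14:18:32

Tags: python regex

I have data in this format:

Charter by <company> from <origin> to <destination>

Any or all of the by <company>, from <origin>, and to <destination> blocks may be missing. I am trying to write a regular expression that a) matches the company, origin, and destination, and b) accounts for the fact that, for example, the company name may be missing, in which case it should be blank.

One option is to write a separate regular expression for each possible combination of blocks, like this: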

import re

def parse_line(line):
    pattern = r"^Charter by ([\S ]+) from ([\S ]+) to ([\S ]+)$"
    match = re.match(pattern, line)
    if match is not None:
        company, origin, destination = match.groups()
        return((company, origin, destination))

    pattern = r"^Charter by ([\S ]+) from ([\S ]+)$"
    match = re.match(pattern, line)
    if match is not None:
        company, origin = match.groups()
        destination = ""
        return((company, origin, destination))

    # other pattern combinations
    # etc...


def main():
    data = """Charter by Maersk from China to England
Charter from France
Charter by Safmarine to Poland
Charter by Safmarine from Los Angeles
Charter
Charter to New York
"""

    for line in data.splitlines():
        result = parse_line(line)
        if result is not None:
            company, origin, destination = result  # reuse the result instead of parsing the line twice
            print("{0}/{1}/{2}".format(company, origin, destination))

main()  

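This is annoying but workable for this simple, contrived sample data, but my real data is much more complex: each line can have up to 10 "blocks", so specifying each of the 2^10 possible combinations by hand is not feasible.

I thought this pattern would work: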

pattern = r"^Charter( by ([\S ]+))?( from ([\S ]+))?( to ([\S ]+))?$"
match = re.split(pattern, line)

because it allows every block to be optional. However, for the line Charter by Maersk from China to England, for example, the split returns

['', ' by Maersk from China to England', 'Maersk from China to England', None, None, None, None, '']

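Clearly, the problem is that the first [\S ]+ matches all the way to the end of the string instead of stopping at " from" (note the leading space), but I am not sure how to handle this, because company names, origins, and destinations can all include spaces. Once I have the pattern nailed down, named groups should make it easier to pull the values out.

5 answers:

Answer 0: (score: 1)

Just use the non-greedy form of the pattern: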

pattern = r"^Charter( by ([\S ]+?))?( from ([\S ]+?))?( to ([\S ]+?))?$"

On your example, this gives:

['', ' by Maersk', 'Maersk', ' from China', 'China', ' to England', 'England', '']
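
Since the question asks for blank values when a block is missing, the pattern above can also be used with re.match instead of re.split. The following is a minimal sketch, not part of the original answer: the parse_line helper and the choice to map None to "" are assumptions made for illustration. Note that the lazy quantifiers only work here because the trailing $ forces each group to keep growing until the rest of the pattern can match.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(r"^Charter( by ([\S ]+?))?( from ([\S ]+?))?( to ([\S ]+?))?$")

def parse_line(line):
    """Return (company, origin, destination), using "" for missing blocks."""
    match = pattern.match(line)
    if match is None:
        return None
    # Groups 1, 3 and 5 are the whole " by ..." style blocks; 2, 4 and 6 hold the values.
    return tuple(value or "" for value in match.group(2, 4, 6))

lines = [
    "Charter by Maersk from China to England",
    "Charter from France",
    "Charter by Safmarine to Poland",
    "Charter by Safmarine from Los Angeles",
    "Charter",
    "Charter to New York",
]

results = [parse_line(line) for line in lines]
for result in results:
    print("{0}/{1}/{2}".format(*result))
```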

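Answer 1: (score: 0)

Assuming that by always comes before from, and from before to (when they are present), let's use three regular expressions to capture the by, from, and to values, each relying on what can follow its block. For example, when capturing the by value, we extract everything between by and either "from", "to", or the end of the string.

Code: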

import re

data = """Charter by Maersk from China to England
Charter from France
Charter by Safmarine to Poland
Charter by Safmarine from Los Angeles
Charter
Charter to New York
"""

company_pattern = re.compile(r"by (.*?)(?:from|to|$)")
origin_pattern = re.compile(r"from (.*?)(?:to|$)")
destination_pattern = re.compile(r"to (.*?)$")
for line in data.splitlines():
    match = company_pattern.search(line)
    company = match.group(1).strip() if match else ""

    match = origin_pattern.search(line)
    origin = match.group(1).strip() if match else ""

    match = destination_pattern.search(line)
    destination = match.group(1).strip() if match else ""

    print([company, origin, destination])

Prints:

['Maersk', 'China', 'England']
['', 'France', '']
['Safmarine', '', 'Poland']
['Safmarine', 'Los Angeles', '']
['', '', '']
['', '', 'New York']

Note that (?:...) denotes a non-capturing group, and .*? is a non-greedy match of any characters.
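
One caveat, offered as an observation rather than as part of the answer: the bare alternation (?:from|to|$) can also match inside a word, so with the patterns above an origin such as Stockholm is cut off at its embedded "to". A possible variant, sketched under the same ordering assumption, anchors the keywords with \b word boundaries. The extract helper is a hypothetical name introduced here for illustration.

```python
import re

# \b requires a word boundary, so the "to" inside "Stockholm" or "Victoria"
# is no longer treated as a keyword.
company_pattern = re.compile(r"\bby (.*?)(?:\bfrom\b|\bto\b|$)")
origin_pattern = re.compile(r"\bfrom (.*?)(?:\bto\b|$)")
destination_pattern = re.compile(r"\bto (.*?)$")

def extract(line):
    """Hypothetical helper: return [company, origin, destination], "" when missing."""
    values = []
    for pattern in (company_pattern, origin_pattern, destination_pattern):
        match = pattern.search(line)
        values.append(match.group(1).strip() if match else "")
    return values

print(extract("Charter by Victoria Lines from Stockholm to Porto"))
```

This still cannot tell a keyword from a value that contains by, from, or to as a separate word; that ambiguity is inherent in the data format.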

Answer 2: (score: 0)

The regular expression you need is probably:

^Charter(?: by ([\S ]+?))?(?: from ([\S ]+?))?(?: to ([\S ]+?))?$

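Notes:

  1. (...) is a capturing group, i.e. a group whose content you can access, for example via .groups(). (?:...) is non-capturing and does not show up in .groups().

  2. [\S ]+ is greedy: it matches as much as possible. [\S ]+? is lazy: it matches the shortest possible text.

  3. A (...)? or (?:...)? group is optional: it may or may not be present in the text.

  4. re.split is the wrong tool here: use re.match (or re.search), for example: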

       import re
       pattern = r'^Charter(?: by ([\S ]+?))?(?: from ([\S ]+?))?(?: to ([\S ]+?))?$'
    
       match = re.match(pattern, 'Charter by Maersk from China to England')
       match.groups()
    => ('Maersk', 'China', 'England')
    
       match = re.match(pattern, 'Charter')
       match.groups()
    => (None, None, None)
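
Because the question mentions up to 10 blocks and named groups, the same pattern can be generated from a list of keywords rather than typed by hand. The following is a sketch under stated assumptions: the BLOCKS table, the group names, and the build_pattern and parse_blocks helpers are illustrative and do not come from the answer, and the blocks are assumed to always appear in the listed order. groupdict(default="") supplies a blank for every missing block.

```python
import re

# (keyword, group name) pairs in the order they may appear; extend as needed.
BLOCKS = [("by", "company"), ("from", "origin"), ("to", "destination")]

def build_pattern(blocks):
    """Build one optional, lazily matched named group per keyword."""
    parts = [r"(?: {} (?P<{}>[\S ]+?))?".format(re.escape(keyword), name)
             for keyword, name in blocks]
    return re.compile("^Charter" + "".join(parts) + "$")

PATTERN = build_pattern(BLOCKS)

def parse_blocks(line):
    """Return a dict of block values, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict(default="") if match else None

print(parse_blocks("Charter by Safmarine to Poland"))
```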
    

Answer 3: (score: 0)

Here is an approach that uses three regular expressions, re.findall, and a helper function:

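import re

def joiner(x):
    # re.findall returns a list of group tuples; join the first one, or return '' if there is no match
    if x: return ''.join(x[0])
    else: return ''

# data is the sample string from the question; the order is by, from, to, matching the output below
patterns = [re.compile(r'by ([A-Za-z]+)(\s[A-Z][a-z]*)?'),
            re.compile(r'from ([A-Za-z]+)(\s[A-Z][a-z]*)?'),
            re.compile(r'to ([A-Za-z]+)(\s[A-Z][a-z]*)?')]

results = [[joiner(re.findall(p, line)) for line in data.splitlines()] for p in patterns]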

Output:

[['Maersk', '', 'Safmarine', 'Safmarine', '', ''],
 ['China', 'France', '', 'Los Angeles', '', ''],
 ['England', '', 'Poland', '', '', 'New York']]

The speed is not bad:

In [175]: %timeit [[joiner(re.findall(p, line)) for line in data.splitlines()] for p in patterns]
10000 loops, best of 3: 42.8 µs per loop

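It would be faster and simpler if it were not for two-word city names like "New York" and "Los Angeles".

Answer 4: (score: -3)

I think I have it; please tell me whether it helps you: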

regexp = "^by(.*)from(.*)to(.*)$"

. means any character, and * means zero or more times.
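
A quick check, offered as an observation rather than as part of this answer: the pattern is anchored at by and requires all three keywords, so it matches none of the sample lines, which all begin with Charter. The snippet below assumes the sample lines from the question.

```python
import re

regexp = "^by(.*)from(.*)to(.*)$"

lines = [
    "Charter by Maersk from China to England",
    "Charter from France",
    "Charter by Safmarine to Poland",
    "Charter by Safmarine from Los Angeles",
    "Charter",
    "Charter to New York",
]

# Every line starts with "Charter", so "^by" can never match at position 0.
matches = [re.match(regexp, line) for line in lines]
print(matches)
```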