Question

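I'm looking for a clean way to extract some data from a string using regular expressions and Python's re module. Each line of the string has the form key = value. I'm only interested in certain keys, and for some strings those keys may be missing. I can think of a few ways to do this by iterating over the string line by line or by using re.finditer(), but what I'd really like to do is use named groups and a single call to re.match(), ending up with a dictionary of groups from the returned match object's .groupdict() method. I can do this with named groups when all the groups are present, but it seems that if I make the groups optional, they don't match even when they are present.

I'm probably missing something obvious, but is there a way to do this in a single regular expression, or do I need a multi-step process?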
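The behaviour described above can be reproduced with a small sketch. The pattern and the `text` sample here are assumptions made for illustration, not the asker's original code: a lazy prefix followed by an optional group is satisfied by matching nothing at all, so the group stays None even though the key is present.

```python
import re

# Hypothetical sample input in the key = value format described above.
text = "Name: default\ntype = Route\ncount = 5\n"

# Lazy prefix followed by an OPTIONAL named group.
optional = re.compile(r"(?s).*?(?:count = (?P<count>\d+))?")

# The lazy .*? matches zero characters and the optional group matches
# nothing, so the overall match succeeds immediately at position 0.
print(optional.match(text).groupdict())  # {'count': None}

# With a MANDATORY group the lazy prefix has to expand until the key is found.
required = re.compile(r"(?s).*?count = (?P<count>\d+)")
print(required.match(text).groupdict())  # {'count': '5'}
```

The regex engine takes the first way of succeeding that it finds, and skipping an optional group at the start of the string is always available before any longer attempt.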

Answer 1

You can solve this with a simple one-line regular expression.

>>> dict(re.findall(r'^(type|count|destinations) = (\S*)$', string1, re.MULTILINE))
{'count': '5', 'type': 'Route', 'destinations': 'default'}

>>> dict(re.findall(r'^(type|count|destinations) = (\S*)$', string2, re.MULTILINE))
{'type': 'Route', 'destinations': 'default'}

Answer 2

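You can use something like the following dictionary comprehension, which splits the text into key/value pairs and filters them against an input tuple of the desired field names: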

import re

def regexandgroup(instr: str, savekeys: tuple):
    # raw string, so that \w reaches the regex engine unchanged
    exp = r'^(\w+)[ \t:=]+([\w:]+)$'
    match = re.findall(exp, instr, re.MULTILINE)

    return {group[0]: group[1] for group in match if group[0] in savekeys}

Which gives us:

>>> print(regexandgroup(string1, ('type', 'count', 'destinations')))
{'type': 'Route', 'count': '5', 'destinations': 'default'}

>>> print(regexandgroup(string2, ('type', 'count', 'destinations')))
{'type': 'Route', 'destinations': 'default'}

Answer 3

Check this out.

#python 3.5.2
import re

# trying to extract 'type', 'count' and 'destinations'.
# string1 has all keys and a single re.match works
# string2 is missing 'count'... any suggestions?

string1 = """
Name: default
type = Route
status = 0
count = 5
enabled = False
start_time = 18:00:00
end_time = 00:00:00
destinations = default
started = False
"""

string2 = """
Name: default
type = Route
status = 0
enabled = False
start_time = 18:00:00
end_time = 00:00:00
destinations = default
started = False
"""

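# The inline flags (?mx) must be the very first thing in the pattern;
# newer Python versions reject global flags that appear after other text.
pattern = re.compile(r"""(?mx)
\A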
(?=(?:[\s\S]*?^\s*type\s*=\s*(?P<type>.*)$)?)
(?=(?:[\s\S]*?^\s*count\s*=\s*(?P<count>.*)$)?)
(?=(?:[\s\S]*?^\s*destinations\s*=\s*(?P<destinations>.*)$)?)
""")

m1 = pattern.match(string1)
print(m1.groupdict())

m2 = pattern.match(string2)
print(m2.groupdict())

To try it online, click here.
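The idea behind this pattern is that every key gets its own optional lookahead anchored at the start of the string, so a missing key simply leaves its group as None. As a rough sketch of how that could be generalized, the helper below builds such a pattern from a list of keys. `build_pattern` and `sample` are hypothetical names introduced here, and the keys are assumed to be valid Python identifiers.

```python
import re

def build_pattern(keys):
    # Hypothetical helper: one optional lookahead per key.
    # Keys are assumed to be valid identifiers (they are used as group names).
    parts = [
        rf"(?=(?:[\s\S]*?^\s*{key}\s*=\s*(?P<{key}>.*)$)?)"
        for key in keys
    ]
    # \A anchors every lookahead at the very start of the string.
    return re.compile(r"\A" + "".join(parts), re.MULTILINE)

# Sample input with the 'count' line missing.
sample = "Name: default\ntype = Route\nstatus = 0\ndestinations = default\n"

m = build_pattern(["type", "count", "destinations"]).match(sample)
print(m.groupdict())
# {'type': 'Route', 'count': None, 'destinations': 'default'}
```

Because the lookaheads consume nothing, the match itself is empty; only the captured groups carry information.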

Answer 4

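You didn't really specify whether any of the fields can be missing or whether count is the only one that can be. However, this pattern will match all three of the cases you suggest and store them in named capture groups.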

type = (?P<type>\S*)|count = (?P<count>\d+)|destinations = (?P<destinations>\S*)

Demo

The | means OR, so you are looking for type = ... or count = ... or destinations = ...
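Since each alternative matches a different line, a single re.match() call cannot fill all of the groups at once. One possible way to use the alternation from Python is to walk the matches with re.finditer() and merge them, as in this sketch. The groups are written in Python's (?P<name>...) syntax, and `collect` and `sample` are hypothetical names introduced for illustration.

```python
import re

# The alternation above, using Python's (?P<name>...) named-group syntax.
pattern = re.compile(
    r"type = (?P<type>\S*)"
    r"|count = (?P<count>\d+)"
    r"|destinations = (?P<destinations>\S*)"
)

def collect(text):
    # Start with every named group set to None, like groupdict() would.
    result = dict.fromkeys(pattern.groupindex)
    for m in pattern.finditer(text):
        # Each match fills exactly one group; lastgroup is its name.
        result[m.lastgroup] = m.group(m.lastgroup)
    return result

# Sample input with the 'count' line missing.
sample = "Name: default\ntype = Route\nstatus = 0\ndestinations = default\n"
print(collect(sample))
# {'type': 'Route', 'count': None, 'destinations': 'default'}
```

Note that the alternatives are not anchored to the start of a line, so a key such as subtype would also be picked up by the type branch; adding ^ with re.MULTILINE would tighten that.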

Answer 5

Just extract the key/value pairs and either ignore the other keys or add … if x.split(' = ')[0] in wanted_keys to filter them. If you want to fill in missing keys, use setdefault.

>>> dict(x.split(' = ') for x in string1.strip().splitlines()[1:])
{'status': '0', 'count': '5', 'started': 'False', 'start_time': '18:00:00', 'enabled': 'False', 'end_time': '00:00:00', 'type': 'Route', 'destinations': 'default'}
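The filtering and setdefault steps mentioned above could look roughly like the following sketch. The `extract` function and the `sample` string are hypothetical names, and None is an assumed default for missing keys.

```python
def extract(text, wanted_keys):
    # Split each "key = value" line once; lines without " = " are skipped.
    pairs = (
        line.split(' = ', 1)
        for line in text.strip().splitlines()
        if ' = ' in line
    )
    # Keep only the keys we care about.
    result = {key: value for key, value in pairs if key in wanted_keys}
    # Fill in any wanted key that was absent from the text.
    for key in wanted_keys:
        result.setdefault(key, None)
    return result

# Sample input with the 'count' line missing.
sample = "Name: default\ntype = Route\nstatus = 0\nstart_time = 18:00:00\ndestinations = default\n"
print(extract(sample, ('type', 'count', 'destinations')))
# {'type': 'Route', 'destinations': 'default', 'count': None}
```

Skipping lines that lack ' = ' means the Name: header does not have to be sliced off by position.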

Answer 6

Why not use pandas to do it all in one go? The following uses the regex from @andrei-odegov:

import pandas as pd


# create a Series object from your strings
s = pd.Series([string1, string2])

# the inline flags (?mx) must be the very first thing in the pattern
regex = r"""(?mx)
    \A
    (?=(?:[\s\S]*?^\s*type\s*=\s*(?P<type>.*)$)?)
    (?=(?:[\s\S]*?^\s*count\s*=\s*(?P<count>.*)$)?)
    (?=(?:[\s\S]*?^\s*destinations\s*=\s*(?P<destinations>.*)$)?)
"""

# return a DataFrame which contains your results
df = s.str.extract(regex, expand=True)

print(df)


    type count destinations
0  Route     5      default
1  Route   NaN      default

python regex optional named groups
