Python正则表达式捕获多个组N次

时间:2016-12-14 16:23:24

标签: regex python-3.x regex-group

我正在解析进程的/proc/PID/stat。该文件输入:

25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0

我想出了:

import re

def get_stats(pid):
    with open('/proc/{}/stat'.format(pid)) as fh:
        stats_raw = fh.read()
    stat_pattern = '(\d+\s)(\(.+\)\s)(\w+\s)(-?\d+\s?)'
    return re.findall(stat_pattern, stats_raw)

这将匹配前三个组,但只返回最后一组(-?\d+\s?)的一个字段:

[('25473 ', '(firefox) ', 'S ', '25468 ')]

我一直在寻找一种方法来匹配最后一组的固定号码:

'(\d+\s)(\(.+\)\s)(\w+\s)(-?\d+\s?){49}'

1 个答案:

答案 0 :(得分:1)

您无法使用re正则表达式访问每次重复捕获。您可以将字符串的其余部分捕获到第4组中,然后使用空格分割:

import re
s = r'25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0'
stat_pattern = r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*(.*)'
res = []
for m in re.finditer(stat_pattern, s):
    res.append(m.group(1))
    res.append(m.group(2))
    res.append(m.group(3))
    res.extend(m.group(4).split())
print(res)

输出:

['25473', '(firefox)', 'S', '25468', '25465', '25465', '0', '-1', '4194304', '149151169', '108282', '32', '15', '2791321', '436115', '846', '86', '20', '0', '84', '0', '9648305', '2937786368', '209665', '18446744073709551615', '93875088982016', '93875089099888', '140722931705632', '140722931699424', '140660842079373', '0', '0', '4102', '33572009', '0', '0', '0', '17', '1', '0', '0', '175', '0', '0', '93875089107104', '93875089109128', '93875116752896', '140722931707410', '140722931707418', '140722931707418', '140722931707879', '0']

如果你真的只需要在第4组中获得49个数字,请使用

r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*((?:-?\d+\s?){49})'
                                ^^^^^^^^^^^^^^^^^^

使用PyPi regex module,您可以使用r'(?P<o>\d+)\s+(?P<o>\([^)]+\))\s+(?P<o>\w+)\s+(?P<o>-?\d+\s?){49}'并在运行regex.search(pattern, s)访问.captures("o")堆栈后使用您需要的值。

>>> import regex
>>> s = '25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0'
>>> stat_pattern = r'(?P<o>\d+)\s+(?P<o>\([^)]+\))\s+(?P<o>\w+)\s+(?P<o>-?\d+\s?){49}'
>>> m = regex.search(stat_pattern, s)
>>> if m:
    print(m.captures("o"))

输出:

['25473', '(firefox)', 'S', '25468 ', '25465 ', '25465 ', '0 ', '-1 ', '4194304 ', '149151169 ', '108282 ', '32 ', '15 ', '2791321 ', '436115 ', '846 ', '86 ', '20 ', '0 ', '84 ', '0 ', '9648305 ', '2937786368 ', '209665 ', '18446744073709551615 ', '93875088982016 ', '93875089099888 ', '140722931705632 ', '140722931699424 ', '140660842079373 ', '0 ', '0 ', '4102 ', '33572009 ', '0 ', '0 ', '0 ', '17 ', '1 ', '0 ', '0 ', '175 ', '0 ', '0 ', '93875089107104 ', '93875089109128 ', '93875116752896 ', '140722931707410 ', '140722931707418 ', '140722931707418 ', '140722931707879 ', '0']