Question

I want to use Python to parse /var/log/monthly.out on OS X and extract the user accounting totals. The log file looks like this:

Mon Feb  1 09:12:41 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      688.31
    example   401.12
    _mbsetupuser   287.10
    root         0.05
    admin     0.04

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      377.92
    example   377.92

-- End of monthly output --

I can extract the username/total pairs with this regular expression:

\t(\w*)\W*(\d*\.\d{2})

In Python:

>>> import re
>>> re.findall(r'\t(\w*)\W*(\d*\.\d{2})', open('/var/log/monthly.out', 'r').read())
[('total', '688.31'), ('example', '401.12'), ('_mbsetupuser', '287.10'), ('root', '0.05'), ('admin', '0.04'), ('total', '0.00'), ('total', '377.92'), ('example', '377.92')]

But I can't figure out how to extract the date line in the same way and attach it to that month's username/total pairs.

Answer 1

Use str.split().

import re

re_user_amount = r'\s+(\w+)\s+(\d*\.\d{2})'
re_date = r'\w{3}\s+\w{3}\s+\d+\s+\d\d:\d\d:\d\d \w+ \d{4}'

with open('/var/log/monthly.out', 'r') as f:
    content = f.read()
    sections = content.split('-- End of monthly output --')

    for section in sections:
        date = re.findall(re_date, section)
        matches = re.findall(re_user_amount, section)

        print(date, matches)

If you want to convert the date string into an actual datetime, see Converting string into datetime.
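As a minimal sketch of that conversion, assuming the date lines always follow the layout shown in the question, that the timezone name is one strptime recognizes (such as GMT), and that day and month names are in English: the helper name parse_date_line is hypothetical and not part of the answer's code.

```python
from datetime import datetime

def parse_date_line(date_line):
    """Parse a line such as 'Mon Feb  1 09:12:41 GMT 2016' (hypothetical helper)."""
    # Collapse runs of spaces so the space-padded day ('Feb  1') becomes 'Feb 1'.
    normalized = ' '.join(date_line.split())
    # %Z only consumes the timezone name; the result is a naive datetime.
    return datetime.strptime(normalized, '%a %b %d %H:%M:%S %Z %Y')

print(parse_date_line('Mon Feb  1 09:12:41 GMT 2016'))
```

Since re.findall returns a list, you would pass date[0] from the loop above after checking that the list is not empty.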

Answer 2

Well, regular expressions are rarely a magic cure. They are a good tool for simple string parsing, but they don't replace good old programming!

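If you look at your data, you'll notice that each report always starts with a date and ends with the line -- End of monthly output --. So a good way to handle this is to split the data into one chunk per monthly report.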

Let's start with your data:

>>> s = """\
... Mon Feb  1 09:12:41 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total      688.31
...     example   401.12
...     _mbsetupuser   287.10
...     root         0.05
...     admin     0.04
... 
... -- End of monthly output --
... 
... Tue Feb 16 14:27:21 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total        0.00
... 
... -- End of monthly output --
... 
... Thu Mar  3 09:37:31 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total      377.92
...     example   377.92
... 
... -- End of monthly output --"""

Let's split it on the end-of-month line:

>>> reports = s.split('-- End of monthly output --')
>>> reports
['Mon Feb  1 09:12:41 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total      688.31\n    example   401.12\n    _mbsetupuser   287.10\n    root         0.05\n    admin     0.04\n\n', '\n\nTue Feb 16 14:27:21 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total        0.00\n\n', '\n\nThu Mar  3 09:37:31 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total      377.92\n    example   377.92\n\n', '']

Then you can separate the accounting data from the rest of the report:

>>> report = reports[0]
>>> head, tail = report.split('Doing login accounting:')

Now let's extract the date line:

>>> date_line = head.strip().split('\n')[0]

And fill a dictionary with the username/total pairs:

>>> accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))

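The trick here is to use zip() to build pairs from two slices of tail.split(). The "left" side of each pair comes from the slice that starts at index 0 and takes every second item; the "right" side comes from the slice that starts at index 1 and takes every second item. This gives: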

{'admin': '0.04', 'root': '0.05', 'total': '688.31', '_mbsetupuser': '287.10', 'example': '401.12'}
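To make the slicing easier to follow, here is a small sketch that spells out the intermediate lists for the shortest non-trivial report. The variable names tokens, names and totals are only for illustration.

```python
# The text after 'Doing login accounting:' for the March report.
tail = """
    total      377.92
    example   377.92

"""

tokens = tail.split()   # ['total', '377.92', 'example', '377.92']
names = tokens[::2]     # every second token, starting at index 0
totals = tokens[1::2]   # every second token, starting at index 1

accounting = dict(zip(names, totals))
print(accounting)
```

This relies on usernames never containing whitespace, which holds for the log shown in the question.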

Now that this works, you can do it in a for loop:

import datetime

def parse_monthly_log(log_path='/var/log/monthly.out'):
    with open(log_path, 'r') as log:
        reports = log.read().strip('\n ').split('-- End of monthly output --')
        for report in filter(lambda it: it, reports):
            head, tail = report.split('Doing login accounting:')
            date_line = head.strip().split('\n')[0]
            accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
            yield {
                'date': datetime.datetime.strptime(date_line.replace('  ', ' 0'), '%a %b %d %H:%M:%S %Z %Y'),
                'accounting': accounting
            }

>>> import pprint
>>> pprint.pprint(list(parse_monthly_log()), indent=2)
[ { 'accounting': { '_mbsetupuser': '287.10',
                    'admin': '0.04',
                    'example': '401.12',
                    'root': '0.05',
                    'total': '688.31'},
    'date': datetime.datetime(2016, 2, 1, 9, 12, 41)},
{ 'accounting': { 'total': '0.00'},
    'date': datetime.datetime(2016, 2, 16, 14, 27, 21)},
{ 'accounting': { 'example': '377.92', 'total': '377.92'},
    'date': datetime.datetime(2016, 3, 3, 9, 37, 31)}]

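So you can have a Pythonic solution without a single regular expression.

Note: I had to do a small trick with the datetime, because the log pads the day of the month with a space rather than the zero-padded form that %d documents. I use str.replace() to turn the double space in the date string into a space followed by 0.

Note: the filter() and strip() calls used in the for report… loop remove leading and trailing empty reports, depending on how the log file starts or ends.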
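A small sketch of why that clean-up matters, using a made-up two-report string in place of the real log (log_text and END_MARKER are names introduced only for this example):

```python
END_MARKER = '-- End of monthly output --'

# Stand-in for a log file that ends with the marker plus a newline.
log_text = 'first report\n' + END_MARKER + '\nsecond report\n' + END_MARKER + '\n'

# A plain split leaves a trailing chunk that holds only the final newline.
raw = log_text.split(END_MARKER)

# Stripping first turns that chunk into '', which the filter then drops.
cleaned = [chunk for chunk in log_text.strip('\n ').split(END_MARKER) if chunk]

print(raw)
print(cleaned)
```

Without the clean-up, the last chunk would reach report.split('Doing login accounting:') and the two-name unpacking would raise ValueError, because that chunk does not contain the heading.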

Answer 3

Here is something shorter:

users

This splits the file contents into months, extracts the date of each month, and searches for all accounts in each month.
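The code for this answer is not shown above, so the following is only a sketch of the approach it describes, not the answerer's original. The names monthly_totals, DATE_RE and PAIR_RE are hypothetical, and the patterns assume the exact layout from the question.

```python
import re

END_MARKER = '-- End of monthly output --'
# A date line such as 'Mon Feb  1 09:12:41 GMT 2016' (day may be space-padded).
DATE_RE = re.compile(r'^\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \w+ \d{4}$', re.MULTILINE)
# An indented 'name   123.45' line.
PAIR_RE = re.compile(r'^[ \t]+(\w+)[ \t]+(\d+\.\d{2})[ \t]*$', re.MULTILINE)

def monthly_totals(text):
    """Return a list of (date_line, [(user, total), ...]) tuples."""
    results = []
    for month in text.split(END_MARKER):
        date = DATE_RE.search(month)
        if date is None:
            continue  # skip empty chunks before or after the reports
        results.append((date.group(0), PAIR_RE.findall(month)))
    return results

SAMPLE = (
    'Mon Feb  1 09:12:41 GMT 2016\n\nRotating fax log files:\n\n'
    'Doing login accounting:\n    total      688.31\n    example   401.12\n\n'
    '-- End of monthly output --\n\n'
    'Tue Feb 16 14:27:21 GMT 2016\n\nRotating fax log files:\n\n'
    'Doing login accounting:\n    total        0.00\n\n'
    '-- End of monthly output --\n'
)

for date_line, pairs in monthly_totals(SAMPLE):
    print(date_line, pairs)
```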

Answer 4

You might want to try the following regular expression, although it is not very elegant:

import re

string = """
Mon Feb  1 09:12:41 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      688.31
    example   401.12
    _mbsetupuser   287.10
    root         0.05
    admin     0.04

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      377.92
    example   377.92

-- End of monthly output --
"""
# Raw string so the backslash escapes reach the regex engine unchanged.
pattern = r'(\w+\s+\w+\s+[\d:\s]+[A-Z]{3}\s+\d{4})[\s\S]+?((?:\w+)\s+(?:[0-9.]+))\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s*(?:((?:\w+)\s+(?:[0-9.]+)))?\s*(?:((?:\w+)\s*(?:[0-9.]+)))?'
print(re.findall(pattern, string))

Output:

[('Mon Feb  1 09:12:41 GMT 2016', 'total      688.31', 'example   401.12', '_mbsetupuser   287.10', 'root         0.05', 'admin     0.04'), 
('Tue Feb 16 14:27:21 GMT 2016', 'total        0.00', '', '', '', ''), 
('Thu Mar  3 09:37:31 GMT 2016', 'total      377.92', 'example   377.92', '', '', '')]

REGEX DEMO.

Regex for user accounting with monthly.out
