Question

For some odd reason, I have to read a log file in the following format:

Tue Apr  3 08:51:05 2018 foo=123 bar=321 spam=eggs msg="String with spaces in it"
Tue Apr  3 10:31:46 2018 foo=111 bar=222 spam=eggs msg="Different string with spaces"
...

I would like to read it into the following DataFrame:

   bar  foo                           msg  spam                      time
0  321  123      String with spaces in it  eggs  Tue Apr  3 08:51:05 2018
1  222  111  Different string with spaces  eggs  Tue Apr  3 10:31:46 2018
...

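Each <key>=<value> pair gets its own column, and the date at the start gets its own column named time.

Is there a pandas way to handle this? (Or at least the <key>=<value> part?)

Or, at the very least, is there a better way than regular expressions to break all of this up into a form that pandas can accept?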

Answer 1

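Thanks to @edouardtheron and the shlex module.

If you have a better solution, feel free to answer.

In the meantime, here is what I came up with. First, import the libraries: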

import re
import shlex
import pandas as pd

Create some example data:

# Example data
test_string = """
Tue Apr  3 08:51:05 2018 foo=123 bar=321 spam=eggs msg="String with spaces in it"
Tue Apr  3 10:31:46 2018 foo=111 bar=222 spam=eggs msg="Different string"
"""

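Create a regular expression that matches the whole line but splits it into two groups:

1: the date at the start, ((?:[a-zA-Z]{3,4} ){2}[ \d]\d \d\d:\d\d:\d\d \d{4})

2: everything else, (.*)

# Raw string avoids invalid-escape warnings; [ \d]\d accepts a space-padded or a two-digit day
patt = re.compile(r'((?:[a-zA-Z]{3,4} ){2}[ \d]\d \d\d:\d\d:\d\d \d{4}) (.*)')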
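To make the two groups concrete, here is a small check of what the pattern captures on one of the sample lines. This is only a sketch: the pattern below is the variant that accepts either a space-padded or a two-digit day, and the second sample line (with day 13) is made up for illustration.

```python
import re

# Group 1: the leading date; group 2: the rest of the line.
# [ \d]\d accepts a space-padded day (" 3") as well as a two-digit day ("13").
patt = re.compile(r'((?:[a-zA-Z]{3,4} ){2}[ \d]\d \d\d:\d\d:\d\d \d{4}) (.*)')

line = 'Tue Apr  3 08:51:05 2018 foo=123 bar=321 spam=eggs msg="String with spaces in it"'
time, key_values = patt.match(line).groups()

print(time)        # Tue Apr  3 08:51:05 2018
print(key_values)  # foo=123 bar=321 spam=eggs msg="String with spaces in it"

# Made-up line with a two-digit day, to show that it matches as well
two_digit_day = patt.match('Fri Apr 13 10:31:46 2018 foo=1')
print(two_digit_day.group(1))  # Fri Apr 13 10:31:46 2018
```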

Loop over the lines of the test string and apply the regex, then use shlex to parse key_values into a dictionary:

sers = []
for line in test_string.splitlines():
    matt = patt.match(line)
    if not matt:
        # Skip empty lines and lines that do not start with a date
        continue
    # Extract the two groups
    time, key_values = matt.groups()

    # shlex.split keeps the quoted msg value together as a single token
    ser = pd.Series(dict(token.split('=', 1) for token in shlex.split(key_values)))
    ser['log_time'] = time
    sers.append(ser)
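The reason shlex helps here is that it tokenizes the way a POSIX shell would, so the quoted msg value stays in one piece even though it contains spaces. A minimal illustration on the key/value part of one sample line (the variable names are just for this example):

```python
import shlex

key_values = 'foo=123 bar=321 spam=eggs msg="String with spaces in it"'

# shlex.split honours the double quotes and strips them from the result
tokens = shlex.split(key_values)
print(tokens)
# ['foo=123', 'bar=321', 'spam=eggs', 'msg=String with spaces in it']

# Split each token on the first '=' only, so a value may itself contain '='
record = dict(token.split('=', 1) for token in tokens)
print(record['msg'])  # String with spaces in it
```

A plain key_values.split() would instead break the message into five separate tokens, which is why a naive whitespace split is not enough for this format.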

Finally, concatenate all the rows into a DataFrame:

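# Concatenate the Series into a DataFrame (one row per log line after transposing)
df = pd.concat(sers, axis=1).T
# Convert 'log_time' to an actual datetime. %H:%M:%S avoids the locale-dependent %X,
# and a single space in the format matches any run of whitespace in the input.
df['log_time'] = pd.to_datetime(df['log_time'], format='%a %b %d %H:%M:%S %Y', exact=True)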

This produces the following DataFrame (the column order may differ depending on your pandas version):

   bar  foo                       msg  spam            log_time
0  321  123  String with spaces in it  eggs 2018-04-03 08:51:05
1  222  111          Different string  eggs 2018-04-03 10:31:46
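As a possible variation, the same steps can be wrapped in a function that collects plain dicts and builds the DataFrame once, instead of concatenating Series and transposing. This is only a sketch: parse_log is a hypothetical helper name, the pattern is the variant that also accepts two-digit days, and converting foo and bar to numbers assumes those fields are always numeric in your logs.

```python
import re
import shlex

import pandas as pd

# Group 1: leading date; group 2: the key=value part of the line
PATT = re.compile(r'((?:[a-zA-Z]{3,4} ){2}[ \d]\d \d\d:\d\d:\d\d \d{4}) (.*)')


def parse_log(lines):
    """Hypothetical helper: turn an iterable of log lines into a DataFrame."""
    records = []
    for line in lines:
        match = PATT.match(line.strip())
        if not match:
            # Skip blank lines and anything that does not start with a date
            continue
        time, key_values = match.groups()
        record = dict(token.split('=', 1) for token in shlex.split(key_values))
        record['log_time'] = time
        records.append(record)

    if not records:
        return pd.DataFrame()

    df = pd.DataFrame(records)
    df['log_time'] = pd.to_datetime(df['log_time'], format='%a %b %d %H:%M:%S %Y')
    return df


sample = [
    'Tue Apr  3 08:51:05 2018 foo=123 bar=321 spam=eggs msg="String with spaces in it"',
    '',
    'Tue Apr  3 10:31:46 2018 foo=111 bar=222 spam=eggs msg="Different string"',
]
df = parse_log(sample)

# All parsed values are strings; convert the columns assumed to be numeric
df[['foo', 'bar']] = df[['foo', 'bar']].apply(pd.to_numeric)
print(df.dtypes)
```

Because parse_log accepts any iterable of lines, an open file object can be passed to it directly, so the whole log does not have to be read into one string first.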

Reading a log file of <key>=<value> pairs with pandas
