Extracting values from a string column in pandas

Time: 2020-07-06 21:46:44

Tags: python pandas

I have a column of log messages in a pandas DataFrame.

col
Sequential mode! HostOsCheck fails, so bye!
[c01][OK][HostOsCheck] Skip cji02  because it is DOWN
[c01][Stage 3] 2/3 checks passed
[c01][FAIL][HostOsCheck] Percentage working
[c01][FAIL][HostOsCheck] Percentage working
[c02][OK][ILOStatusCheck] Percentage of working 

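If the string contains [OK], the check passed; if it contains [FAIL], the check failed.

I want to extract the check type (the name ending in Check), the cluster name, and the status (passed or failed) from each log line by creating a separate column for each in the same df, like this: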

col cluster Status  name
Sequential mode! HostOsCheck fails, so bye! c01 NA  HostOsCheck
[c01][OK][HostOsCheck] Skip cji02  because it is DOWN   c01 OK  HostOsCheck
[c01][Stage 3] 2/3 checks passed    c01 NA  NA
[c01][FAIL][HostOsCheck] Percentage working c01 FAIL    HostOsCheck
[c01][FAIL][HostOsCheck] Percentage working c01 FAIL    HostOsCheck
[c02][OK][ILOStatusCheck] Percentage of working c02 OK  ILOStatusCheck

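The string can contain any log message, but the status always appears inside [] as [OK] if the check passed or [FAIL] if it failed. The check name also appears inside [].

I know I can use a regex with col.str, so I tried the following: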

df['name'] = msg.str.extract(r'([\w{1,}Check])', expand = True)

But instead of the full check name (HostOsCheck, etc.) I get a single character:

    0
0   f
1   f
2   S
3   f
4   f

The same happens for the status:

df['status'] = msg.str.extract(r'([OK|FAIL])', expand = True)

    0
0   O
1   O
2   O
3   O
4   O
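
The single-character results above follow from how square brackets work in a regex: unescaped, they define a character class, so the pattern matches exactly one character from the listed set. Here is a minimal sketch using the standard `re` module on one of the sample lines (the variable names are illustrative only):

```python
import re

line = "[c01][FAIL][HostOsCheck] Percentage working"

# Unescaped brackets form a character class: one character out of
# O, K, |, F, A, I, L. The capture group therefore holds a single letter.
status_wrong = re.search(r'([OK|FAIL])', line).group(1)

# Same issue for the name: \w, {, 1, the comma, }, C, h, e, c, k are all
# members of one class, so the first word character in the line matches.
name_wrong = re.search(r'([\w{1,}Check])', line).group(1)

# Escaping the brackets makes them literal and lets | act as alternation.
status_ok = re.search(r'\[(OK|FAIL)\]', line).group(1)
name_ok = re.search(r'\[(\w+Check)\]', line).group(1)

print(status_wrong, name_wrong, status_ok, name_ok)
```
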

Edit:

Figured it out. I was missing the \ needed to escape the [ and ].

msg.str.extract(r'\[(\w{1,}Check)\]', expand = True)
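
With the brackets escaped, the same idea can be applied to all three columns. The sketch below rebuilds the sample data from the question and assumes the cluster is the first bracketed token at the start of a line; that rule is an assumption, not something the question states. Rows without a matching bracketed token get NaN, so the first row comes out empty here, unlike the desired table.

```python
import pandas as pd

df = pd.DataFrame({"col": [
    "Sequential mode! HostOsCheck fails, so bye!",
    "[c01][OK][HostOsCheck] Skip cji02  because it is DOWN",
    "[c01][Stage 3] 2/3 checks passed",
    "[c01][FAIL][HostOsCheck] Percentage working",
    "[c01][FAIL][HostOsCheck] Percentage working",
    "[c02][OK][ILOStatusCheck] Percentage of working",
]})

# Assumption: the cluster is the first [token] at the start of the line.
df["cluster"] = df["col"].str.extract(r'^\[(\w+)\]', expand=False)
# Status is the literal text [OK] or [FAIL]; anything else yields NaN.
df["Status"] = df["col"].str.extract(r'\[(OK|FAIL)\]', expand=False)
# Check name: a bracketed word ending in "Check".
df["name"] = df["col"].str.extract(r'\[(\w+Check)\]', expand=False)

print(df)
```

Passing `expand=False` with a single capture group returns a Series, which can be assigned directly to a new column. Dropping the escaped brackets from the name pattern, as in `r'(\w+Check)'`, would also pick up the unbracketed HostOsCheck in the first row.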

1 answer:

Answer 0: (score: 0)

Suppose a.txt contains the log. a.txt

[c01][OK][HostOsCheck] Skip cji02  because it is DOWN
[c01][Stage 3] 2/3 checks passed
[c01][FAIL][HostOsCheck] Percentage working
[c01][FAIL][HostOsCheck] Percentage working
[c02][OK][ILOStatusCheck] Percentage of working 

Python code

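import pandas as pd

rows = []
with open('a.txt', 'r') as f:
    for line in f:
        # Split on "]" and drop the opening "[" from each piece.
        parts = [x.replace("[", "").strip() for x in line.strip().split("]")]
        # Skip blank lines, which produce a single empty piece.
        if any(parts):
            rows.append(parts)

# Note: a line with fewer bracketed fields, such as "[c01][Stage 3] ...",
# yields fewer pieces, so its values shift left and the last column is empty.
df = pd.DataFrame(rows, columns=['Col', 'Status', 'Name', 'Reason'])
print(df)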
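
Splitting on "]" puts values in the wrong column when a line lacks a status or check name. One possible alternative, sketched here and not part of the original answer, is a single regex with named groups in which the status and name are optional. The log text is inlined as a string instead of being read from a.txt, and the pattern is an assumption about the log layout.

```python
import pandas as pd

log_text = """
[c01][OK][HostOsCheck] Skip cji02  because it is DOWN
[c01][Stage 3] 2/3 checks passed
[c01][FAIL][HostOsCheck] Percentage working
[c01][FAIL][HostOsCheck] Percentage working
[c02][OK][ILOStatusCheck] Percentage of working
"""

lines = pd.Series(log_text.strip().splitlines()).str.strip()

# Named groups become column names. Status and Name are optional, so a
# line without them leaves NaN there and keeps the rest in Reason.
pattern = (
    r'^\[(?P<Col>[^\]]+)\]'
    r'(?:\[(?P<Status>OK|FAIL)\])?'
    r'(?:\[(?P<Name>\w+Check)\])?'
    r'\s*(?P<Reason>.*)$'
)
parsed = lines.str.extract(pattern)
print(parsed)
```

For the "[c01][Stage 3] ..." line this keeps c01 in Col, leaves Status and Name empty, and stores the rest of the line, including "[Stage 3]", in Reason.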