我有下表,这是task-spooler的输出。
人类很容易解析,但我无法将其读入熊猫DF。
有什么想法吗?
ID State Output E-Level Times(r/u/s) Command [run=1/2]
6 running /tmp/ts-out.FzVneG [l1]python infloop.py
0 finished /tmp/ts-out.ixWHm2 0 0.00/0.00/0.00 bash -c echo 1
1 finished /tmp/ts-out.ZzwS11 0 0.00/0.00/0.00 bash -c echo 1
2 finished /tmp/ts-out.GJlyge 2 0.00/0.00/0.00 bash -c
4 finished /tmp/ts-out.lIVMYH 2 0.00/0.00/0.00 bash -c -h
5 finished /tmp/ts-out.8EKHy1 -1 141.23/0.00/0.00 python infloop.py
3 finished /tmp/ts-out.lBr4Wy -1 2545.36/0.00/0.02 bash -c python infloop.py
7 finished /tmp/ts-out.kxCczi 2 0.01/0.00/0.00 bash -c
8 finished /tmp/ts-out.3VkfNh 0 0.00/0.00/0.00 echo
9 finished /tmp/ts-out.8ewxzl 0 0.01/0.00/0.00 echo
10 finished /tmp/ts-out.ahSLaY 0 0.00/0.00/0.00 bash -c echo $GPUID
11 finished /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0 0.00/0.00/0.00 bash -c ls
12 finished /tmp/ts-out.ADWkve 0 0.00/0.00/0.00 bash -c ls
13 finished /a/home/cc/cs/yuvval/tmp/ts-out.xm0jtn -1 130.67/0.00/0.02 bash -c python infloop.py
14 finished /tmp/ts-out.HxBqkm 0 0.00/0.00/0.00 bash -c echo 11
15 finished /tmp/ts-out.ERNuaE 0 0.00/0.00/0.00 bash -c echo
16 finished /tmp/ts-out.9j6hkS 0 0.00/0.00/0.00 bash -c echo $GPUID
17 finished /tmp/ts-out.Y5QDNa 0 0.00/0.00/0.00 bash -c echo $GPUID
18 finished /tmp/ts-out.EIHhoX -1 0.00/0.00/0.00 %s
19 finished /tmp/ts-out.LLw2Wl -1 0.00/0.00/0.00
20 finished /tmp/ts-out.deWAJR -1 0.01/0.00/0.00 echo $GPUID
21 finished /tmp/ts-out.AdZFIf -1 0.00/0.00/0.00 echo 12
22 finished /tmp/ts-out.NBOCVv 0 0.00/0.00/0.00 echo 12
23 finished /tmp/ts-out.5WpfPu 0 0.00/0.00/0.00 echo
24 finished /tmp/ts-out.1lw4bS -1 0.00/0.00/0.00 echo
25 finished /tmp/ts-out.7MNGLQ 0 0.00/0.00/0.00 bash -c echo $GPUID
26 finished /tmp/ts-out.8FZ3on 0 0.00/0.00/0.00 bash -c echo $GPUID
我最好的尝试是:
from StringIO import StringIO as sIO
std = ... # the table text
pd.read_table(sIO(std), sep='\s+', engine='python')
错误:
ValueError: Expected 7 fields in line 2, saw 9
修改: 生成表的源代码可用。以下是生成每一行的命令。这有助于将表格读取到数据框吗?
if (p->label)
snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s[%s]"
"%s\n",
p->jobid,
jobstate,
output_filename,
p->result.errorlevel,
p->result.real_ms,
p->result.user_ms,
p->result.system_ms,
dependstr,
p->label,
p->command);
else
snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s\n",
p->jobid,
jobstate,
output_filename,
p->result.errorlevel,
p->result.real_ms,
p->result.user_ms,
p->result.system_ms,
dependstr,
p->command);
答案 0 :(得分:0)
这有点烦人,但由于分隔符在输出日志中不一致(有时多个空格,有时标签和最后一列通常只有一个空格),因此很难解析而无需任何额外的在使用pandas解析文件之前应用于该文件的逻辑。
我个人不喜欢在python中打开文件来修复它,然后用pandas加载它,所以我只需要在我的管道中添加一个简短的sed
命令,然后在python中加载文件(这很简单如果你正在使用linux并且是从文件加载日志文本的话。
您可以添加:
cat logfile.log | sed -r 's/\s\s+/,/g' | sed -e 's/\([[:digit:]].[[:digit:]]\{2\}\) /\1,/' > logfile.csv
然后你只需用逗号替换所有空格以及最后一个有问题的空格。 然后该文字从:
开始ID State Output E-Level Times(r/u/s) Command [run=1/2]
6 running /tmp/ts-out.FzVneG [l1]python infloop.py
0 finished /tmp/ts-out.ixWHm2 0 0.00/0.00/0.00 bash -c echo 1
1 finished /tmp/ts-out.ZzwS11 0 0.00/0.00/0.00 bash -c echo 1
2 finished /tmp/ts-out.GJlyge 2 0.00/0.00/0.00 bash -c
4 finished /tmp/ts-out.lIVMYH 2 0.00/0.00/0.00 bash -c -h
5 finished /tmp/ts-out.8EKHy1 -1 141.23/0.00/0.00 python infloop.py
3 finished /tmp/ts-out.lBr4Wy -1 2545.36/0.00/0.02 bash -c python infloop.py
7 finished /tmp/ts-out.kxCczi 2 0.01/0.00/0.00 bash -c
8 finished /tmp/ts-out.3VkfNh 0 0.00/0.00/0.00 echo
对此:
ID,State,Output,E-Level,Times(r/u/s),Command [run=1/2]
6,running,/tmp/ts-out.FzVneG,[l1]python infloop.py
0,finished,/tmp/ts-out.ixWHm2,0,0.00/0.00/0.00,bash -c echo 1
1,finished,/tmp/ts-out.ZzwS11,0,0.00/0.00/0.00,bash -c echo 1
2,finished,/tmp/ts-out.GJlyge,2,0.00/0.00/0.00,bash -c
4,finished,/tmp/ts-out.lIVMYH,2,0.00/0.00/0.00,bash -c -h
5,finished,/tmp/ts-out.8EKHy1,-1,141.23/0.00/0.00,python infloop.py
3,finished,/tmp/ts-out.lBr4Wy,-1,2545.36/0.00/0.02,bash -c python infloop.py
7,finished,/tmp/ts-out.kxCczi,2,0.01/0.00/0.00,bash -c
8,finished,/tmp/ts-out.3VkfNh,0,0.00/0.00/0.00,echo
然后将其作为CSV加载到pandas中:
import pandas as pd
my_df = pd.read_csv(my_log_file)
我很抱歉这不是一个有趣的纯python解决方案,但在我看来,bash部分使python部分更容易。