Reading a complex table with pandas ('task-spooler')

Asked: 2017-02-13 10:21:56

Tags: python pandas taskscheduler

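I have the following table, which is the output of task-spooler.

It is easy for a human to parse, but I can't manage to read it into a pandas DataFrame.

Any ideas?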

ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
1    finished   /tmp/ts-out.ZzwS11   0        0.00/0.00/0.00 bash -c echo 1
2    finished   /tmp/ts-out.GJlyge   2        0.00/0.00/0.00 bash -c
4    finished   /tmp/ts-out.lIVMYH   2        0.00/0.00/0.00 bash -c -h
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
7    finished   /tmp/ts-out.kxCczi   2        0.01/0.00/0.00 bash -c
8    finished   /tmp/ts-out.3VkfNh   0        0.00/0.00/0.00 echo
9    finished   /tmp/ts-out.8ewxzl   0        0.01/0.00/0.00 echo
10   finished   /tmp/ts-out.ahSLaY   0        0.00/0.00/0.00 bash -c echo $GPUID
11   finished   /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0        0.00/0.00/0.00 bash -c ls
12   finished   /tmp/ts-out.ADWkve   0        0.00/0.00/0.00 bash -c ls
13   finished   /a/home/cc/cs/yuvval/tmp/ts-out.xm0jtn -1       130.67/0.00/0.02 bash -c python infloop.py
14   finished   /tmp/ts-out.HxBqkm   0        0.00/0.00/0.00 bash -c echo 11
15   finished   /tmp/ts-out.ERNuaE   0        0.00/0.00/0.00 bash -c echo 
16   finished   /tmp/ts-out.9j6hkS   0        0.00/0.00/0.00 bash -c echo $GPUID
17   finished   /tmp/ts-out.Y5QDNa   0        0.00/0.00/0.00 bash -c echo $GPUID
18   finished   /tmp/ts-out.EIHhoX   -1       0.00/0.00/0.00 %s
19   finished   /tmp/ts-out.LLw2Wl   -1       0.00/0.00/0.00 
20   finished   /tmp/ts-out.deWAJR   -1       0.01/0.00/0.00 echo $GPUID
21   finished   /tmp/ts-out.AdZFIf   -1       0.00/0.00/0.00 echo 12
22   finished   /tmp/ts-out.NBOCVv   0        0.00/0.00/0.00 echo 12
23   finished   /tmp/ts-out.5WpfPu   0        0.00/0.00/0.00 echo
24   finished   /tmp/ts-out.1lw4bS   -1       0.00/0.00/0.00 echo 
25   finished   /tmp/ts-out.7MNGLQ   0        0.00/0.00/0.00 bash -c echo $GPUID
26   finished   /tmp/ts-out.8FZ3on   0        0.00/0.00/0.00 bash -c echo $GPUID

My best attempt was:

import pandas as pd
from io import StringIO

std = ...  # the table text
pd.read_table(StringIO(std), sep=r'\s+', engine='python')

Error:

ValueError: Expected 7 fields in line 2, saw 9

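Edit: The source code that generates the table is available. Below is the code that produces each line. Does this help with reading the table into a DataFrame?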

if (p->label)
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s[%s]"
            "%s\n",
            p->jobid,
            jobstate,
            output_filename,
            p->result.errorlevel,
            p->result.real_ms,
            p->result.user_ms,
            p->result.system_ms,
            dependstr,
            p->label,
            p->command);
else
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s\n",
            p->jobid,
            jobstate,
            output_filename,
            p->result.errorlevel,
            p->result.real_ms,
            p->result.user_ms,
            p->result.system_ms,
            dependstr,
            p->command);

1 answer:

Answer 0: (score: 0)

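It is a bit annoying, but because the separator in the output log is inconsistent (sometimes multiple spaces, sometimes tabs, and usually just a single space before the last column), it is hard to parse without applying some extra logic to the file before handing it to pandas. Personally, I don't like opening the file in Python to fix it and then loading it with pandas, so I would just add a short sed command to my pipeline and then load the file in Python (this is simple if you are on Linux and are loading the log text from a file). You could add: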

cat logfile.log | sed -r 's/\s\s+/,/g' | sed -e 's/\([[:digit:]]\.[[:digit:]]\{2\}\) /\1,/' > logfile.csv

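This replaces every run of two or more whitespace characters with a comma, and then does the same for the last problematic single space (the one between the times and the command). The text then goes from this: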
ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
1    finished   /tmp/ts-out.ZzwS11   0        0.00/0.00/0.00 bash -c echo 1
2    finished   /tmp/ts-out.GJlyge   2        0.00/0.00/0.00 bash -c
4    finished   /tmp/ts-out.lIVMYH   2        0.00/0.00/0.00 bash -c -h
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
7    finished   /tmp/ts-out.kxCczi   2        0.01/0.00/0.00 bash -c
8    finished   /tmp/ts-out.3VkfNh   0        0.00/0.00/0.00 echo

to this:

ID,State,Output,E-Level,Times(r/u/s),Command [run=1/2]
6,running,/tmp/ts-out.FzVneG,[l1]python infloop.py
0,finished,/tmp/ts-out.ixWHm2,0,0.00/0.00/0.00,bash -c echo 1
1,finished,/tmp/ts-out.ZzwS11,0,0.00/0.00/0.00,bash -c echo 1
2,finished,/tmp/ts-out.GJlyge,2,0.00/0.00/0.00,bash -c
4,finished,/tmp/ts-out.lIVMYH,2,0.00/0.00/0.00,bash -c -h
5,finished,/tmp/ts-out.8EKHy1,-1,141.23/0.00/0.00,python infloop.py
3,finished,/tmp/ts-out.lBr4Wy,-1,2545.36/0.00/0.02,bash -c python infloop.py
7,finished,/tmp/ts-out.kxCczi,2,0.01/0.00/0.00,bash -c
8,finished,/tmp/ts-out.3VkfNh,0,0.00/0.00/0.00,echo
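
If sed is not available, the same two substitutions can be sketched with Python's `re` module. The function name `ts_to_csv` is made up for this illustration. It inherits the limitations visible in the output above: the `running` row ends up with only four fields, and a row whose output path is followed by a single space (such as job 11 in the question) keeps the path and the E-Level in one field.

```python
import re

def ts_to_csv(text):
    """Mimic the two sed substitutions on each line of `text`."""
    converted = []
    for line in text.splitlines():
        # sed -r 's/\s\s+/,/g': every run of 2+ whitespace characters -> comma
        line = re.sub(r"\s\s+", ",", line)
        # second sed: only the first "d.dd " on the line, i.e. the end of the
        # Times field, has its trailing space turned into a comma
        line = re.sub(r"(\d\.\d{2}) ", r"\1,", line, count=1)
        converted.append(line)
    return "\n".join(converted)
```

Calling `ts_to_csv(open('logfile.log').read())` and writing the result to `logfile.csv` should give the same file as the shell pipeline.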

Then load it into pandas as a CSV:

import pandas as pd

# logfile.csv is the file written by the sed pipeline above
my_df = pd.read_csv('logfile.csv')

I'm sorry this isn't a fun pure-Python solution, but in my opinion the bash part makes the Python part easier.
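
For anyone who does want a pure-Python route, here is a rough sketch that follows the `snprintf` format from the question instead of guessing at separators. The pattern `ROW`, the helpers `parse_ts_output` and `to_dataframe`, and the column names `ELevel`, `real`, `user` and `system` are all my own inventions for this example. It assumes the output path never contains whitespace and that a job without a result (such as the `running` one) simply has no E-Level or times. I have only checked it against rows like the ones shown in the question.

```python
import re

# Hypothetical pattern modelled on the format string
# "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s".
ROW = re.compile(
    r"^(?P<ID>\d+)\s+"
    r"(?P<State>\S+)\s+"
    r"(?P<Output>\S+)\s*"          # assumption: no whitespace in the path
    # E-Level and Times are optional: they are blank for running jobs.
    r"(?:(?P<ELevel>-?\d+)\s+"
    r"(?P<real>[\d.]+)/(?P<user>[\d.]+)/(?P<system>[\d.]+)\s?)?"
    r"(?P<Command>.*)$"
)

def parse_ts_output(text):
    """Return one dict per job row; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.rstrip())
        if match is None:          # header line or anything unexpected
            continue
        row = match.groupdict()
        row["ID"] = int(row["ID"])
        if row["ELevel"] is not None:
            row["ELevel"] = int(row["ELevel"])
            for key in ("real", "user", "system"):
                row[key] = float(row[key])
        rows.append(row)
    return rows

def to_dataframe(text):
    # pandas is imported here so the parser above works without it
    import pandas as pd
    return pd.DataFrame(parse_ts_output(text))

SAMPLE = """\
ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
11   finished   /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0        0.00/0.00/0.00 bash -c ls
"""
```

`to_dataframe(SAMPLE)` should then give one row per job, with missing values for the job that is still running. Unlike the sed version, this also copes with long output paths that are followed by only a single space, because the fields are matched by content and not by the width of the gap between them.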