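I have several hundred job-metric definitions like the one below in a single file, and I am trying to parse them into a formatted .csv document: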
Job Name Last Start Last End ST Run Pri/Xit
________________________________________________________________ ____________________ ____________________ __ _______ ___
B9043CC_APP_DMLD_025_FR_xpabbdu1_D 03/12/2014 18:21:32 03/12/2014 18:22:07 SU 49744331/3
Status/[Event] Time Ntry ES ProcessTime Machine
-------------- --------------------- -- -- --------------------- ----------------------------------------
[FORCE_STARTJOB] 03/12/2014 17:30:52 0 PD 03/12/2014 17:30:53
< >
STARTING 03/12/2014 17:30:53 1 PD 03/12/2014 17:30:53 ab-shared-batch
RUNNING 03/12/2014 17:31:06 1 PD 03/12/2014 17:31:07 ab-shared-batch
SUCCESS 03/12/2014 17:31:46 1 PD 03/12/2014 17:31:47
[FORCE_STARTJOB] 03/12/2014 18:16:06 0 PD 03/12/2014 18:16:07
< >
STARTING 03/12/2014 18:16:07 2 PD 03/12/2014 18:16:07 ab-shared-batch-
RUNNING 03/12/2014 18:16:19 2 PD 03/12/2014 18:16:20 ab-shared-batch-
FAILURE 03/12/2014 18:17:02 2 PD 03/12/2014 18:17:03
[*** ALARM ***]
JOBFAILURE 03/12/2014 18:17:03 2 PD 03/12/2014 18:17:04
[FORCE_STARTJOB] 03/12/2014 18:21:18 0 PD 03/12/2014 18:21:19
< >
STARTING 03/12/2014 18:21:19 3 PD 03/12/2014 18:21:19 ab-shared-batch-
RUNNING 03/12/2014 18:21:32 3 PD 03/12/2014 18:21:32 ab-shared-batch-
SUCCESS 03/12/2014 18:22:07 3 PD 03/12/2014 18:22:08
I would like my output to look like this:
System Number Job Name Target Machine Status Actual Start Date Actual Start Time Actual End Date Actual End Time
9043 B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch SUCCESS 03/12/2014 17:30:53 03/12/2014 17:31:47
9043 B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch FAILURE 03/12/2014 18:16:07 03/12/2014 18:17:03
9043 B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch SUCCESS 03/12/2014 18:21:19 03/12/2014 18:22:08
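The actual start/end times and actual start/end dates come from the "ProcessTime" column. I only want the data shown above, and I do not want any text containing "----" anywhere in the .csv file. As mentioned, I have several hundred of these definitions in one file.
I know Python has a built-in csv module, which I am using to write the column labels: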
import csv

# Path of the CSV file to create
outfile = '/home/n5acc7/test/output/testtest.csv'
f = open(outfile, 'wt')
try:
    writer = csv.writer(f)
    # Header row: one label per column of the desired output
    writer.writerow(('System Number', 'Job Name', 'Target Machine', 'Status',
                     'Actual Start Date', 'Actual Start Time',
                     'Actual End Date', 'Actual End Time'))
finally:
    f.close()
When it comes to the parsing itself, though, I don't know where to start. I am running Python 2.4.3.
Answer 0 (score: 2)
Parsing this looks fairly straightforward.
The general logic:
read six lines (header)
get system number and batch name
until end of file:
    read five lines
    get machine name, status, start and end dates and times
    if status is FAILURE
        read two lines (clear error message)
And some actual code (written for Python 2.7, though; you will have to backport it to Python 2.4 or switch to a newer Python):
INPUT = "/home/n5acc7/test/input/batch1.log"
OUTPUT = "/home/n5acc7/test/output/testtest.csv"
# Fixed-width layout for one output row
LINE = "{:<6} {:34} {:18} {:10} {:10} {:10} {:10} {:10}\n"

def get_lines(n, inf):
    # Read the next n lines; raises StopIteration at end of file
    return [next(inf) for _ in xrange(n)]

def read_header(inf):
    head = get_lines(6, inf)
    # The job name is the first field of the third header line
    job_name = head[2].split(None, 1)[0]
    # e.g. "B9043CC_..." -> "9043"
    system_num = job_name[1:5]
    return system_num, job_name

def read_record(inf):
    record = get_lines(5, inf)
    # STARTING line: ProcessTime date and time, then the machine name
    startline = record[2].split()
    sd, st, name = startline[5:8]
    # Final status line: status, then ProcessTime date and time
    endline = record[4].split()
    status = endline[0]
    ed, et = endline[5:7]
    # skip failure message
    if status == "FAILURE":
        get_lines(2, inf)
    return name, status, sd, st, ed, et

def parse_jobfile(fname):
    with open(fname) as inf:
        try:
            batch = read_header(inf)
            while True:
                job = read_record(inf)
                yield batch + job
        except StopIteration:
            # end of file
            pass

def main():
    with open(OUTPUT, "w") as outf:
        outf.write(LINE.format("SysNum", "Job Name", "Target Machine", "Status", "Start Date", "Start Time", "End Date", "End Time"))
        for result in parse_jobfile(INPUT):
            outf.write(LINE.format(*result))

if __name__ == "__main__":
    main()
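Since the question targets Python 2.4, here is a rough sketch of how a backport might look. It swaps the fixed line counts for a line-by-line scan, so it is a different technique from the code above, and it assumes that each job name sits on the line right after the "____" underline and that every run has a STARTING line followed by a SUCCESS or FAILURE line. The names parse_lines, write_csv, HEADER and FINAL_STATES are made up for illustration. It avoids "with", str.format and the next() builtin, none of which exist in 2.4, but I have not run it on 2.4 itself.

```python
import csv

HEADER = ("System Number", "Job Name", "Target Machine", "Status",
          "Actual Start Date", "Actual Start Time",
          "Actual End Date", "Actual End Time")

# Statuses that close a run (assumption: only these two appear in the sample)
FINAL_STATES = ("SUCCESS", "FAILURE")

def parse_lines(lines):
    """Return one tuple per run, in the column order of HEADER."""
    rows = []
    job_name = None
    expect_job = False   # True right after the "____" underline
    start = None         # (date, time, machine) of the run in progress
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        first = fields[0]
        if first.startswith("____"):
            expect_job = True
            continue
        if expect_job:
            # Assumption: the job name is the first field after the underline
            job_name = first
            expect_job = False
            continue
        if first == "STARTING" and len(fields) >= 8:
            # fields 5 and 6 are the ProcessTime date and time, 7 the machine
            start = (fields[5], fields[6], fields[7])
        elif first in FINAL_STATES and start is not None and len(fields) >= 7:
            rows.append((job_name[1:5], job_name, start[2], first,
                         start[0], start[1], fields[5], fields[6]))
            start = None
        # everything else (headers, "----", [EVENT] lines) is ignored
    return rows

def write_csv(rows, path):
    # On Python 2 the csv docs suggest "wb"; "w" is used here for brevity
    f = open(path, "w")
    try:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)
    finally:
        f.close()
```

Calling write_csv(parse_lines(open(INPUT)), OUTPUT) would then produce the file. Because the scan resets whenever it sees a new underline, it should also cope with several job definitions in one file, as long as they all follow the layout shown in the question.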
Answer 1 (score: 1)
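How comfortable are you with regular expressions? Python supports them, and Perl is well suited to file processing. A CSV file can be tab- or comma-delimited (with some differences in format), so once you have a file handle it is a very easy format to write. The choice of language does not have to hinge on its CSV features, as long as you are comfortable with it or it parses efficiently. As far as regular expressions go, here are some links to introductions (if you run into more specific parsing scenarios while settling on an approach, you can update the question so they can be addressed):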
perlreref, and there is more for Perl, for example:
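For Python, the regular-expression approach suggested in this answer might look roughly like the sketch below. It assumes every status line has the shape shown in the question (status, Time, Ntry, ES, ProcessTime, optional Machine); EVENT_RE and parse_event are illustrative names, not part of any library.

```python
import re

# One status line: STATUS date time ntry es process-date process-time [machine]
EVENT_RE = re.compile(
    r"^(?P<status>[A-Z_]+)\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<ntry>\d+)\s+(?P<es>\w+)\s+"
    r"(?P<pdate>\d{2}/\d{2}/\d{4})\s+(?P<ptime>\d{2}:\d{2}:\d{2})"
    r"(?:\s+(?P<machine>\S+))?\s*$"
)

def parse_event(line):
    """Return a dict of named fields, or None if the line is not a status line."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        return None
    # Groups that did not take part in the match (e.g. machine) map to None
    return m.groupdict()
```

Lines such as the "----" separators, the column headers and the bracketed [FORCE_STARTJOB] events do not match the pattern, so parse_event returns None for them and they can simply be skipped.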