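Parsing a formatted text file into CSV

Time: 2014-03-20 16:35:48

Tags: python parsing csv

I have a few hundred job metric definitions like the one below in a single file, and I am trying to parse them into a formatted .csv document: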

Job Name                                                         Last Start           Last End             ST Run     Pri/Xit
________________________________________________________________ ____________________ ____________________ __ _______ ___
B9043CC_APP_DMLD_025_FR_xpabbdu1_D                               03/12/2014 18:21:32  03/12/2014 18:22:07  SU 49744331/3

  Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
  --------------  --------------------- --  --  --------------------- ----------------------------------------
  [FORCE_STARTJOB]  03/12/2014 17:30:52    0  PD  03/12/2014 17:30:53
    < >
  STARTING        03/12/2014 17:30:53    1  PD  03/12/2014 17:30:53   ab-shared-batch
  RUNNING         03/12/2014 17:31:06    1  PD  03/12/2014 17:31:07   ab-shared-batch
  SUCCESS         03/12/2014 17:31:46    1  PD  03/12/2014 17:31:47
  [FORCE_STARTJOB]  03/12/2014 18:16:06    0  PD  03/12/2014 18:16:07
    < >
  STARTING        03/12/2014 18:16:07    2  PD  03/12/2014 18:16:07   ab-shared-batch-
  RUNNING         03/12/2014 18:16:19    2  PD  03/12/2014 18:16:20   ab-shared-batch-
  FAILURE         03/12/2014 18:17:02    2  PD  03/12/2014 18:17:03
  [*** ALARM ***]
    JOBFAILURE    03/12/2014 18:17:03    2  PD  03/12/2014 18:17:04
  [FORCE_STARTJOB]  03/12/2014 18:21:18    0  PD  03/12/2014 18:21:19
    < >
  STARTING        03/12/2014 18:21:19    3  PD  03/12/2014 18:21:19   ab-shared-batch-
  RUNNING         03/12/2014 18:21:32    3  PD  03/12/2014 18:21:32   ab-shared-batch-
  SUCCESS         03/12/2014 18:22:07    3  PD  03/12/2014 18:22:08

I would like my output to look like this:

System Number  Job Name                           Target Machine     Status     Actual Start Date     Actual Start Time      Actual End Date    Actual End Time
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch    SUCCESS       03/12/2014               17:30:53            03/12/2014         17:31:47
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch    FAILURE       03/12/2014               18:16:07            03/12/2014         18:17:03
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch    SUCCESS       03/12/2014               18:21:19            03/12/2014         18:22:08

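The actual start/end times and actual start/end dates come from the "ProcessTime" column. I only want the data shown above, and I do not want any text containing "----" anywhere in the .csv file. As mentioned, I have a few hundred of these definitions in one file.

I know Python has a built-in csv module, which I have used to write the column headers: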

import csv
import sys

infile = '/home/n5acc7/test/output/testtest.csv'
f = open(infile, 'wt')
try:
    writer = csv.writer(f)
    # Each header is its own item; a missing comma would silently join two adjacent strings
    writer.writerow( ('System Number', 'Job Name', 'Target Machine', 'Status', 'Actual Start Date', 'Actual Start Time', 'Actual End Date', 'Actual End Time',) )
finally:
    f.close()

But from a parsing standpoint, I don't know where to start. I am running Python 2.4.3.

2 answers:

Answer 0: (score: 2)

Parsing this looks fairly straightforward.

General logic:

read six lines (header)
get system number and batch name

until end of file:
    read five lines
    get machine name, status, start and end dates and times
    if status is FAILURE
        read two lines (clear error message)

And some actual code (targeting Python 2.7, though; you will have to do some backporting for Python 2.4, or switch to a newer Python):

INPUT = "/home/n5acc7/test/input/batch1.log"
OUTPUT = "/home/n5acc7/test/output/testtest.csv"

LINE = "{:<6} {:34} {:18} {:10} {:10} {:10} {:10} {:10}\n"

def get_lines(n, inf):
    return [next(inf) for _ in xrange(n)]

def read_header(inf):
    head = get_lines(6, inf)
    job_name = head[2].split(None, 1)[0]
    system_num = job_name[1:5]
    return system_num, job_name

def read_record(inf):
    record    = get_lines(5, inf)
    startline = record[2].split()
    sd, st, name = startline[5:8]
    endline   = record[4].split()
    status    = endline[0]
    ed, et    = endline[5:7]
    # skip failure message
    if status == "FAILURE":
        get_lines(2, inf)
    return name, status, sd, st, ed, et

def parse_jobfile(fname):
    with open(fname) as inf:
        try:
            batch = read_header(inf)
            while True:
                job = read_record(inf)
                yield batch + job
        except StopIteration:
            # end of file
            pass

def main():
    with open(OUTPUT, "w") as outf:
        outf.write(LINE.format("SysNum", "Job Name", "Target Machine", "Status", "Start Date", "Start Time", "End Date", "End Time"))
        for result in parse_jobfile(INPUT):
            outf.write(LINE.format(*result))

if __name__ == "__main__":
    main()
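
The code above writes fixed-width columns rather than comma-separated values. If a real .csv file is wanted, the rows could instead go through the csv module mentioned in the question. The following is only a minimal sketch under some assumptions: `write_csv` and `HEADER` are hypothetical names, the rows are assumed to arrive in the same order that `parse_jobfile()` yields them, and the demo uses `io.StringIO` on Python 3 in place of a real output file.

```python
import csv
import io

# Column headers, in the order the parsed tuples are assumed to arrive
HEADER = ["System Number", "Job Name", "Target Machine", "Status",
          "Actual Start Date", "Actual Start Time",
          "Actual End Date", "Actual End Time"]

def write_csv(rows, outf):
    """Write the header and one CSV row per parsed run to the open file outf."""
    writer = csv.writer(outf, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        writer.writerow(row)

# One hand-written row taken from the first run in the question's sample data
sample = [("9043", "B9043CC_APP_DMLD_025_FR_xpabbdu1_D", "ab-shared-batch",
           "SUCCESS", "03/12/2014", "17:30:53", "03/12/2014", "17:31:47")]

buf = io.StringIO()
write_csv(sample, buf)
print(buf.getvalue())
```

Letting `csv.writer` build each line means any field that happens to contain a comma or a quote is escaped automatically, which hand-formatted strings would not do.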

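Answer 1: (score: 1)

Have you considered using regular expressions? Python supports them, and Perl is excellent for file processing. CSV files can be tab- or comma-delimited (with some differences in format), so if you have a file handle, it is a very easy format to write. Your choice of language does not have to be driven by its CSV features, as long as you are comfortable with it or it parses efficiently. As far as regular expressions go, here are some links to introductions (if you run into more specific parsing scenarios once you have settled on an approach, this can be updated to address them):

Python re

perlreref, and there is more for Perl, for example: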

perlre

Understand basic Regex
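
As a rough illustration of the regular-expression approach suggested in this answer, the sketch below matches job summary lines and status lines and pairs each STARTING line with the next SUCCESS or FAILURE line. It is not taken from either answer: `JOB_RE`, `STATUS_RE`, and `parse_runs` are hypothetical names, the patterns are assumptions based only on the sample shown in the question, and the system number is assumed to be characters 1 to 4 of the job name.

```python
import re

# Job summary line: a name in column 0 followed by a "MM/DD/YYYY HH:MM:SS" stamp
JOB_RE = re.compile(r"^(\S+)\s+\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\s+")

# Status line: status, event date/time, Ntry, ES, ProcessTime date/time, optional machine
STATUS_RE = re.compile(
    r"^\s+(STARTING|RUNNING|SUCCESS|FAILURE)\s+"
    r"\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}:\d{2}\s+\d+\s+\S+\s+"
    r"(\d{2}/\d{2}/\d{4})\s+(\d{2}:\d{2}:\d{2})\s*(\S*)"
)

def parse_runs(lines):
    """Return one tuple per completed run found in an iterable of text lines."""
    rows = []
    job_name = None
    start = None  # (date, time, machine) taken from the latest STARTING line
    for line in lines:
        m = JOB_RE.match(line)
        if m:
            # A new job definition begins; forget any unfinished run
            job_name = m.group(1)
            start = None
            continue
        m = STATUS_RE.match(line)
        if not m or job_name is None:
            continue  # headers, separators, [EVENT] and alarm lines are skipped
        status, pdate, ptime, machine = m.groups()
        if status == "STARTING":
            start = (pdate, ptime, machine)
        elif status in ("SUCCESS", "FAILURE") and start is not None:
            # Assumption: the system number is characters 1-4 of the job name
            rows.append((job_name[1:5], job_name, start[2], status,
                         start[0], start[1], pdate, ptime))
            start = None
    return rows

# A shortened copy of the first run from the question's sample
sample = [
    "B9043CC_APP_DMLD_025_FR_xpabbdu1_D    03/12/2014 18:21:32  03/12/2014 18:22:07  SU 49744331/3",
    "  [FORCE_STARTJOB]  03/12/2014 17:30:52    0  PD  03/12/2014 17:30:53",
    "  STARTING        03/12/2014 17:30:53    1  PD  03/12/2014 17:30:53   ab-shared-batch",
    "  RUNNING         03/12/2014 17:31:06    1  PD  03/12/2014 17:31:07   ab-shared-batch",
    "  SUCCESS         03/12/2014 17:31:46    1  PD  03/12/2014 17:31:47",
]
print(parse_runs(sample))
```

Because the loop keys on what each line looks like rather than on fixed line counts, it should tolerate several job definitions in one file, although that has only been checked against the single sample in the question.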