Reading blocks of lines from a file in Python

Date: 2014-12-08 10:41:37

Tags: python text-processing

I have a script that produces the following output (I also save this output to a file, f1 = 20141202.194812_carStatus):

---------------------------------------------
TM 05120970.01: Processing...
TM 05120970: Processing...
TM 05120970: current status Open
TM 05120970: Owner_Info.User_ref = crossi14
TM 05120970: Owner_Info.Email = Criss.Rossi@gmail.com
TM 05120970: CarModel = Nissan Micra
----------------------------------------------
TM 05157414.06: Processing...
TM 05157414: Processing...
TM 05157414: current status Open
TM 05157414: Owner_Info.User_ref = yumiao12
TM 05157414: Owner_Info.Email = Yu.Miao@gmail.com
TM 05157414: CarModel = Toyota Avensis
----------------------------------------------

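I have used exec_cmd('cat ' + f1 + '| grep -e "CarModel = " -e "Owner_Info.User_ref = "'), but I also need the first line of each block (the second line of the output, right after the dashed delimiter):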

TM 05157414.06: Processing...

What I am trying to do is parse each block and store its values in variables:

TM 05120970.01 -> car_number = 05120970.01

Owner_Info.User_ref = crossi14 -> owner_user = crossi14

CarModel = Nissan Micra -> car_model = Nissan Micra

With this information, I will add some default values, such as:

priority = Unknown

I then need to pass these variables as input to another script called insert_owner_car.pl:
 insert_owner_car.pl -id 05120970.01 -o owner_user="crossi14",car_model="Nissan Micra",priority="Unknown"

This is what I have so far, but it does not work because I cannot get the values listed above:

#!/usr/bin/python

import itertools, commands, datetime, os, re, sys, time

inFile = open("/tmp/20141202.194812_carStatus")
outFile = open("result.txt", "w")
keepCurrentSet = False
for line in inFile:
    if line.startswith("----------------------------------------------"):
        keepCurrentSet = False
    if keepCurrentSet:
        parts = line.split(" = ")[1:]
        part=','.join(parts)
        print part
#outFile.write(parts)   
    if line.startswith("----------------------------------------------"):
        keepCurrentSet = True
inFile.close()
outFile.close()

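I don't know how to get 05120970.01, or how to collect all the variables of one block so that I can use them as input to the other script.

PS: I have Python 2.5.1.

2 Answers:

Answer 0: (score: 0)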

You can use the utility function open_chunk to process the file in chunks:

import re
import subprocess

def open_chunk(readfunc, delimiter, chunksize=1024):
    """
    readfunc(chunksize) should return a string.
    """
    remainder = ''
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        for piece in pieces[:-1]:
            yield piece
        remainder = pieces[-1]
    if remainder:
        yield remainder

filename = '/tmp/20141202.194812_carStatus'  # path used in the question
f = open(filename, 'r')
for chunk in open_chunk(f.read, delimiter=r'-{45,}'):
    chunk = chunk.strip()
    if chunk:
        lines = chunk.splitlines()
        firstline = lines[0]
        car_number = firstline.split()[1][:-1]
        for line in lines[1:]:
            if 'Owner_Info.User_ref = ' in line:
                owner_user = line.split(" = ")[1]
            elif 'CarModel = ' in line:
                car_model =  line.split(" = ")[1]
        cmd = ['insert_owner_car.pl'
               , '-id'
               , car_number
               , '-o'
               , 'owner_user="%s"' % (owner_user, )
               , 'car_model="%s"' % (car_model, )
               , 'priority="Unknown"']
        print(' '.join(cmd))
        # subprocess.call(cmd)
f.close()

This prints:

insert_owner_car.pl -id 05120970.01 -o owner_user="crossi14" car_model="Nissan Micra" priority="Unknown"
insert_owner_car.pl -id 05157414.06 -o owner_user="yumiao12" car_model="Toyota Avensis" priority="Unknown"

If your data file is small, you can read the whole file into a string and then split it into blocks with re.split:

In [37]: import re

In [38]: re.split(r'-{45,}', open('data').read())
Out[38]: 
['\n\n',
 '\nTM 05120970.01: Processing...\nTM 05120970: Processing...\nTM 05120970: current status Open\nTM 05120970: Owner_Info.User_ref = crossi14\nTM 05120970: Owner_Info.Email = Criss.Rossi@gmail.com\nTM 05120970: CarModel = Nissan Micra\n',
 '\nTM 05157414.06: Processing...\nTM 05157414: Processing...\nTM 05157414: current status Open\nTM 05157414: Owner_Info.User_ref = yumiao12\nTM 05157414: Owner_Info.Email = Yu.Miao@gmail.com\nTM 05157414: CarModel = Toyota Avensis\n',
 '\n']

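This can be used in place of open_chunk above. The advantage of open_chunk is that it also works on very large files, where reading the whole file into a string and splitting it into a list would take too much memory.

Answer 1: (score: 0)

You should use the re module to extract the relevant information: it is standard, simple, and robust. You can emit a block's information each time you reach a block delimiter, and add one final check at the end of the file to catch the last block.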
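As a rough sketch of the whole-file re.split approach, the blocks can be collected into one dictionary per car, so the values of a block stay together. The names parse_blocks, FIELDS, and SAMPLE are hypothetical and not part of the original answer. The sketch assumes that the first non-empty line of each block carries the full car number and that the delimiter is a line of at least ten dashes.

```python
import re

# Sample text copied from the question; in practice this would be
# the content of the status file.
SAMPLE = """\
---------------------------------------------
TM 05120970.01: Processing...
TM 05120970: Processing...
TM 05120970: current status Open
TM 05120970: Owner_Info.User_ref = crossi14
TM 05120970: Owner_Info.Email = Criss.Rossi@gmail.com
TM 05120970: CarModel = Nissan Micra
----------------------------------------------
TM 05157414.06: Processing...
TM 05157414: Processing...
TM 05157414: current status Open
TM 05157414: Owner_Info.User_ref = yumiao12
TM 05157414: Owner_Info.Email = Yu.Miao@gmail.com
TM 05157414: CarModel = Toyota Avensis
----------------------------------------------
"""

# Assumption: a delimiter is a whole line made of at least ten dashes.
DELIMITER = re.compile(r'^-{10,}[ \t]*$', re.MULTILINE)

# Hypothetical mapping from field names in the file to variable names.
FIELDS = {'Owner_Info.User_ref': 'owner_user', 'CarModel': 'car_model'}


def parse_blocks(text):
    """Return one dict per block, keyed by the names in FIELDS."""
    records = []
    for block in DELIMITER.split(text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue  # empty piece before the first or after the last delimiter
        m = re.match(r'TM\s+([^\s:]+):', lines[0])
        if m is None:
            continue  # not a block in the expected format
        record = {'car_number': m.group(1), 'priority': 'Unknown'}
        for ln in lines[1:]:
            if ' = ' not in ln:
                continue
            key, value = ln.split(' = ', 1)
            key = key.split(': ', 1)[-1]  # drop the "TM 05120970: " prefix
            if key in FIELDS:
                record[FIELDS[key]] = value
        records.append(record)
    return records


for record in parse_blocks(SAMPLE):
    print('%s %s %s' % (record['car_number'],
                        record.get('owner_user'),
                        record.get('car_model')))
```

Using record.get for the optional fields means a block that lacks a CarModel line yields None for that value instead of silently reusing the value from the previous block.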

The script would be:

import re

# Raw strings keep the backslashes intact; the dot in Owner_Info.User_ref is escaped
rnum = re.compile(r'\s*TM\s+([^\s:]+):.*')
ruser = re.compile(r'.*Owner_Info\.User_ref\s*=\s*(.*)')
rmodel = re.compile(r'.*CarModel\s*=\s*(.*)')


def display(out, num, user, model):
    # Format as one string: on Python 2, print(num, user, model) would print a tuple
    print('%s %s %s' % (num, user, model))
    out.write('insert_owner_car.pl -id %s -o owner_user="%s",car_model="%s",priority="Unknown"\n' % (num, user, model))

inFile = open("/tmp/20141202.194812_carStatus")
outFile = open("result.txt", "w")
firstOfBlock = False
carnum = None
for line in inFile:
    if line.startswith("--------------------------------"):
        firstOfBlock = True
        if carnum is not None:
            display(outFile, carnum, user, model)
            carnum = None
    else:
        if firstOfBlock:
            m = rnum.match(line)
            if m is not None:
                carnum = m.group(1)
                firstOfBlock = False
        else:
            line = line.strip()
            m = ruser.match(line)
            if m is not None:
                user = m.group(1)
            else:
                m = rmodel.match(line)
                if m is not None:
                    model = m.group(1)

if carnum is not None:
    display(outFile, carnum, user, model)
    carnum = None

inFile.close()
outFile.close()

With the current example, the output is:

05120970.01 crossi14 Nissan Micra
05157414.06 yumiao12 Toyota Avensis

and result.txt contains:

insert_owner_car.pl -id 05120970.01 -o owner_user="crossi14",car_model="Nissan Micra",priority="Unknown"
insert_owner_car.pl -id 05157414.06 -o owner_user="yumiao12",car_model="Toyota Avensis",priority="Unknown"
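If the commands are to be run directly from Python instead of being written to result.txt, one detail may matter: a shell removes the double quotes before the script sees its arguments, while a list passed to subprocess is delivered as-is. The sketch below uses shlex.split to show the argument list a POSIX shell would produce from one line of result.txt and builds the same list directly. The helper build_argv is hypothetical, and how insert_owner_car.pl parses its -o value is an assumption, since the script itself is not shown.

```python
import shlex


def build_argv(car_number, owner_user, car_model, priority='Unknown'):
    """Build the argument list without shell quoting (hypothetical helper)."""
    options = 'owner_user=%s,car_model=%s,priority=%s' % (
        owner_user, car_model, priority)
    return ['insert_owner_car.pl', '-id', car_number, '-o', options]


# One line of result.txt, as a shell would receive it.
line = ('insert_owner_car.pl -id 05120970.01 -o '
        'owner_user="crossi14",car_model="Nissan Micra",priority="Unknown"')

argv_from_shell = shlex.split(line)   # what a POSIX shell passes to the script
argv_direct = build_argv('05120970.01', 'crossi14', 'Nissan Micra')

print(argv_from_shell)
print(argv_direct == argv_from_shell)

# To actually run the script (assuming it is on PATH):
# import subprocess
# subprocess.call(argv_direct)
```

Passing a list keeps "Nissan Micra" inside a single argument without any quoting, which avoids problems with values that contain spaces or quote characters.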