Splitting the contents of a .txt file into multiple cells in a .csv file

Time: 2017-07-21 07:08:32

Tags: python csv parsing text

I'm using Python 2.7, and I have a txt file like the one below, which I open with Python:

TIME    FLIGHT  FROM    AIRLINE AIRCRAFT        STATUS
8:40 AM LH1334  
Frankfurt (FRA)
Lufthansa   A320 (D-AIPP)   
Landed 8:40 AM
8:45 AM OK786   
Prague (PRG)
Czech Airlines  AT45 (OK-KFP)   
Landed 8:32 AM

I want to export it to a CSV file with six columns (TIME, FLIGHT, FROM, AIRLINE, AIRCRAFT, STATUS) in the correct layout, so that I get this:

TIME            FLIGHT  FROM            AIRLINE         AIRCRAFT      STATUS
Jul 21 8:40 AM  LH1334  Frankfurt (FRA) Lufthansa   A320 (D-AIPP) Landed 8:40 AM
...

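This is a bit difficult for me because a single line can contain several words, so I don't have any useful idea of how to get the data into this form.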

My code:

import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd

def to_2d(l,n):
    return [l[i:i+n] for i in range(0, len(l), n)]

f = open('proba.txt', 'r')
x = f.read()

filename=r'output.csv'

resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')

maindatatable = to_2d(x, 6)
print maindatatable
output.writerows(x)

resultcsv.close()

1 answer:

Answer 0: (score: 0)

It looks like each record is spread over four lines.

We can handle the first line

8:40 AM LH1334

as follows:

import re

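# raw string so the backslashes reach the regex engine unchanged;
# \s+ accepts either a space or a tab between the time and the flight number
matches = re.match(r'(\d{1,2}:\d{2} [AP]M)\s+(\w+\d+)', line)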
time = matches.group(1)
flight = matches.group(2)

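Edit: that is overkill. There is a tab separating the two fields, so it is actually easy (strip the line first, since it ends with a trailing tab and newline):

time, flight = line.strip().split('\t')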
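The reason for stripping before splitting can be seen in a small sketch. The sample line here is hypothetical: it assumes, based on the trailing whitespace in the question's listing, that each data line ends with a tab followed by a newline. The name `time_` is only used to avoid shadowing the `time` module.

```python
# Hypothetical sample line; the trailing tab and newline are an assumption
# based on how the question's listing appears.
line = '8:40 AM\tLH1334\t\n'

# Splitting directly leaves a third field holding only the newline,
# so unpacking into two names would raise ValueError.
raw_fields = line.split('\t')

# Stripping first removes the trailing tab and newline.
clean_fields = line.strip().split('\t')

time_, flight = clean_fields
```

If the real file turns out to have no trailing tab, the stripped version still behaves the same, which is why it is the safer default.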

The second line:

Frankfurt (FRA)

is simple (strip it to drop the trailing newline):

from_ = line.strip()

The third line:

Lufthansa   A320 (D-AIPP)

can be handled with:

airline, aircraft = line.strip().split('\t')

The fourth line:

Landed 8:40 AM

is also simple:

status = line.strip()

Putting it all together, you can process the file four lines at a time:

from itertools import islice

with open('my.txt') as f:
    header = f.readline()  # skip header

    while True:
        # read four lines and strip the trailing tab/newline from each
        lines = [l.strip() for l in islice(f, 4)]
        if len(lines) < 4:
            break

        time, flight = lines[0].split('\t')
        from_ = lines[1]
        airline, aircraft = lines[2].split('\t')
        status = lines[3]

        # Output a row into your csv file here
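To fill in the final step, here is a minimal sketch of the whole conversion. It is not the only way to do this and rests on a few assumptions: it targets Python 3 and the standard-library `csv` module rather than `unicodecsv` on Python 2.7, it keeps the `;` delimiter from the question, and the names `parse_flights`, `convert`, and `HEADER` are hypothetical helpers introduced for illustration. It does not add the date prefix shown in the desired output.

```python
import csv
from itertools import islice

# Column names taken from the header row in the question.
HEADER = ['TIME', 'FLIGHT', 'FROM', 'AIRLINE', 'AIRCRAFT', 'STATUS']


def parse_flights(lines):
    """Yield one six-field row for every four-line record."""
    it = iter(lines)
    next(it, None)  # skip the header line of the text file
    while True:
        # Take the next four lines, dropping trailing tabs and newlines.
        chunk = [l.strip() for l in islice(it, 4)]
        if len(chunk) < 4:
            break  # end of input or an incomplete record
        time_, flight = chunk[0].split('\t')
        airline, aircraft = chunk[2].split('\t')
        yield [time_, flight, chunk[1], airline, aircraft, chunk[3]]


def convert(src_path, dst_path):
    """Read the text file at src_path and write a ;-delimited CSV."""
    with open(src_path) as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst, delimiter=';')
        writer.writerow(HEADER)
        writer.writerows(parse_flights(src))
```

Calling `convert('proba.txt', 'output.csv')` would then produce one CSV row per flight, assuming the input really is tab-separated as described above. Keeping the parsing in a generator separate from the file handling makes it possible to check the parser against a plain list of strings.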