Processing data delimited by rows and columns with Python

Time: 2018-10-21 15:10:11

Tags: python text

I have data like this, where the records are separated by date-time rows:

01-Jan-1990 00:00:01 ABCD
A  abcde fghijk lmnopq
     hsjfne qqq                     # EDITED WITH ADDITIONAL SPILL-OVER DATA with \t
B abcde fghijk lmnopq
01-Jan-1990 00:00:05 ABCD
A ancfjhr sfjerhj egen
C etfhw3uh uhuefwh fewvjh dfeg efwbywgefb
D wrf fcwewe fvwefwe fwef
01-Jan-1990 00:00:07 ABCD
A wfw fbebwu
B fewhuf ifgiwejhifgj fijweij

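I would like to clean it so that the A, B, C, etc. at the start of each line after a date-time row goes into one column, the values after A, B, C go into a second column, and the date-time is captured and entered as a third column. Like this: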

A,abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 #WOULD LIKE TO COMBINE THE SPILL DATA
B,abcde fghijk lmnopq, 01-Jan-1990 00:00:01
A,ancfjhr sfjerhj egen,01-Jan-1990 00:00:05
C,etfhw3uh uhuefwh fewvjh dfeg efwbywgefb,01-Jan-1990 00:00:05
D,wrf fcwewe fvwefwe fwef,01-Jan-1990 00:00:05

etc etc etc

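I would appreciate it if someone could guide me. I tried to read the file with pattern matching and then grab the lines that follow, but could not get it to work.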

import re
# Log reading

with open("IDM.txt", "r") as log:
    for line in log:
        splitLine = line.split()
        if not splitLine:  # skip blank lines
            continue
        # Match a leading dd-MON-yyyy token such as 01-Jan-1990 (case-insensitive)
        datematch = re.match(
            r'^(0?[1-9]|[12][0-9]|3[01])-'
            r'(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)-\d{4}$',
            splitLine[0], re.IGNORECASE)
        if datematch:
            print(line)

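I understand that the code above is far from what I want to achieve, but I hope you can guide me, and it shows that I have tried something. Thank you for your time.

Edited: included a third line of data to show the spill-over values from the second line, preceded by a \t tab character.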

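2 Answers:

Answer 0: (score: 1)

It is always a good idea to open files using with open(); you can then parse the lines into a list as needed. In my case, I just check whether the first 2 characters of a line are digits; if they are, the value is stored so it can be added later to the rows that need it: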

content = []

with open('IDM.txt', 'r') as f:
    lines = f.readlines()

date = ''
for idx, line in enumerate(lines):
    if line[:2].isdigit():
        # Date-time row: keep the 20-character timestamp for the rows that follow
        date = line[:20]
    elif line[:1].isspace():
        # Blank line or spill-over line; spill-over text is merged by the look-ahead below
        continue
    else:
        data = line[0] + ', ' + line[1:].strip()
        # A following line indented with a space or tab is spill-over data for this row
        if idx + 1 < len(lines) and lines[idx + 1][0] in ' \t':
            data += ' ' + lines[idx + 1].strip()
        content.append(data + ', ' + date)


with open('IDMOutput.csv', 'w') as f:
    for line in content:
        f.write("%s\n" % line)

>>> content
['A, abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01',
 'B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01',
 'A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05',
 'C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05',
 'D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05',
 'A, wfw fbebwu, 01-Jan-1990 00:00:07',
 'B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07']

Edit: added rstrip to remove '\n', included the timestamp, and updated the code and output to handle the spill-over data.
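
Since the target is a CSV file, the rows could also be written with the csv module, which quotes any field that happens to contain a comma. This is only a minimal sketch: it assumes the rows are held as (label, text, timestamp) tuples, and the helper name rows_to_csv and the second sample row are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, out):
    """Write (label, text, timestamp) tuples to the file-like object `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)


# Hypothetical rows in the shape produced by the parsing step above;
# the second row is invented to show how an embedded comma is quoted.
rows = [
    ("A", "abcde fghijk lmnopq hsjfne qqq", "01-Jan-1990 00:00:01"),
    ("B", "abcde, with a comma", "01-Jan-1990 00:00:01"),
]

buffer = io.StringIO()  # stands in for open('IDMOutput.csv', 'w', newline='')
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, passing newline='' to open() is what the csv documentation recommends so that line endings are not translated twice.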

Answer 1: (score: 0)

  

Another simple approach is to use regular expressions (see Regular Expression HOWTO and Print lists in Python):

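  • Read from the .txt file IDM.txt
  • Removed the leading whitespace with lstrip()
  • Created the regular expression pattern_num to find lines that start with a digit
  • Formatted the log string as the OP requested
  • Wrote the final result to IDM_clean.txt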
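
The date-row test from the steps above can be tried on its own. A small sketch using sample lines from the question; the stricter pattern_date is an assumption added here for illustration and is not part of this answer.

```python
import re

# The answer's test: the line starts with a digit
pattern_num = re.compile(r'^[0-9]')
# Hypothetical stricter test: dd-Mon-yyyy hh:mm:ss at the start of the line
pattern_date = re.compile(r'^\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}')

sample = [
    "01-Jan-1990 00:00:01 ABCD",   # date-time row
    "A  abcde fghijk lmnopq",      # data row
    "     hsjfne qqq".lstrip(),    # spill-over row after lstrip()
]

is_date_row = [bool(pattern_num.match(s)) for s in sample]
is_strict_date_row = [bool(pattern_date.match(s)) for s in sample]
print(is_date_row)
print(is_strict_date_row)
```

Both tests agree on this sample; the stricter pattern would only matter if a data row could itself begin with a digit.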

  

Update: final solution, generalized:

import re


pattern_num = re.compile(r'^[0-9]')  # pattern we look for: a line starting with a digit

log_list = []

with open("IDM.txt", "r") as f:
    lines = f.read().split("\n")

# Merge each indented spill-over line into the data line before it
file_as_list = []
for line in lines:
    if not line.strip():
        continue  # skip blank lines
    if line.startswith((" ", "\t")) and file_as_list:
        file_as_list[-1] = file_as_list[-1].rstrip() + " " + line.strip()
    else:
        file_as_list.append(line)

datos = ""
for l in file_as_list:
    if re.match(pattern_num, l):
        datos = l  # remember the current date-time row
    else:
        info = l[0] + ', ' + l[1:].lstrip()
        log_list.append(info + ', ' + datos)

log = '\n'.join(map(str, log_list))

with open("IDM_clean.txt", "w") as f:
    f.write(log + "\n")  # write the result to the file


print("-----------------------------------")
print(type(log))
print("------------------------------------------------------------------------")
print(log)  # print the desired format
print("------------------------------------------------------------------------")
Out:
----------------------------------
<class 'str'>
-----------------------------------------------------------------------
A, abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 ABCD
B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01 ABCD
A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05 ABCD
C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05 ABCD
D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05 ABCD
A, wfw fbebwu, 01-Jan-1990 00:00:07 ABCD
B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07 ABCD
-----------------------------------------------------------------------

Contents of the output file:

A, abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 ABCD
B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01 ABCD
A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05 ABCD
C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05 ABCD
D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05 ABCD
A, wfw fbebwu, 01-Jan-1990 00:00:07 ABCD
B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07 ABCD
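
Both answers merge a single spill-over line per row. If a row can spill over several lines, one possible variation is sketched below. It is an assumption-laden illustration, not either author's method: the function name parse_log is invented, the second spill-over line in the sample is made up, and the %b month parsing assumes an English/C locale.

```python
from datetime import datetime


def parse_log(lines):
    """Return (label, text, timestamp) tuples from an iterable of log lines."""
    records = []
    stamp = None
    open_record = False  # True while the previous non-blank line belongs to a record
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t":
            # Spill-over line: append its text to the most recent record
            if open_record:
                label, text, when = records[-1]
                records[-1] = (label, text + " " + line.strip(), when)
            continue
        try:
            # A date-time row starts with e.g. "01-Jan-1990 00:00:01"
            datetime.strptime(line[:20], "%d-%b-%Y %H:%M:%S")
        except ValueError:
            # Not a date-time row, so treat it as "<label> <text>"
            label, _, text = line.partition(" ")
            records.append((label, text.strip(), stamp))
            open_record = True
        else:
            stamp = line[:20]
            open_record = False
    return records


sample = [
    "01-Jan-1990 00:00:01 ABCD\n",
    "A  abcde fghijk lmnopq\n",
    "\thsjfne qqq\n",
    "\tmore spill\n",  # made-up second spill-over line
    "B abcde fghijk lmnopq\n",
    "01-Jan-1990 00:00:05 ABCD\n",
    "A ancfjhr sfjerhj egen\n",
]

records = parse_log(sample)
for label, text, when in records:
    print(f"{label},{text},{when}")
```

Keeping the records as tuples rather than pre-joined strings leaves the choice of output format (plain join, csv.writer, and so on) to the caller.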