Writing only the strings that match a regular expression, from varying columns, to an output CSV in Python

Time: 2016-12-16 12:03:19

Tags: python csv

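I have a dataset with more than a million rows. Unfortunately, the columns are not uniform, so the information I'm looking for sometimes appears in different columns. Can I somehow filter it out and write only that to the output CSV?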

The dataset consists of strings like this:

07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)

What I want to write to the output CSV:

Date;Request(in ms)   
07.11.2016;20;
07.11.2016;332;
07.11.2016;7292;
07.11.2016;3213;
07.11.2016;435;

Regular expression for the date:

(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[012])\.(19|20)\d\d

My regular expression for the request duration:

[\0-9]+[ s]

My code:

import re
import sys

from glob import glob

with open('output.csv', 'a') as combinedFile:
    combinedFile.write('Date;Request(in ms)\n') # Headers
    for eachFile in glob('*.csv'):
        if eachFile == 'C:/x/x/x/x/*.csv':
            pass
        else:
            count = 0
            for line in open(eachFile, 'r'):
                if count != 0:
                    combinedFile.write(line)
                count = 1

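I'm looking for a global solution, because unfortunately a structured (column-based) solution doesn't work. The request string is sometimes in column 2 and sometimes in column 3.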

I can't show the dataset here because it is a log file and contains personal information.

Thanks in advance for your help!

1 Answer:

Answer 0: (score: 0)

The you_data.csv file:

07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016  23:20:37    Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)

Code:

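# Assumes the whitespace-separated token layout of the sample lines above.
with open('you_data.csv') as f:
    for line in f:  # iterate lazily instead of loading every line with readlines()
        split_line = line.split()
        if len(split_line) < 6:
            continue  # skip blank or truncated lines
        Date = split_line[0]      # first token: the date
        Request = split_line[5]   # sixth token: the duration in ms
        print(Date, Request)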

Output:

07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
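
Since the question says the request text can sit in a different column from line to line, a position-independent variant may be more robust than indexing into the split line. The following is only a sketch: it assumes every relevant line contains the wording "Request completed in N ms" seen in the sample, and the names DATE_RE, DURATION_RE, extract and combine are made up for illustration. It reuses the date pattern from the question, and it writes rows as "date;duration" without the trailing semicolon shown in the desired output.

```python
import csv
import os
import re
from glob import glob

# Date in DD.MM.YYYY form (the pattern from the question, with non-capturing groups).
DATE_RE = re.compile(r"\b(?:0[1-9]|[12][0-9]|3[01])\.(?:0[1-9]|1[012])\.(?:19|20)\d\d\b")
# Assumption: the duration always appears as "Request completed in <digits> ms".
DURATION_RE = re.compile(r"Request completed in (\d+) ms")


def extract(line):
    """Return (date, duration_ms) for a log line, or None if either part is missing."""
    date = DATE_RE.search(line)
    duration = DURATION_RE.search(line)
    if date is None or duration is None:
        return None
    return date.group(0), duration.group(1)


def combine(pattern, out_path):
    """Scan every file matching `pattern` and write the extracted pairs to `out_path`."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter=";")
        writer.writerow(["Date", "Request(in ms)"])
        for path in glob(pattern):
            # Do not read the output file back in if it also matches the pattern.
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue
            with open(path, "r") as src:
                for line in src:  # streams line by line, so large files stay cheap
                    row = extract(line)
                    if row is not None:
                        writer.writerow(row)
```

It could be called as combine('*.csv', 'output.csv'). Because the patterns search the whole line, it does not matter which column the request text landed in; lines that lack a date or a duration are skipped.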