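I have a dataset with more than a million rows. Unfortunately, the columns are not uniform, so the information I am looking for sometimes appears in different columns. Can I somehow filter it out and write only that to the output csv?
The dataset consists of strings like this: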
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
What I want to write to the output csv:
Date;Request(in ms)
07.11.2016;20;
07.11.2016;332;
07.11.2016;7292;
07.11.2016;3213;
07.11.2016;435;
Regex for the date:
(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[012])\.(19|20)\d\d
My regex for the request duration:
[\0-9]+[ s]
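A note on the duration pattern: in Python's `re`, `[\0-9]` appears to be read as a character range from NUL up to `9`, so it would also match spaces, dots and colons rather than digits only. Below is a minimal sketch of how both values could be pulled from the sample line regardless of column position. It assumes the duration is always worded as "completed in N ms"; the names `DATE_RE` and `DURATION_RE` are illustrative and not from the original post.

```python
import re

# Sample line copied from the question.
line = ("07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 "
        "action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)")

# Date pattern from the question, using non-capturing groups so that
# group(0) is the whole date.
DATE_RE = re.compile(r"\b(?:0[1-9]|[12][0-9]|3[01])\.(?:0[1-9]|1[012])\.(?:19|20)\d\d\b")

# Assumption: the duration always sits between "completed in" and "ms".
DURATION_RE = re.compile(r"completed in (\d+) ms")

date_match = DATE_RE.search(line)
duration_match = DURATION_RE.search(line)

# Fall back to None when a line carries no date or no duration.
date = date_match.group(0) if date_match else None
duration = int(duration_match.group(1)) if duration_match else None

print(date, duration)
```

Anchoring on the surrounding words instead of on a column index is what makes the match independent of where the request text ends up.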
My code:
import re
import sys
from glob import glob

with open('output.csv', 'a') as combinedFile:
    combinedFile.write('Date;Request(in ms)\n') # Headers
    for eachFile in glob('*.csv'):
        if eachFile == 'C:/x/x/x/x/*.csv':
            pass
        else:
            count = 0
            for line in open(eachFile, 'r'):
                if count != 0:
                    combinedFile.write(line)
                count = 1
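I am looking for a general solution, because unfortunately a solution based on fixed column positions does not work. The request string is sometimes in column 2 and sometimes in column 3.
I cannot show the dataset here because it is a log file and contains personal information.
Thanks in advance for your help!
Answer 0 (score: 0)
The you_data.csv file: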
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 action=GetContent&Reference=311.1.1.111&OutputEncoding=UTF8 (11.1.1.111)
Code:
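with open('you_data.csv') as f:
    # Iterate over the file directly so a million-row log is not loaded into memory at once.
    for line in f:
        split_line = line.split()  # split on whitespace
        Date = split_line[0]       # first field: the date
        Request = split_line[5]    # sixth field: the duration in ms for lines shaped like the sample
        print(Date, Request)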
Output:
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
07.11.2016 20
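The split-based code above relies on the duration always being the sixth whitespace-separated field, which the question says is not guaranteed. As a hedged alternative, here is a sketch that searches each line with a regex and streams the matches into a semicolon-separated file. It assumes every relevant line contains a dd.mm.yyyy date followed later by "completed in N ms"; `LINE_RE`, `extract_rows`, `combine` and the second sample line are hypothetical, made up for illustration.

```python
import csv
import os
import re
from glob import glob

# Assumption: a dd.mm.yyyy date appears before "completed in <n> ms",
# whichever column the request text ends up in.
LINE_RE = re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b.*?completed in (\d+) ms")


def extract_rows(lines):
    """Yield (date, duration_ms) for every line that matches LINE_RE."""
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            yield match.group(1), int(match.group(2))


def combine(pattern, out_path):
    """Stream every file matching pattern into one semicolon-separated CSV."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter=";")
        writer.writerow(["Date", "Request(in ms)"])
        for path in glob(pattern):
            # Skip the output file itself in case the pattern also matches it.
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue
            with open(path, "r", errors="replace") as src:
                writer.writerows(extract_rows(src))


# Illustrative lines: the first follows the question's sample, the second
# is invented to show the duration in a different position.
sample = [
    "07.11.2016 23:20:37 Request completed in 20 ms. Request from 11.1.1.111 (11.1.1.111)",
    "07.11.2016 23:20:38 extra-field Request completed in 332 ms. Request from 11.1.1.111",
    "a line without a duration",
]
print(list(extract_rows(sample)))
```

A call such as `combine("*.csv", "output.csv")` would then process all input files in the current directory. Unlike the desired output shown in the question, this sketch writes no trailing semicolon after the duration, and lines without a match are silently skipped.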