So I have this list:
58308.803701 132.227.127.170 50602 149.13.32.15 443 6 64
58308.815456 149.13.32.15 443 132.227.127.170 50602 6 60
58308.815524 132.227.127.170 50602 149.13.32.15 443 6 52
58308.817244 132.227.127.170 50602 149.13.32.15 443 6 57
58308.828987 149.13.32.15 443 132.227.127.170 50602 6 52
58308.829133 149.13.32.15 443 132.227.127.170 50602 6 57
58308.829169 132.227.127.170 50602 149.13.32.15 443 6 52
58308.912361 132.227.127.170 50603 86.4.136.93 443 6 64
58308.912497 132.227.127.170 50599 94.31.112.216 443 6 95
58308.912568 132.227.127.170 50599 94.31.112.216 443 6 96
58308.912977 132.227.127.170 50599 94.31.112.216 443 6 847
58308.913411 132.227.127.170 50599 94.31.112.216 443 6 154
58308.913484 132.227.127.170 50599 94.31.112.216 443 6 233
....
....
....
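I want to group each run of similar lines (lines whose five middle columns are identical) and, in the output, show the minimum of the first column together with the statistics of the last column (mean, median, min, max, ... every statistical metric available), as shown below: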
58308.803701 132.227.127.170 50602 149.13.32.15 443 6 64
58308.815456 149.13.32.15 443 132.227.127.170 50602 6 60
min of(58308.815524,58308.817244) 132.227.127.170 50602 149.13.32.15 443 6 min/max/avg/...of(52,57)
min of(58308.828987,58308.829133) 149.13.32.15 443 132.227.127.170 50602 6 min/max/avg/...of(52,57)
58308.829169 132.227.127.170 50602 149.13.32.15 443 6 52
58308.912361 132.227.127.170 50603 86.4.136.93 443 6 64
min of(58308.912497,..,58308.913484) 132.227.127.170 50599 94.31.112.216 443 6 min/max/avg/...of(95,96,847,154,233)
....
....
....
So this is the code I have written so far and tried to get working:
from itertools import groupby
import re
import numpy as np
tstFile=open("output","w+")
with open('dataInput','r') as d:
    f1 = ([x for x in line.split()] for line in d)
    for a,b in groupby(f1,key=lambda x:x[1:6]):
        tstFile.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\n" %(min(x[0] for x in b)),min(x[6] for x in b)),max(x[6] for x in b)),np.average(x[6] for x in b)),np.mean(x[6] for x in b)),np.median(x[6] for x in b)),np.std(x[6] for x in b)))
tstFile.close()
But nothing seems to work. It only works for min and max, and even then I can only compute one value per group, like this:
tstFile=open("output","w+")
with open('dataInput','r') as d:
    f1 = ([x for x in line.split()] for line in d)
    for a,b in groupby(f1,key=lambda x:x[1:6]):
        tstFile.write("%s\n" %(min(x[6] for x in b)))
tstFile.close()
Any help, please!
Answer 0 (score: 0)
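When working with csv files, it is generally recommended to use the csv module. The sample code below demonstrates one way to solve this problem.
If your input file is tab-delimited, change the csv.reader call to use delimiter='\t' and remove skipinitialspace=True. The sample input contains no tabs, but they may have been lost during copy/paste.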
import csv
from itertools import groupby
import numpy as np

# newline='' lets the csv module control line endings (Python 3).
with open('data.csv') as in_file, open('out.csv', 'w', newline='') as out_file:
    reader = csv.reader(in_file, delimiter=' ', skipinitialspace=True)
    writer = csv.writer(out_file, delimiter='\t')
    # groupby only merges consecutive rows that share columns 1-5.
    for key, group in groupby(reader, key=lambda r: r[1:6]):
        # Materialize the group once, then pull out columns 0 and 6 as floats.
        col0, col6 = np.array(list(group))[:, [0, 6]].transpose().astype(float)
        writer.writerow([min(col0)] + key + [int(min(col6)), int(max(col6)),
                                             np.mean(col6)])
Output (I added some tabs to improve readability):
58308.803701 132.227.127.170 50602 149.13.32.15 443 6 64 64 64.0
58308.815456 149.13.32.15 443 132.227.127.170 50602 6 60 60 60.0
58308.815524 132.227.127.170 50602 149.13.32.15 443 6 52 57 54.5
58308.828987 149.13.32.15 443 132.227.127.170 50602 6 52 57 54.5
58308.829169 132.227.127.170 50602 149.13.32.15 443 6 52 52 52.0
58308.912361 132.227.127.170 50603 86.4.136.93 443 6 64 64 64.0
58308.912497 132.227.127.170 50599 94.31.112.216 443 6 95 847 285.0
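The question also asked for the median and the standard deviation, and its attempt most likely stalled because each group produced by groupby is a one-shot iterator: the first min(...) consumes it, leaving nothing for the later calls. The following is only a minimal sketch of one way to get every statistic in a single pass. It swaps NumPy for the standard-library statistics module, and the names summarize and sample are hypothetical, introduced here purely for illustration.

```python
import statistics
from itertools import groupby


def summarize(lines):
    """Group consecutive rows sharing columns 1-5; summarize columns 0 and 6."""
    rows = (line.split() for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda r: r[1:6]):
        # groupby yields a one-shot iterator, so store the rows before
        # computing more than one statistic from them.
        group = list(group)
        times = [float(r[0]) for r in group]
        sizes = [float(r[6]) for r in group]
        yield {
            "first_time": min(times),
            "key": key,
            "count": len(sizes),
            "min": min(sizes),
            "max": max(sizes),
            "mean": statistics.mean(sizes),
            "median": statistics.median(sizes),
            # Population standard deviation, matching np.std's default (ddof=0).
            "std": statistics.pstdev(sizes),
        }


# Hypothetical sample: three rows taken from the question's input.
sample = """\
58308.815524 132.227.127.170 50602 149.13.32.15 443 6 52
58308.817244 132.227.127.170 50602 149.13.32.15 443 6 57
58308.828987 149.13.32.15 443 132.227.127.170 50602 6 52
""".splitlines()

results = list(summarize(sample))
for row in results:
    print(row["first_time"], *row["key"], row["min"], row["max"],
          row["mean"], row["median"], row["std"], sep="\t")
```

To run it against a real file, pass the open file object instead of sample, for example summarize(open('dataInput')). As in the answer above, only consecutive matching rows are merged, which is the behaviour shown in the question's desired output.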