Average of groupby elements in Python

Time: 2014-05-14 14:17:46

Tags: python numpy statistics

So I have this list:

58308.803701    132.227.127.170 50602   149.13.32.15      443   6   64
58308.815456    149.13.32.15    443     132.227.127.170   50602 6   60
58308.815524    132.227.127.170 50602   149.13.32.15      443   6   52
58308.817244    132.227.127.170 50602   149.13.32.15      443   6   57
58308.828987    149.13.32.15    443     132.227.127.170   50602 6   52
58308.829133    149.13.32.15    443     132.227.127.170   50602 6   57
58308.829169    132.227.127.170 50602   149.13.32.15      443   6   52
58308.912361    132.227.127.170 50603   86.4.136.93       443   6   64
58308.912497    132.227.127.170 50599   94.31.112.216     443   6   95
58308.912568    132.227.127.170 50599   94.31.112.216     443   6   96
58308.912977    132.227.127.170 50599   94.31.112.216     443   6   847
58308.913411    132.227.127.170 50599   94.31.112.216     443   6   154
58308.913484    132.227.127.170 50599   94.31.112.216     443   6   233
....
....
....

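I want to group consecutive similar lines (those sharing the same five middle columns) and, for each group, output the minimum of the first column together with the mean, median, minimum, maximum, ... (every available statistic) of the last column, like this: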

58308.803701                            132.227.127.170 50602   149.13.32.15      443   6   64
58308.815456                            149.13.32.15    443     132.227.127.170   50602 6   60
min of(58308.815524,58308.817244)       132.227.127.170 50602   149.13.32.15      443   6   min/max/avg/...of(52,57)
min of(58308.828987,58308.829133)       149.13.32.15    443     132.227.127.170   50602 6   min/max/avg/...of(52,57)
58308.829169                            132.227.127.170 50602   149.13.32.15      443   6   52
58308.912361                            132.227.127.170 50603   86.4.136.93       443   6   64
min of(58308.912497,..,58308.913484)    132.227.127.170 50599   94.31.112.216     443   6   min/max/avg/...of(95,96,847,154,233)
....
....
....

Here is the code I have written so far while trying to make this work:

from itertools import groupby 
import re 
import numpy as np

tstFile=open("output","w+") 
with open('dataInput','r') as d:
      f1 = ([x for x in line.split()] for line in d)
      for a,b in groupby(f1,key=lambda x:x[1:6]):
          tstFile.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\n" %(min(x[0] for x in b)),min(x[6] for x in b)),max(x[6] for x in b)),np.average(x[6] for x in b)),np.mean(x[6] for x in b)),np.median(x[6] for x in b)),np.std(x[6] for x in b)))
tstFile.close()

But nothing seems to work. It only works for min and max, and even then I can only compute a single value per group, like this:

tstFile=open("output","w+")
with open('dataInput','r') as d:
    f1 = ([x for x in line.split()] for line in d)
    for a,b in groupby(f1,key=lambda x:x[1:6]):
        tstFile.write("%s\n" %(min(x[6] for x in b)))
tstFile.close()
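
A likely reason only one statistic can be computed per group: each group yielded by `groupby` is a one-shot iterator, so the first generator expression that reads `b` leaves nothing for the later ones. The fields are also still strings at that point, so `min` and `max` compare them lexicographically and the NumPy functions receive a generator instead of a sequence of numbers. The sketch below only illustrates the exhaustion; `two_passes` and the two sample rows are names introduced here for the demonstration.

```python
from itertools import groupby

# Two sample rows taken from the listing above, already split into fields.
rows = [
    ["58308.815524", "132.227.127.170", "50602", "149.13.32.15", "443", "6", "52"],
    ["58308.817244", "132.227.127.170", "50602", "149.13.32.15", "443", "6", "57"],
]

def two_passes(rows):
    """Read column 6 of every group twice to show the second pass is empty."""
    results = []
    for key, group in groupby(rows, key=lambda r: r[1:6]):
        first = [r[6] for r in group]   # consumes the group iterator
        second = [r[6] for r in group]  # the iterator is already exhausted
        results.append((first, second))
    return results

print(two_passes(rows))
```

Calling `list(b)` once at the top of the loop body and converting the needed columns to `float` avoids both problems.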

Any help, please!

1 Answer:

Answer 0 (score: 0)

When working with CSV files, it is generally recommended to use the csv module. The sample code below shows one way to solve this problem.

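If your input file is tab-delimited, change to delimiter='\t' and remove skipinitialspace=True from the csv.reader call. There are no tabs in the sample input, but they may have been lost during copy/paste.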

import csv
from itertools import groupby
import numpy as np

# Python 3: the csv module expects text-mode files opened with newline=''
with open('data.csv') as in_file, open('out.csv', 'w', newline='') as out_file:
    reader = csv.reader(in_file, delimiter=' ', skipinitialspace=True)
    writer = csv.writer(out_file, delimiter='\t')
    # groupby only merges consecutive rows whose columns 1-5 are identical
    for key, group in groupby(reader, key=lambda r: r[1:6]):
        # materialize the group once, then take columns 0 and 6 as floats
        col0, col6 = np.array(list(group))[:, [0, 6]].transpose().astype(float)
        writer.writerow([min(col0)] + key + [int(min(col6)), int(max(col6)),
                                             np.mean(col6)])

Output (I added some tabs to improve readability):

58308.803701    132.227.127.170 50602   149.13.32.15    443     6   64  64  64.0
58308.815456    149.13.32.15    443     132.227.127.170 50602   6   60  60  60.0
58308.815524    132.227.127.170 50602   149.13.32.15    443     6   52  57  54.5
58308.828987    149.13.32.15    443     132.227.127.170 50602   6   52  57  54.5
58308.829169    132.227.127.170 50602   149.13.32.15    443     6   52  52  52.0
58308.912361    132.227.127.170 50603   86.4.136.93     443     6   64  64  64.0
58308.912497    132.227.127.170 50599   94.31.112.216   443     6   95  847 285.0
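
The question also asks for the median and standard deviation, which the output above does not include. The following is a minimal sketch of one way to add them using only the standard library, assuming every input line has exactly seven whitespace-separated fields. `summarize` and `sample` are names introduced here for illustration, and `statistics.pstdev` (population standard deviation) is chosen because it matches the default behaviour of `np.std`.

```python
import statistics
from itertools import groupby

def summarize(lines):
    """Yield one summary row per run of consecutive lines sharing columns 1-5."""
    rows = (line.split() for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda r: r[1:6]):
        group = list(group)  # materialize once so it can be read several times
        times = [float(r[0]) for r in group]
        sizes = [float(r[6]) for r in group]
        yield [min(times)] + key + [
            min(sizes),
            max(sizes),
            statistics.mean(sizes),
            statistics.median(sizes),
            statistics.pstdev(sizes),  # population std, like np.std's default
        ]

# Three lines from the question's input, used here as a small test sample.
sample = """\
58308.815524    132.227.127.170 50602   149.13.32.15      443   6   52
58308.817244    132.227.127.170 50602   149.13.32.15      443   6   57
58308.828987    149.13.32.15    443     132.227.127.170   50602 6   52
"""

for row in summarize(sample.splitlines()):
    print("\t".join(str(value) for value in row))
```

Because `summarize` accepts any iterable of lines, an open file object can be passed in directly instead of `sample.splitlines()`.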