Aggregating resampled data in Pandas

Asked: 2015-05-07 01:41:30

Tags: python numpy pandas

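I'm trying to relearn Python and use it to help filter and organize data. I'm new to pandas and have run into the following problem. I have a sensor that measures the diameter and velocity of falling objects. The data is saved to a csv file in the following format: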

Date,Time,Diameter,Velocity,BaseV

21-Sep-2013,13:51:04,0.422705,0.850142,4.880371
21-Sep-2013,14:01:37,0.505481,1.499196,4.877930
21-Sep-2013,14:18:50,0.391306,1.795166,4.880371
21-Sep-2013,14:18:50,0.407307,1.149977,4.880371
21-Sep-2013,14:18:50,0.399387,2.098552,4.880371

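Date and Time are when the object fell, Diameter and Velocity are the values measured for that object, and BaseV (the base voltage) is a value we use for calibration. The instrument measures at sub-second resolution, and I'm using pandas to resample the data into 5-minute intervals instead of using modulo division on the time values. After looking through the pandas cookbook, I came up with the following code: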

# Python script to open eachDrop.dat and read values into pandas.dataframe 
import math as m
import numpy as np 
import pandas as pd 
#---------------------------------------------------------------------------       
#read csv values into panda data frame
dropData=pd.read_csv('resEachDrop[RD130921.dat].txt',sep=',',header=0,index_col=0,parse_dates=[[0,1]],encoding=None,tupleize_cols=False,           infer_datetime_format=True)
#---------------------------------------------------------------------------
#resample time series to 5min intervals for Count, Mean, Min and Max
#mmmsc/s is group of np functions to apply to dropData diameter column to return aggregated columns 

mmmsc={'Mean':np.mean, 'Max':np.max, 'Min':np.min, 'Sum':np.sum,'Count':'count'}
mmms={'Mean':np.mean, 'Max':np.max, 'Min':np.min, 'Sum':np.sum}
#resample dropData at 5min increment on Diameter column using mhc
newData=dropData.resample('5Min', how={'Diameter':mmmsc,'Velocity':mmms})
print newData
#--------------------------------------------------------------------------

The output in the terminal window looks like this (I removed some rows to save space):

                     Diameter  Velocity     BaseV
Date_Time
2013-09-21 13:51:04  0.422705  0.850142  4.880371
2013-09-21 14:01:37  0.505481  1.499196  4.877930
2013-09-21 14:18:50  0.391306  1.795166  4.880371
2013-09-21 14:18:50  0.407307  1.149977  4.880371
...                       ...       ...       ...
2013-09-21 23:59:54  0.470808  0.719216  4.216309
2013-09-21 23:59:54  0.529965  1.748123  4.216309
2013-09-21 23:59:55  0.563966  1.466564  4.213867
2013-09-21 23:59:55  0.563966  1.515517  4.213867

[53740 rows x 3 columns]
                     Diameter
                        Count       Max          Sum       Min      Mean
Date_Time
2013-09-21 13:50:00         1  0.422705     0.422705  0.422705  0.422705
2013-09-21 13:55:00         0       NaN          NaN       NaN       NaN
2013-09-21 14:00:00         1  0.505481     0.505481  0.505481  0.505481
2013-09-21 14:05:00         0       NaN          NaN       NaN       NaN
2013-09-21 14:10:00         0       NaN          NaN       NaN       NaN
2013-09-21 14:15:00         3  0.407307     1.198000  0.391306  0.399333
...                       ...       ...          ...       ...       ...
2013-09-21 21:30:00      1068  3.614623   594.918064  0.385087  0.557039
2013-09-21 21:35:00       247  4.363684   136.175383  0.384975  0.551317
2013-09-21 21:40:00       176  1.284766    92.519502  0.393808  0.525679 
2013-09-21 21:45:00       147  1.642836    79.037770  0.385874  0.537672
                     Velocity
                          Max          Sum       Min      Mean
Date_Time
2013-09-21 13:50:00  0.850142     0.850142  0.850142  0.850142
2013-09-21 13:55:00       NaN          NaN       NaN       NaN
2013-09-21 14:00:00  1.499196     1.499196  1.499196  1.499196
2013-09-21 14:05:00       NaN          NaN       NaN       NaN
2013-09-21 14:10:00       NaN          NaN       NaN       NaN
2013-09-21 14:15:00  2.098552     5.043695  1.149977  1.681232
...                       ...          ...       ...       ...
2013-09-21 21:30:00  3.040620  1589.967392  0.433960  1.488734
2013-09-21 21:35:00  3.215267   376.540780  0.425394  1.524457
2013-09-21 21:40:00  2.362207   272.548852  0.529707  1.548573
2013-09-21 21:45:00  2.285334   228.478854  0.503430  1.554278

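When I compare the Sum values for Diameter with the sums calculated by the program that processes the data, there is a huge discrepancy. After searching the forums, I think this is because numpy.sum is summing over rows instead of columns, similar to this question: numpy.sum behaves differently on numpy.array vs pandas.DataFrame. I tried changing 'Sum':np.sum to use axis=0, similar to the solution in that thread, but I get the following error: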

Traceback (most recent call last):
  File "dropRead.py", line 12, in <module>
mmmsc={'Mean':np.mean, 'Max':np.max, 'Min':np.min, 'Sum':np.sum(axis=0),
'Count':'count'}
TypeError: sum() takes at least 1 argument (1 given)
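
A note on the traceback: the TypeError is raised while the dictionary itself is being built, before pandas is involved. Writing np.sum(axis=0) calls np.sum immediately with no array to sum, rather than producing a function that pandas can call later. One way to get a callable with axis already set is functools.partial, as in this minimal sketch (the array `values` and the name `column_sum` are made up for illustration and are not from the original script):

```python
import functools

import numpy as np

# Hypothetical 2x2 array, used only to show the behaviour.
values = np.array([[1.0, 2.0], [3.0, 4.0]])

# np.sum(axis=0) runs np.sum right away without an array, so it raises TypeError.
try:
    np.sum(axis=0)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# functools.partial returns a new callable with axis=0 already filled in.
column_sum = functools.partial(np.sum, axis=0)
result = column_sum(values)  # sums down each column: array([4., 6.])
```

A lambda such as `lambda x: np.sum(x, axis=0)` would serve the same purpose.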

Can anyone shed some light on what I can do to get the columns to sum correctly?

Thanks,

Sean

1 Answer:

Answer 0: (score: 0)

Using EdChum's suggestion solved my problem:

pd.Series.sum

instead of:

np.sum
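
For readers on a newer pandas release, where resample no longer accepts the how= argument, the same aggregation can be sketched with .resample(...).agg(...). The example below is only an illustration under a few assumptions: it inlines the five sample rows from the question instead of reading the original file, builds the Date_Time index with pd.to_datetime, and the names `csv_text`, `drop_data` and `stats` are made up here.

```python
import io

import pandas as pd

# The five sample rows from the question, inlined so the sketch is self-contained.
csv_text = """Date,Time,Diameter,Velocity,BaseV
21-Sep-2013,13:51:04,0.422705,0.850142,4.880371
21-Sep-2013,14:01:37,0.505481,1.499196,4.877930
21-Sep-2013,14:18:50,0.391306,1.795166,4.880371
21-Sep-2013,14:18:50,0.407307,1.149977,4.880371
21-Sep-2013,14:18:50,0.399387,2.098552,4.880371
"""

drop_data = pd.read_csv(io.StringIO(csv_text))

# Combine the Date and Time columns into a single DatetimeIndex.
drop_data.index = pd.to_datetime(
    drop_data["Date"] + " " + drop_data["Time"], format="%d-%b-%Y %H:%M:%S"
)
drop_data.index.name = "Date_Time"

# Resample into 5-minute bins and aggregate each column separately;
# pd.Series.sum is passed as the sum function, as the answer suggests.
stats = drop_data.resample("5min").agg(
    {
        "Diameter": ["count", "mean", "max", "min", pd.Series.sum],
        "Velocity": ["mean", "max", "min", pd.Series.sum],
    }
)
print(stats)
```

On this sample, the 14:15 bin holds three drops whose diameters sum to 1.198000, which matches the corresponding row of the output shown in the question.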