How to calculate the per-minute mean of a continuous dataset from a CSV file?

Time: 2016-04-10 12:36:34

Tags: python datetime pandas group-by mean

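I'm new to Python and this is my first question, so please excuse any mistakes.

I have a large CSV file of continuous measurements (taken roughly every second, but at irregular intervals). I need the mean for every minute. I found that groupby might help me do this, but I'm stuck on setting the DATE_TIME column as the index with dtype 'datetime'. The CSV file looks like this: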

  

,DATE_TIME,N2O_dry
  0,2016-03-01 02:32:02.651,0.70714453962
  1,2016-03-01 02:32:03.762,0.7071444254000001
  2,2016-03-01 02:32:05.257,0.70373171894
  3,2016-03-01 02:32:05.953,0.70083729096
  4,2016-03-01 02:32:07.049,0.69760065648
  5,2016-03-01 02:32:07.928,0.6954438788699999
  6,2016-03-01 02:32:08.726,0.6874527606899999
  7,2016-03-01 02:32:10.005,0.6724201105500001
  8,2016-03-01 02:32:10.851,0.6607286568199999
  .
  .
  .
  104503,2016-03-02 08:21:18.421,0.26879397415
  104504,2016-03-02 08:21:19.532,0.26884030311
  104505,2016-03-02 08:21:20.359,0.26887979686

So far I have only managed to read the file into a DataFrame, set the DATE_TIME column as the index, and try to convert the DATE_TIME column to dtype 'datetime64[ns]':

import pandas

df=pandas.read_csv(file,usecols=[1,'N2O_dry'])
df=df.set_index('DATE_TIME')
df=pandas.to_datetime(df.index)

However, now I seem to be left with only the DATE_TIME column. Could someone please help me?


2 Answers:

Answer 0 (score: 0)

I think you can add the parameters parse_dates and index_col to read_csv, then call resample followed by mean (this works with pandas 0.18.0):

import pandas as pd
import io

temp=u""",DATE_TIME,N2O_dry
0,2016-03-01 02:32:02.651,0.70714453962
1,2016-03-01 02:32:03.762,0.7071444254000001
2,2016-03-01 02:32:05.257,0.70373171894
3,2016-03-01 02:32:05.953,0.70083729096
4,2016-03-01 02:32:07.049,0.69760065648
5,2016-03-01 02:32:07.928,0.6954438788699999
6,2016-03-01 02:32:08.726,0.6874527606899999
7,2016-03-01 02:32:10.005,0.6724201105500001
8,2016-03-01 02:32:10.851,0.6607286568199999"""
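# After testing, replace io.StringIO(temp) with the CSV filename.
# usecols must be all names or all positions, so both columns are given by name.
df = pd.read_csv(io.StringIO(temp),
                 usecols=['DATE_TIME', 'N2O_dry'],
                 parse_dates=['DATE_TIME'],
                 index_col=['DATE_TIME'])
print(df)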
                          N2O_dry
DATE_TIME                        
2016-03-01 02:32:02.651  0.707145
2016-03-01 02:32:03.762  0.707144
2016-03-01 02:32:05.257  0.703732
2016-03-01 02:32:05.953  0.700837
2016-03-01 02:32:07.049  0.697601
2016-03-01 02:32:07.928  0.695444
2016-03-01 02:32:08.726  0.687453
2016-03-01 02:32:10.005  0.672420
2016-03-01 02:32:10.851  0.660729

print(df.resample('1Min').mean())
                     N2O_dry
DATE_TIME                   
2016-03-01 02:32:00   0.6925
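
Since the question mentions groupby: as far as I know, the same per-minute binning can also be written with groupby and pd.Grouper. Below is a minimal sketch, not a definitive implementation. The four sample rows are made up for illustration (they are not from the asker's file) and are chosen so that they span two different minutes; the names csv_text, by_resample and by_groupby are also just placeholders.

```python
import io

import pandas as pd

# Hypothetical sample spanning two minutes (values invented for illustration).
csv_text = u""",DATE_TIME,N2O_dry
0,2016-03-01 02:32:58.100,0.70
1,2016-03-01 02:32:59.300,0.72
2,2016-03-01 02:33:00.200,0.60
3,2016-03-01 02:33:01.900,0.64
"""

df = pd.read_csv(io.StringIO(csv_text),
                 usecols=["DATE_TIME", "N2O_dry"],
                 parse_dates=["DATE_TIME"],
                 index_col="DATE_TIME")

# Bin the rows into 1-minute buckets, once with resample and once with groupby.
by_resample = df.resample("1min").mean()
by_groupby = df.groupby(pd.Grouper(freq="1min")).mean()

print(by_resample)
print(by_groupby)
```

Both calls should give one row per minute (02:32 and 02:33 here). With real data, a minute that contains no measurements at all shows up as NaN in the result, so you may want to call dropna() afterwards if such gaps exist.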

Answer 1 (score: 0)

If I understand correctly, use

df.index = pd.to_datetime(df.index)

instead of

df = pd.to_datetime(df.index)

This should avoid the problem of being left with only the DATE_TIME column. You then get (in IPython):

In [27]: df.index
Out[27]: 
DatetimeIndex(['2016-03-01 02:32:02.651000', '2016-03-01 02:32:03.762000',
               '2016-03-01 02:32:05.257000', '2016-03-01 02:32:05.953000',
               '2016-03-01 02:32:07.049000', '2016-03-01 02:32:07.928000',
               '2016-03-01 02:32:08.726000', '2016-03-01 02:32:10.005000',
               '2016-03-01 02:32:10.851000'],
              dtype='datetime64[ns]', name=u'DATE_TIME', freq=None)

while df still holds the data:

In [26]: df
Out[26]: 
                          N2O_dry
DATE_TIME                        
2016-03-01 02:32:02.651  0.707145
2016-03-01 02:32:03.762  0.707144
2016-03-01 02:32:05.257  0.703732
2016-03-01 02:32:05.953  0.700837
2016-03-01 02:32:07.049  0.697601
2016-03-01 02:32:07.928  0.695444
2016-03-01 02:32:08.726  0.687453
2016-03-01 02:32:10.005  0.672420
2016-03-01 02:32:10.851  0.660729
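
To make the difference between the two assignments concrete, here is a small sketch using the first three rows from the question. The variable names csv_text and converted are only placeholders, and the final resample step assumes per-minute means are what is wanted.

```python
import io

import pandas as pd

# First three rows of the sample data from the question.
csv_text = u""",DATE_TIME,N2O_dry
0,2016-03-01 02:32:02.651,0.70714453962
1,2016-03-01 02:32:03.762,0.7071444254000001
2,2016-03-01 02:32:05.257,0.70373171894
"""

df = pd.read_csv(io.StringIO(csv_text), usecols=["DATE_TIME", "N2O_dry"])
df = df.set_index("DATE_TIME")

# pd.to_datetime on an Index returns a new DatetimeIndex, not a DataFrame.
# Binding that result to the name `df` would discard the N2O_dry column.
converted = pd.to_datetime(df.index)
print(type(converted).__name__)

# Assigning to df.index replaces only the index and keeps the data column.
df.index = converted
print(df.columns.tolist())

# With a DatetimeIndex in place, per-minute means are available via resample.
print(df.resample("1min").mean())
```

The first print shows that the converted object is a DatetimeIndex, which is why writing df = pd.to_datetime(df.index) leaves you with only the timestamps.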