Running multiple groupby operations with different transform functions on a dataset

Time: 2018-02-24 13:13:40

Tags: python pandas time-series

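I have the following dataset, with a value read every 5 seconds. I need to perform two operations on it:

  1. Calculate the per-minute average from the dataset.
  2. Using those per-minute averages, calculate the change per hour (i.e. take the difference between successive minute values and sum them).

What is the best way to achieve this?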

    2018-02-10 17:25:49.074206,340
    2018-02-10 17:25:54.078155,340
    2018-02-10 17:25:59.081041,340
    2018-02-10 17:26:04.085504,340
    2018-02-10 17:26:09.089500,340
    2018-02-10 17:26:14.092926,340
    2018-02-10 17:26:19.097002,340
    2018-02-10 17:26:24.101067,340
    2018-02-10 17:26:29.104451,340
    2018-02-10 17:26:34.108283,340
    2018-02-10 17:26:39.112641,340
    2018-02-10 17:26:44.115325,340
    2018-02-10 17:26:49.120067,340
    2018-02-10 17:26:54.124166,340
    2018-02-10 17:26:59.127224,340
    

I have looked through various Stack Overflow posts and put together the following code. It is not optimal, and norm_by_data1 still raises an error:

    import pandas as pd
    from pandas import read_csv
    from pandas import datetime
    from matplotlib import pyplot
    
    def parser(x):
            return datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d %H:%M')
    
    def parser1(x):
            return datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d %H')
    
    def norm_by_data(x):
            return x.mean()
    
    prevrow = None
    total = None
    
    def norm_by_data1(x):
            for row in x:
               total += row - prevrow
               prevrow = row
    
    series = read_csv('water_data.txt', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
    #print(series.head())
    series.groupby(level=0).apply(norm_by_data).to_csv("tmp")
    
    series1 = read_csv('tmp', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser1)
    series1.groupby(level=0).apply(norm_by_data1)
    

1 answer:

Answer 0: (score: 0)

0) To get a datetime index from the .csv file, you can do the following:

df = pd.read_csv('water_data.txt', parse_dates=[0], index_col=0)

parse_dates=[0] parses the column at position 0 as dates, and index_col=0 sets column 0 as the DataFrame index.
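As a quick check of the call above, here is a minimal sketch that reads a few rows from the question out of an in-memory buffer instead of water_data.txt. The sample shown in the question appears to have no header row, so `header=None` and the column names `timestamp` and `values` are assumptions made for this example (`values` is chosen to match the column name used later in this answer).

```python
import io
import pandas as pd

# A few rows copied from the question, inlined so the sketch runs without the file.
raw = io.StringIO(
    "2018-02-10 17:25:49.074206,340\n"
    "2018-02-10 17:25:54.078155,340\n"
    "2018-02-10 17:26:04.085504,340\n"
)

# Assumption: no header row in the file, so column names are supplied here.
df = pd.read_csv(
    raw,
    header=None,
    names=["timestamp", "values"],
    parse_dates=[0],   # parse column 0 as datetimes
    index_col=0,       # and use it as the index
)

print(type(df.index).__name__)
print(df)
```

If your file does have a header row, drop `header=None` and `names=...` and use the column name from the file instead.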

1) With a datetime index in place, run the following code (if you don't have a datetime index, let me know and I'll show you how to set one):

df.resample('1Min').mean()
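To see what the per-minute resample produces, here is a small sketch on made-up data. The readings and the column name `values` are invented for illustration; only the `resample(...).mean()` call comes from the answer.

```python
import pandas as pd

# Hypothetical 5-second readings covering two full minutes (values are invented).
idx = pd.date_range("2018-02-10 17:25:00", periods=24, freq="5s")
df = pd.DataFrame({"values": [340] * 12 + [346] * 12}, index=idx)

# One row per minute, holding the mean of the twelve readings in that minute.
per_minute = df.resample("1min").mean()
print(per_minute)
```

Each output row is labelled with the start of its minute, so the two rows here are 17:25 and 17:26.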

2) You need a datetime index for this step as well.

# Mean for every minute
ndf = df.resample('1Min').mean()

# Difference between each minute's mean and the previous minute's mean
ndf['diff'] = ndf['values'].diff(periods=1) # You might need to chain .abs() here as well

# Sum of the differences within each hour
ndf['diff'].resample('1H').sum()
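The choice between signed and absolute differences matters: signed differences cancel out, so their hourly sum is a net change, while absolute differences add up to the total movement. A minimal sketch on invented per-minute means (the values are hypothetical, and the lowercase `"1h"` alias is used because newer pandas releases prefer it over `'1H'`):

```python
import pandas as pd

# Hypothetical per-minute means straddling an hour boundary (values are invented).
idx = pd.to_datetime([
    "2018-02-10 17:58", "2018-02-10 17:59",
    "2018-02-10 18:00", "2018-02-10 18:01",
])
ndf = pd.DataFrame({"values": [340.0, 343.0, 341.0, 345.0]}, index=idx)

# First row has no predecessor, so its diff is NaN; sum() skips NaN by default.
ndf["diff"] = ndf["values"].diff()

net_change = ndf["diff"].resample("1h").sum()             # signed: ups and downs cancel
total_movement = ndf["diff"].abs().resample("1h").sum()   # absolute: every move counts

print(net_change)
print(total_movement)
```

Note that the first difference in each hour is taken against the last minute of the previous hour, so it is attributed to the later hour.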

3) This should work for aggregating the negative and positive values with different functions:

# 'func' is a placeholder: replace it with a real method such as sum(), otherwise this raises an error
ndf['diff'].resample('1H').agg({'neg': [lambda x: x[x < 0].func()], 'pos': [lambda x: x[x > 0].sum()]})
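Depending on the pandas version, passing a dict of new names to `agg` on a single column may raise a `SpecificationError`. As a hedged alternative (not the answer author's method), the sketch below gets the same negative/positive split by clipping the series before resampling. The data is invented, and both sides use `sum()` here only for illustration.

```python
import pandas as pd

# Hypothetical differences, two per hour (values are invented).
idx = pd.date_range("2018-02-10 17:00", periods=4, freq="30min")
diff = pd.Series([3.0, -2.0, 4.0, -1.0], index=idx, name="diff")

summary = pd.DataFrame({
    # clip(upper=0) turns positive values into 0, leaving only the negative part
    "neg": diff.clip(upper=0).resample("1h").sum(),
    # clip(lower=0) turns negative values into 0, leaving only the positive part
    "pos": diff.clip(lower=0).resample("1h").sum(),
})
print(summary)
```

Clipping only works this way for sum-like aggregations; for something like a mean of the negative values, filter first (for example `diff[diff < 0]`) and resample the filtered series.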