Question

Starting with something like this:

import numpy as np
from pandas import DataFrame
time = np.array(('2015-08-01T00:00:00','2015-08-01T12:00:00'),dtype='datetime64[ns]')
heat_index = np.array([101,103])
air_temperature = np.array([96,95])

df = DataFrame({'heat_index':heat_index,'air_temperature':air_temperature},index=time)

which produces this for df:

                     air_temperature    heat_index
2015-08-01 07:00:00  96                 101
2015-08-01 19:00:00  95                 103

Then resample to daily:

df_daily = df.resample('24H',how='max')

which gives this for df_daily:

            air_temperature     heat_index
2015-08-01  96                  103

So, when resampling every 24 hours with how='max', pandas takes the maximum value within that period from each column independently.
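The `how=` keyword belongs to the pandas version used in the question and is no longer accepted by recent releases. As a minimal sketch, assuming a recent pandas and using the timestamps shown in the printed df, the same per-column reduction can be written by calling `.max()` on the resampler. The `'24h'` spelling is an assumption about the installed version; older releases write it `'24H'`.

```python
import numpy as np
import pandas as pd

# Timestamps follow the printed df in the question (07:00 and 19:00).
time = np.array(['2015-08-01T07:00:00', '2015-08-01T19:00:00'],
                dtype='datetime64[ns]')
df = pd.DataFrame(
    {'heat_index': [101, 103], 'air_temperature': [96, 95]},
    index=time,
)

# Each column is reduced on its own, so the two maxima in the daily row
# may come from different original rows.
df_daily = df.resample('24h').max()
```

Here the daily row combines air_temperature 96 (from 07:00) with heat_index 103 (from 19:00), which is exactly the loss of association described in the question.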

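But as you can see from the df output, the highest heat index of the day (which occurs at 19:00:00) does not coincide with the highest air temperature. That is, a heat index of 103°F occurred when the air temperature was 95°F. This association is lost through resampling, and we end up with an air temperature from a different part of the day.

Is there a way to resample just one column and keep the value of another column at the same index? So that the final result would look like this:

            air_temperature     heat_index
2015-08-01  95                  103

My first guess is to resample only the heat_index column...

df_daily = df.resample('24H',how={'heat_index':'max'})

to get...

            heat_index
2015-08-01  103

...and then try some kind of DataFrame.loc or DataFrame.ix from there, but neither has worked. Any thoughts on how to find the related value after resampling (e.g., find the air_temperature that occurred at the same time as what is later found to be the maximum heat_index)?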

Answer 1

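Here is one way: .groupby(TimeGrouper()) is essentially what resample is doing, and the aggregation function then filters each group down to the row with the maximum heat_index.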

In [60]: (df.groupby(pd.TimeGrouper('24H'))
            .agg(lambda df: df.loc[df['heat_index'].idxmax(), :]))

Out[60]: 
            air_temperature  heat_index
2015-08-01               95         103
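`pd.TimeGrouper` has since been removed from pandas in favor of `pd.Grouper(freq=...)`. The following is a sketch of the same idea for a recent pandas, not the answer's original code: it finds the timestamp of the maximum heat_index per 24-hour bin with `idxmax` and then selects those whole rows. It assumes every bin contains at least one row, and the `'24h'` spelling again depends on the installed version.

```python
import numpy as np
import pandas as pd

# Timestamps follow the printed df in the question (07:00 and 19:00).
time = np.array(['2015-08-01T07:00:00', '2015-08-01T19:00:00'],
                dtype='datetime64[ns]')
df = pd.DataFrame(
    {'heat_index': [101, 103], 'air_temperature': [96, 95]},
    index=time,
)

# For each 24-hour bin, the index label of the row with the largest heat_index.
idx = df.groupby(pd.Grouper(freq='24h'))['heat_index'].idxmax()

# Select those complete rows, so air_temperature stays paired with its heat_index.
df_daily = df.loc[idx]

# Relabel each row with the start of its bin instead of the time of the maximum.
df_daily.index = idx.index
```

The resulting single row pairs heat_index 103 with air_temperature 95. Skipping the final relabeling step keeps the 19:00 timestamp, which can be useful when the time of the daily maximum matters.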

Resampling in pandas while maintaining value association
