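I have been working with pandas on a public dataset that contains air quality statistics for U.S. states.
I am aggregating the measurements for each state, and the problem I have run into is that different states have measurements available for different time periods. I am collecting all the data as follows: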
import pandas as pd

# Parse the date column so it can be grouped by calendar month later
poll = pd.read_csv('dataset.csv', parse_dates=['Date Local'])
# numeric_only=True skips any non-numeric columns when averaging
poll = poll.groupby(['State', 'Date Local']).mean(numeric_only=True)
states = poll.index.levels[0]  # All the states
poll_grouped = poll.groupby(level='State')
# Iterate through each state and aggregate monthly
# ('M' means month end; pandas 2.2+ spells it 'ME')
for s in states:
    flt = (poll_grouped.get_group(s)
           .groupby(pd.Grouper(level='Date Local', freq='M'))
           .agg({'V1': 'mean', 'V2': 'mean', 'V3': 'mean', 'V4': 'mean'}))
    print(s, flt.shape, flt.index.min(), flt.index.max(), type(flt))
This prints the following:
Alabama (30, 4) 2013-12-31 00:00:00 2016-05-31 00:00:00 <class 'pandas.core.frame.DataFrame'>
Alaska (18, 4) 2014-07-31 00:00:00 2015-12-31 00:00:00 <class 'pandas.core.frame.DataFrame'>
Arizona (195, 4) 2000-01-31 00:00:00 2016-03-31 00:00:00 <class 'pandas.core.frame.DataFrame'>
Arkansas (111, 4) 2007-01-31 00:00:00 2016-03-31 00:00:00 <class 'pandas.core.frame.DataFrame'>
California (196, 4) 2000-01-31 00:00:00 2016-04-30 00:00:00 <class 'pandas.core.frame.DataFrame'>
Colorado (195, 4) 2000-01-31 00:00:00 2016-03-31 00:00:00 <class 'pandas.core.frame.DataFrame'>
Connecticut (117, 4) 2006-04-30 00:00:00 2015-12-31 00:00:00 <class 'pandas.core.frame.DataFrame'>
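As you can see, the states have different numbers of measurements, and they span different time periods. I am trying to create an animation that shows how these pollutants change over the whole time span, and it would be much easier if all of these DataFrames covered the same period, padded with NaN for the months in which a given state has no measurements. I have been looking at the resample method in pandas, but I cannot figure out how to specify the date range.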
Answer 0 (score: 1)
Try:
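# poll is the DataFrame indexed by ('State', 'Date Local') from the question
all_dates = poll.index.levels[1]
# One shared monthly axis (month starts) covering every state
date_range = pd.date_range(all_dates.min(), all_dates.max(), freq='MS')
flt = (poll.groupby('State')
           .apply(lambda x: x.reset_index(level=1)
                             .resample('MS', on='Date Local')
                             .mean()
                             .reindex(date_range))  # missing months become NaN
       )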
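To make the resample-then-reindex idea above easier to try out, here is a minimal self-contained sketch. The toy data, the single column `V1`, and the helper name `monthly_on_common_axis` are assumptions made for illustration and are not taken from the original dataset. It also rounds the first and last dates down to the start of their month, so the shared axis always includes the first month even when the earliest measurement does not fall on the 1st.

```python
import pandas as pd

# Hypothetical toy data standing in for dataset.csv
raw = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Alaska", "Alaska"],
    "Date Local": pd.to_datetime(
        ["2014-01-10", "2014-01-20", "2014-03-05", "2014-02-15", "2014-04-01"]
    ),
    "V1": [1.0, 3.0, 5.0, 10.0, 20.0],
})

# Shared monthly axis: floor both ends to the first day of their month
first = raw["Date Local"].min().to_period("M").to_timestamp()
last = raw["Date Local"].max().to_period("M").to_timestamp()
months = pd.date_range(first, last, freq="MS")


def monthly_on_common_axis(df):
    """Monthly mean of V1 for one state, padded with NaN to the shared axis."""
    return (df.set_index("Date Local")[["V1"]]
              .resample("MS")
              .mean()
              .reindex(months))


# One aligned DataFrame per state, all with the same index
aligned = {state: monthly_on_common_axis(group)
           for state, group in raw.groupby("State")}

# Optionally stack them back into one frame indexed by (State, Date Local)
combined = pd.concat(aligned, names=["State", "Date Local"])
print(combined)
```

Every entry in `aligned` then has the same number of rows, which is the property the animation needs.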
Answer 1 (score: 0)
So, I did the following, and it seems to work:
import pandas as pd

# Parse the date column so it can be grouped by calendar month later
poll = pd.read_csv('dataset.csv', parse_dates=['Date Local'])
# numeric_only=True skips any non-numeric columns when averaging
poll = poll.groupby(['State', 'Date Local']).mean(numeric_only=True)
states = poll.index.levels[0]  # All the states
poll_grouped = poll.groupby(level='State')
# Iterate through each state, aggregate monthly, and track the overall date span
# ('M' means month end; pandas 2.2+ spells it 'ME')
measures = list()
min_time = max_time = None
for s in states:
    flt = (poll_grouped.get_group(s)
           .groupby(pd.Grouper(level='Date Local', freq='M'))
           .agg({'V1': 'mean', 'V2': 'mean', 'V3': 'mean', 'V4': 'mean'}))
    min_time = flt.index.min() if min_time is None else min(min_time, flt.index.min())
    max_time = flt.index.max() if max_time is None else max(max_time, flt.index.max())
    measures.append(flt)
# Create one shared date range and reindex every state's DataFrame to it.
dr = pd.date_range(start=min_time, end=max_time, freq='M')
for i in range(len(measures)):
    measures[i] = measures[i].reindex(dr)
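As a possible alternative to keeping a list of per-state DataFrames, the same alignment can be sketched with a wide table: one row per month and one column per state, so that each row could serve as one animation frame. This is a different technique from the loop above, not the author's method, and the toy data and the single column `V1` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for dataset.csv
raw = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Alaska"],
    "Date Local": pd.to_datetime(
        ["2014-01-10", "2014-01-20", "2014-04-05", "2014-04-01"]
    ),
    "V1": [1.0, 3.0, 5.0, 20.0],
})

# Monthly mean per state, labelled by the first day of each month
monthly = (raw.groupby(["State", pd.Grouper(key="Date Local", freq="MS")])["V1"]
              .mean())

# Move states into columns; asfreq then inserts any month that no state
# reported at all, so the index is a gap-free monthly axis
wide = monthly.unstack("State").asfreq("MS")
print(wide)
```

Here February and March appear as all-NaN rows, because `asfreq` fills the months that are absent from the data entirely.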