I have several Pandas DataFrames sampled at different time intervals. One is at the daily level:
DatetimeIndex(['2007-12-01', '2007-12-02', '2007-12-03', '2007-12-04',
'2007-12-05', '2007-12-06', '2007-12-07', '2007-12-08',
'2007-12-09', '2007-12-10',
...
'2016-08-22', '2016-08-23', '2016-08-24', '2016-08-25',
'2016-08-26', '2016-08-27', '2016-08-28', '2016-08-29',
'2016-08-30', '2016-08-31'],
dtype='datetime64[ns]', length=3197, freq=None)
The others are at non-daily levels (always at a coarser resolution than daily). For example, this one is weekly:
DatetimeIndex(['2007-01-01', '2007-01-08', '2007-01-15', '2007-01-22',
'2007-01-29', '2007-02-05', '2007-02-12', '2007-02-19',
'2007-02-26', '2007-03-05',
...
'2010-03-08', '2010-03-15', '2010-03-22', '2010-03-29',
'2010-04-05', '2010-04-12', '2010-04-19', '2010-04-26',
'2010-05-03', 'NaT'],
dtype='datetime64[ns]', name='week', length=176, freq=None)
This one is monthly:
DatetimeIndex(['2013-04-01', '2013-05-01', '2013-06-01', '2013-07-01',
'2013-08-01', '2013-09-01', '2013-10-01', '2013-11-01',
'2013-12-01', '2014-01-01', '2014-02-01', '2014-03-01',
'2014-04-01', '2014-05-01', '2014-06-01', '2014-07-01',
'2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01',
'2014-12-01', '2015-01-01', '2015-02-01', '2015-03-01',
'2015-04-01', '2015-05-01', '2015-06-01', '2015-07-01',
'2015-08-01', '2015-09-01', '2015-10-01', '2015-11-01',
'2015-12-01', '2016-01-01', '2016-02-01', '2016-03-01',
'2016-04-01', '2016-05-01', '2016-06-01', '2016-07-01',
'2016-08-01'],
dtype='datetime64[ns]', name='month', freq=None)
And this one is just an oddball with irregular intervals:
DatetimeIndex(['2014-02-14', '2014-05-08', '2014-09-19', '2014-09-24',
'2015-01-21', '2016-05-26', '2016-06-02', '2016-06-04'],
dtype='datetime64[ns]', name='date', freq=None)
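What I need to do is resample (sum) the daily data to the intervals specified by the others. So if the DatetimeIndex is monthly, I need to resample the daily data to monthly. If it is weekly, the daily data should be resampled to weekly. If it is irregular, the result needs to match those irregular dates. I need this because I build statistical models from these data, and the ground truth has to line up with the observations.
How can I get Pandas to resample a DataFrame df1 so that it matches the DatetimeIndex of another arbitrary DataFrame df2? I have been searching, but I can't figure it out. It seems like a very common pandas task, so I must be missing something. Thanks!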
Answer 0 (score: 2)
Consider using pandas DataFrame.resample():
import numpy as np
import pandas as pd
# EXAMPLE DATA OF SEQUENTIAL DATES AND RANDOM NUMBERS
index = pd.date_range('12/01/2007', periods=3197, freq='D')
series = pd.Series(np.random.randint(0, 100, 3197), index=index)
df = pd.DataFrame({'num': series})
# num
# 2007-12-01 73
# 2007-12-02 17
# 2007-12-03 63
# 2007-12-04 72
# 2007-12-05 4
# 2007-12-06 91
# 2007-12-07 20
# 2007-12-08 99
# 2007-12-09 97
# 2007-12-10 33
wdf = df.resample('W-SAT').sum() # WEEKLY BINS ENDING ON (AND LABELED BY) SATURDAY
# num
# 2007-12-01 73
# 2007-12-08 366
# 2007-12-15 354
# 2007-12-22 302
# 2007-12-29 310
# 2008-01-05 323
# 2008-01-12 424
mdf = df.resample('MS').sum() # MONTH START
# num
# 2007-12-01 1568
# 2008-01-01 1465
# 2008-02-01 1317
# 2008-03-01 1473
# 2008-04-01 1762
# 2008-05-01 1698
# 2008-06-01 1345
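The weekly index in the question falls on Mondays (2007-01-01, 2007-01-08, ...), so the weekly bins may need a different anchor and labeling to line up with it. The following is a minimal sketch, assuming each timestamp in the target index marks the start of its week. The names daily, target, weekly and aligned are hypothetical stand-ins, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Daily data of constant 1s, so each weekly sum equals the number of days in the bin
daily = pd.DataFrame(
    {"num": np.ones(28, dtype="int64")},
    index=pd.date_range("2007-12-03", periods=28, freq="D"),  # 2007-12-03 is a Monday
)

# Hypothetical target index of Mondays, shaped like the weekly index in the question
target = pd.DatetimeIndex(
    ["2007-12-03", "2007-12-10", "2007-12-17", "2007-12-24"], name="week"
)

# closed="left", label="left": each bin covers [Monday, next Monday)
# and is labeled by the Monday that starts it
weekly = daily.resample("W-MON", closed="left", label="left").sum()

# Keep only the labels present in the target index;
# labels missing from the resampled data would come back as NaN
aligned = weekly.reindex(target)
print(aligned)
```

If the target timestamps instead mark the end of each period, the default closed/label settings of the anchored weekly frequency are probably the better starting point. The monthly index in the question consists of month-start dates, so 'MS' should already line up with it after a reindex().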
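For irregular intervals, use a custom function with DataFrame.apply() to create an enddt column that holds the end cutoff date of the interval containing the current row's date (e.g., 2015-01-01 gets an end date of 2015-01-21 from the DatetimeIndex), computed by filtering the index. Then run groupby() on the new enddt column to aggregate the sums: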
irrdt = pd.DatetimeIndex(['2014-02-14', '2014-05-08', '2014-09-19', '2014-09-24',
'2015-01-21', '2016-05-26', '2016-06-02', '2016-06-04'],
dtype='datetime64[ns]', name='date', freq=None)
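def findrng(row):
    # First cutoff strictly after this row's date ('NaT' if there is none)
    ed = str(irrdt[irrdt > row['Date']].min())[0:10]
    # Rows dated on or after the last cutoff fall back to the last cutoff
    row['enddt'] = ed if ed != 'NaT' else str(irrdt.max())[0:10]
    return row
df['Date'] = df.index
# Select 'num' so the helper 'Date' column is not summed
df = df.apply(findrng, axis=1).groupby('enddt')[['num']].sum()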
# num
# enddt
# 2014-02-14 112143
# 2014-05-08 3704
# 2014-09-19 5958
# 2014-09-24 365
# 2015-01-21 5730
# 2016-05-26 24126
# 2016-06-02 305
# 2016-06-04 4142
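Row-wise apply() can get slow on a long daily series. As an alternative sketch (not the method used in the answer), numpy.searchsorted can assign every daily row to its cutoff in one vectorized step. The helper name sum_to_boundaries is hypothetical, and it assumes the same convention as above: each row goes to the first cutoff strictly after its date, and rows dated on or after the last cutoff are folded into the last bin. It also drops NaT entries, since the weekly index in the question ends with one.

```python
import numpy as np
import pandas as pd

def sum_to_boundaries(daily, boundaries):
    """Sum rows of `daily` into bins labeled by the first boundary after each date.

    Hypothetical helper: rows dated on or after the last boundary
    are folded into the last bin.
    """
    # Drop NaT entries and make sure the boundaries are sorted
    bounds = pd.DatetimeIndex(boundaries).dropna().sort_values()
    # side="right" returns the position of the first boundary strictly
    # greater than each daily timestamp
    pos = np.searchsorted(bounds.values, daily.index.values, side="right")
    # Dates past the final boundary get len(bounds); clip them into the last bin
    pos = np.minimum(pos, len(bounds) - 1)
    labels = bounds[pos]
    return daily.groupby(labels).sum()

# Small example: ten days of 1s, two cutoffs plus a trailing NaT
daily = pd.DataFrame(
    {"num": np.ones(10, dtype="int64")},
    index=pd.date_range("2014-02-10", periods=10, freq="D"),
)
irregular = pd.DatetimeIndex(["2014-02-14", "2014-02-17", pd.NaT], name="date")

result = sum_to_boundaries(daily, irregular)
print(result)
```

Whether a row dated exactly on a cutoff belongs to the bin ending at that cutoff or to the next one depends on how the other DataFrame was built; switching side="right" to side="left" would flip that choice.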