I have some data in which several columns hold annual cumulative sums (YTD totals). I want to change those columns so they hold the quarter-on-quarter differences instead of the cumulative sums. The data looks like this:
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2009-05-07 16:00:00 2 1149 NaN 1 2008
2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
2009-08-18 00:00:00 3 2214 NaN 2 2008
2008-09-30 2008-11-06 00:00:00 1 3402 3763 3 2008
2009-11-07 00:00:00 3 3277 NaN 3 2008
2008-12-31 2009-02-17 16:00:00 1 NaN 4959 4 2008
2009-02-18 00:00:00 3 4202 NaN 4 2008
2010-03-21 00:00:00 5 4031 NaN 4 2008
2009-03-31 2009-05-07 16:00:00 1 942 1441 1 2009
2009-06-30 2009-08-06 00:00:00 1 1748 3017 2 2009
2009-09-30 2009-11-07 00:00:00 1 2458 4423 3 2009
2009-12-31 2010-02-24 16:00:00 1 3181 5598 4 2009
2010-03-31 2010-05-07 16:00:00 1 677 1172 1 2010
2010-06-30 2010-08-06 08:00:00 1 1392 2441 2 2010
2010-09-30 2010-11-08 16:00:00 1 1760 3150 3 2010
2010-12-31 2011-02-24 08:00:00 1 961 3946 4 2010
2011-03-31 2011-05-06 16:00:00 1 310 390 1 2011
2012-08-16 16:00:00 2 319 NaN 1 2011
2011-06-30 2011-08-09 08:00:00 1 465 730 2 2011
2012-08-16 16:00:00 2 443 NaN 2 2011
2011-09-30 2011-11-09 00:00:00 1 394 1222 3 2011
2012-11-06 16:00:00 2 411 NaN 3 2011
2011-12-31 2012-03-06 00:00:00 1 -5725 1785 4 2011
2013-03-05 00:00:00 2 -5754 NaN 4 2011
2012-03-31 2012-05-05 16:00:00 1 42 540 1 2012
2012-08-16 16:00:00 2 10 NaN 1 2012
2012-06-30 2012-08-02 16:00:00 1 -294 999 2 2012
2012-09-30 2012-11-06 16:00:00 1 -675 1785 3 2012
2012-12-31 2013-03-05 00:00:00 1 -219 2708 4 2012
So, within a given FYEARQ, I need to take the differences between FQTRs according to the 'FQTR' and 'Published' index and get the result back as a single frame. My attempt works as long as an item does not have more than one value ('NIY', 'OANCFY') for a given Reference date.
import pandas as pd

# cftmp is the frame shown above
cfgtmp = cftmp.groupby('FYEARQ')
ft = dict()
for group_name, subdf in cftmp.dropna().drop_duplicates().groupby('FYEARQ'):
    # keep the first row as-is, turn the remaining rows into deltas
    tmp = pd.concat([subdf.head(1), subdf.diff()]).dropna()
    tmp['FQTR'] = subdf['FQTR']
    tmp['FYEARQ'] = subdf['FYEARQ']
    tmp['ReportNumber'] = subdf['ReportNumber']
    ft.update({group_name : tmp})
    print group_name
    print 'differences'
    print tmp
    print ' '
The pd.concat attempt handles the differences between quarters ('FQTR'). It returns:
2007
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2007-12-31 2008-02-21 1 3131 3073 4 2007
2008
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2008-03-31 2008-05-08 1 1189 1482 1 2008
2008-06-30 2008-08-07 1 1107 1011 2 2008
2008-09-30 2008-11-06 1 1106 1270 3 2008
2009
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2009-03-31 2009-05-07 16:00:00 1 942 1441 1 2009
2009-06-30 2009-08-06 00:00:00 1 806 1576 2 2009
2009-09-30 2009-11-07 00:00:00 1 710 1406 3 2009
2009-12-31 2010-02-24 16:00:00 1 723 1175 4 2009
2010
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2010-03-31 2010-05-07 16:00:00 1 677 1172 1 2010
2010-06-30 2010-08-06 08:00:00 1 715 1269 2 2010
2010-09-30 2010-11-08 16:00:00 1 368 709 3 2010
2010-12-31 2011-02-24 08:00:00 1 -799 796 4 2010
2011
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2011-03-31 2011-05-06 16:00:00 1 310 390 1 2011
2011-06-30 2011-08-09 08:00:00 1 155 340 2 2011
2011-09-30 2011-11-09 00:00:00 1 -71 492 3 2011
2011-12-31 2012-03-06 00:00:00 1 -6119 563 4 2011
2012
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2012-03-31 2012-05-05 16:00:00 1 42 540 1 2012
2012-06-30 2012-08-02 16:00:00 1 -336 459 2 2012
2012-09-30 2012-11-06 16:00:00 1 -381 786 3 2012
2012-12-31 2013-03-05 00:00:00 1 456 923 4 2012
The problem with this solution is that it only works for 'ReportNumber' == 1. I then flatten the result into a single frame with pd.concat:
pd.concat([ft[f] for f in ft])
Any suggestions?
Answer 0 (score: 1)
The cftmp.dropna() part of your code drops the rows whose ReportNumber is not 1; in your sample frame those rows happen to have NaN in OANCFY. However, to avoid the loop you can do the following: take the first observation of each group with head, get the deltas with diff(), and concat them back together.
In [71]:
newdf = pd.concat((df.groupby(['FYEARQ', 'ReportNumber']).head(1),
                   df.groupby(['FYEARQ', 'ReportNumber']).diff().dropna())).reset_index()\
          .sort('Reference')\
          .dropna(subset=['OANCFY'])\
          .reset_index(drop=True)\
          .fillna(method='pad')
newdf['FQTR'] = newdf.FQTR.groupby(newdf.FYEARQ).cumsum()
print newdf
Reference Published FQTR FYEARQ NIY OANCFY ReportNumber
0 2007-12-31 2008-02-21 00:00:00 4 2007 3131 3073 1
1 2008-03-31 2008-05-08 00:00:00 1 2008 1189 1482 1
2 2008-06-30 2008-08-07 00:00:00 2 2008 1107 1011 1
3 2008-09-30 2008-11-06 00:00:00 3 2008 1106 1270 1
4 2009-03-31 2009-05-07 16:00:00 1 2009 942 1441 1
5 2009-06-30 2009-08-06 00:00:00 2 2009 806 1576 1
6 2009-09-30 2009-11-07 00:00:00 3 2009 710 1406 1
7 2009-12-31 2010-02-24 16:00:00 4 2009 723 1175 1
8 2010-03-31 2010-05-07 16:00:00 1 2010 677 1172 1
9 2010-06-30 2010-08-06 08:00:00 2 2010 715 1269 1
10 2010-09-30 2010-11-08 16:00:00 3 2010 368 709 1
11 2010-12-31 2011-02-24 08:00:00 4 2010 -799 796 1
12 2011-03-31 2011-05-06 16:00:00 1 2011 310 390 1
13 2011-06-30 2011-08-09 08:00:00 2 2011 155 340 1
14 2011-09-30 2011-11-09 00:00:00 3 2011 -71 492 1
15 2011-12-31 2012-03-06 00:00:00 4 2011 -6119 563 1
16 2012-03-31 2012-05-05 16:00:00 1 2012 42 540 1
17 2012-06-30 2012-08-02 16:00:00 2 2012 -336 459 1
18 2012-09-30 2012-11-06 16:00:00 3 2012 -381 786 1
19 2012-12-31 2013-03-05 00:00:00 4 2012 456 923 1
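
Note that DataFrame.sort() has since been removed from pandas and fillna(method='pad') has been deprecated. The snippet below is only a sketch of the same pipeline spelled with the current method names (sort_values, ffill); it assumes the frame indexed by (Reference, Published) from the question is still called df and is not part of the original answer.

import pandas as pd

# Same idea as above with current pandas method names (a sketch, not the original code).
grouped = df.groupby(['FYEARQ', 'ReportNumber'])
newdf = (pd.concat([grouped.head(1),            # first observation of each (year, report) group
                    grouped.diff().dropna()])   # within-group deltas for the later quarters
           .reset_index()
           .sort_values('Reference')            # replaces .sort('Reference')
           .dropna(subset=['OANCFY'])
           .reset_index(drop=True)
           .ffill())                            # replaces .fillna(method='pad')
# FQTR in the delta rows holds the quarter-number difference, so a per-year
# cumulative sum restores the actual quarter numbers.
newdf['FQTR'] = newdf.groupby('FYEARQ')['FQTR'].cumsum()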
Answer 1 (score: 1)
I'm not completely sure from your problem description exactly what you want, but I'll assume the following is what you're looking for.
Here's how to get started: sort the DataFrame on FQTR, so that the diff operation runs in quarter order, then group by ['FYEARQ', 'ReportNumber'] and apply pd.DataFrame.diff.
Here's an example. First, some nicely copy/pasteable data (the nan literals below are numpy.nan):
In [156]: df = pd.DataFrame(
    {'FQTR': {0: 4, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 1, 11: 2, 12: 3, 13: 4, 14: 1, 15: 2, 16: 3, 17: 4, 18: 1, 19: 1, 20: 2, 21: 2, 22: 3, 23: 3, 24: 4, 25: 4, 26: 1, 27: 1, 28: 2, 29: 3, 30: 4},
     'FYEARQ': {0: 2007, 1: 2008, 2: 2008, 3: 2008, 4: 2008, 5: 2008, 6: 2008, 7: 2008, 8: 2008, 9: 2008, 10: 2009, 11: 2009, 12: 2009, 13: 2009, 14: 2010, 15: 2010, 16: 2010, 17: 2010, 18: 2011, 19: 2011, 20: 2011, 21: 2011, 22: 2011, 23: 2011, 24: 2011, 25: 2011, 26: 2012, 27: 2012, 28: 2012, 29: 2012, 30: 2012},
     'NIY': {0: 3131.0, 1: 1189.0, 2: 1149.0, 3: 2296.0, 4: 2214.0, 5: 3402.0, 6: 3277.0, 7: nan, 8: 4202.0, 9: 4031.0, 10: 942.0, 11: 1748.0, 12: 2458.0, 13: 3181.0, 14: 677.0, 15: 1392.0, 16: 1760.0, 17: 961.0, 18: 310.0, 19: 319.0, 20: 465.0, 21: 443.0, 22: 394.0, 23: 411.0, 24: -5725.0, 25: -5754.0, 26: 42.0, 27: 10.0, 28: -294.0, 29: -675.0, 30: -219.0},
     'OANCFY': {0: 3073.0, 1: 1482.0, 2: nan, 3: 2493.0, 4: nan, 5: 3763.0, 6: nan, 7: 4959.0, 8: nan, 9: nan, 10: 1441.0, 11: 3017.0, 12: 4423.0, 13: 5598.0, 14: 1172.0, 15: 2441.0, 16: 3150.0, 17: 3946.0, 18: 390.0, 19: nan, 20: 730.0, 21: nan, 22: 1222.0, 23: nan, 24: 1785.0, 25: nan, 26: 540.0, 27: nan, 28: 999.0, 29: 1785.0, 30: 2708.0},
     'Published': {0: '2008-02-21 00:00:00', 1: '2008-05-08 00:00:00', 2: '2009-05-07 16:00:00', 3: '2008-08-07 00:00:00', 4: '2009-08-18 00:00:00', 5: '2008-11-06 00:00:00', 6: '2009-11-07 00:00:00', 7: '2009-02-17 16:00:00', 8: '2009-02-18 00:00:00', 9: '2010-03-21 00:00:00', 10: '2009-05-07 16:00:00', 11: '2009-08-06 00:00:00', 12: '2009-11-07 00:00:00', 13: '2010-02-24 16:00:00', 14: '2010-05-07 16:00:00', 15: '2010-08-06 08:00:00', 16: '2010-11-08 16:00:00', 17: '2011-02-24 08:00:00', 18: '2011-05-06 16:00:00', 19: '2012-08-16 16:00:00', 20: '2011-08-09 08:00:00', 21: '2012-08-16 16:00:00', 22: '2011-11-09 00:00:00', 23: '2012-11-06 16:00:00', 24: '2012-03-06 00:00:00', 25: '2013-03-05 00:00:00', 26: '2012-05-05 16:00:00', 27: '2012-08-16 16:00:00', 28: '2012-08-02 16:00:00', 29: '2012-11-06 16:00:00', 30: '2013-03-05 00:00:00'},
     'Reference': {0: '2007-12-31', 1: '2008-03-31', 2: '2008-03-31', 3: '2008-06-30', 4: '2008-06-30', 5: '2008-09-30', 6: '2008-09-30', 7: '2008-12-31', 8: '2008-12-31', 9: '2008-12-31', 10: '2009-03-31', 11: '2009-06-30', 12: '2009-09-30', 13: '2009-12-31', 14: '2010-03-31', 15: '2010-06-30', 16: '2010-09-30', 17: '2010-12-31', 18: '2011-03-31', 19: '2011-03-31', 20: '2011-06-30', 21: '2011-06-30', 22: '2011-09-30', 23: '2011-09-30', 24: '2011-12-31', 25: '2011-12-31', 26: '2012-03-31', 27: '2012-03-31', 28: '2012-06-30', 29: '2012-09-30', 30: '2012-12-31'},
     'ReportNumber': {0: 1, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 3, 7: 1, 8: 3, 9: 5, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 2, 22: 1, 23: 2, 24: 1, 25: 2, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1}})
In [165]: df.head()
Out[165]:
Reference Published ReportNumber NIY OANCFY FQTR FYEARQ
0 2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
1 2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2 2008-03-31 2009-05-07 16:00:00 2 1149 NaN 1 2008
3 2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
4 2008-06-30 2009-08-18 00:00:00 3 2214 NaN 2 2008
Now apply the logic described above:
In [166]: diffs = df.sort('FQTR').groupby(['FYEARQ', 'ReportNumber'])[['NIY', 'OANCFY']].apply(pd.DataFrame.diff)
Rename the resulting columns:
In [177]: diffs.columns = ['%s_diff' % col for col in diffs.columns]
And, if needed, join the diff columns back onto the original df:
In [180]: with_diffs = pd.concat([df, diffs], axis=1)
This gives you something very close to what you want:
In [182]: with_diffs[['Published', 'ReportNumber', 'NIY', 'FQTR', 'FYEARQ', 'NIY_diff']]
Out[182]:
Published ReportNumber NIY FQTR FYEARQ NIY_diff
0 2008-02-21 00:00:00 1 3131 4 2007 NaN
1 2008-05-08 00:00:00 1 1189 1 2008 NaN
2 2009-05-07 16:00:00 2 1149 1 2008 NaN
3 2008-08-07 00:00:00 1 2296 2 2008 1107
4 2009-08-18 00:00:00 3 2214 2 2008 NaN
5 2008-11-06 00:00:00 1 3402 3 2008 1106
6 2009-11-07 00:00:00 3 3277 3 2008 1063
7 2009-02-17 16:00:00 1 NaN 4 2008 NaN
8 2009-02-18 00:00:00 3 4202 4 2008 925
9 2010-03-21 00:00:00 5 4031 4 2008 NaN
10 2009-05-07 16:00:00 1 942 1 2009 NaN
11 2009-08-06 00:00:00 1 1748 2 2009 806
12 2009-11-07 00:00:00 1 2458 3 2009 710
13 2010-02-24 16:00:00 1 3181 4 2009 723
14 2010-05-07 16:00:00 1 677 1 2010 NaN
15 2010-08-06 08:00:00 1 1392 2 2010 715
16 2010-11-08 16:00:00 1 1760 3 2010 368
17 2011-02-24 08:00:00 1 961 4 2010 -799
18 2011-05-06 16:00:00 1 310 1 2011 NaN
19 2012-08-16 16:00:00 2 319 1 2011 NaN
20 2011-08-09 08:00:00 1 465 2 2011 155
21 2012-08-16 16:00:00 2 443 2 2011 124
22 2011-11-09 00:00:00 1 394 3 2011 -71
23 2012-11-06 16:00:00 2 411 3 2011 -32
24 2012-03-06 00:00:00 1 -5725 4 2011 -6119
25 2013-03-05 00:00:00 2 -5754 4 2011 -6165
26 2012-05-05 16:00:00 1 42 1 2012 NaN
27 2012-08-16 16:00:00 2 10 1 2012 NaN
28 2012-08-02 16:00:00 1 -294 2 2012 -336
29 2012-11-06 16:00:00 1 -675 3 2012 -381
30 2013-03-05 00:00:00 1 -219 4 2012 456
Obviously there is still a bit of tidying up to do (for example, dropping the NaN values if you want), but that is left as an exercise!
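
One possible way to do part of that tidying, as a sketch on top of the with_diffs frame above (not part of the original answer): in the first fiscal quarter the YTD figure is already the quarterly figure, so it can be copied into the _diff columns before dropping the rows that remain unresolved.

# Sketch only: in Q1 the YTD value equals the quarterly value, so copy it across.
first_quarter = with_diffs['FQTR'] == 1
for col in ['NIY', 'OANCFY']:
    with_diffs.loc[first_quarter, col + '_diff'] = with_diffs.loc[first_quarter, col]

# Rows whose diff is still NaN (e.g. a ReportNumber that first appears mid-year)
# cannot be computed from this data alone and can simply be dropped:
quarterly = with_diffs.dropna(subset=['NIY_diff', 'OANCFY_diff'])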