I have a DataFrame with a MultiIndex, as shown below:
>>> dfNew.head()
status shopping TUFNWGTP
state date
6 2003-01-03 emp 0 8155462.672158
2003-01-03 emp 0 8155462.672158
2003-01-03 emp 0 8155462.672158
2003-01-04 emp 0 1735322.527819
2003-01-04 emp 0 1735322.527819
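You can't see it here, but status can take three values: emp, unemp, and NaN. This is state-day level data. I want to join in new state-level data that has a different frequency, and then aggregate/group over time.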
>>> test['foo'].head()
state date
1 2004-01-01 1985886
2 2004-01-01 301172
4 2004-01-01 2614525
5 2004-01-01 1180409
6 2004-01-01 16098932
Here is what I do:
dfNew = dfNew.join(test['foo'], method)
dfNew.reset_index(level=0, inplace=True)
doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}
aggASS = dfNew.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)
This should join in foo and create values on a 2-year basis. But what I get is:
>>> aggASS.head()
foo shopping TUFNWGTP
state date status
1 2003-01-01 emp 2007116.941176 2.910812e+12 4.500711e+09
unemp NaN 7.836728e+11 5.590089e+08
2005-01-01 emp 2062059.100000 2.026485e+12 4.440291e+09
unemp 2078869.000000 7.543956e+10 2.638597e+08
Look at the values of foo for status=emp and status=unemp at the same state and date: they differ, even though foo varies only by state and date.
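join does not do an outer join by default (the default for DataFrame.join is how='left', not inner), so this seems to be the problem. However, if I do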
>>> dfNew = dfNew.join(test['foo'], how='outer')
NotImplementedError: Index._join_level on non-unique index is not implemented
Yes, state-date is not unique here. But as far as I can tell, what I want still makes sense (doesn't it?). What would be an effective way to do this?
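One way to sidestep the non-unique index error is to merge on ordinary columns instead of joining on the index. The following is only a sketch on toy data: the frames daily and monthly and the helper column month are made up for illustration, and it assumes that each daily row should pick up the foo value of its calendar month.

```python
import pandas as pd

# Toy daily data with repeated (state, date) pairs, similar in shape to dfNew.
daily = pd.DataFrame({
    "state": [1, 1, 1, 1],
    "date": pd.to_datetime(["2004-01-03", "2004-01-03", "2004-02-10", "2004-03-05"]),
    "status": ["emp", "unemp", "emp", "emp"],
    "shopping": [0, 5, 10, 20],
})

# Toy monthly series, similar in shape to test['foo'].
monthly = pd.DataFrame({
    "state": [1, 1],
    "date": pd.to_datetime(["2004-01-01", "2004-02-01"]),
    "foo": [1985886, 1990082],
})

# Assumption: a daily row belongs to the month that contains its date.
daily["month"] = daily["date"].dt.to_period("M").dt.to_timestamp()

# A column-based left merge has no problem with duplicated keys on the left.
merged = daily.merge(
    monthly.rename(columns={"date": "month"}),
    on=["state", "month"],
    how="left",
)
print(merged)
```

Rows whose month has no entry in the monthly data (March 2004 here) end up with NaN in foo, which a later mean ignores.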
One suggested solution is to add them as columns:
Align the data frames using sort level:
>>> dfNew.head()
status shopping TUFNWGTP
state date
1 2003-01-01 emp 0 3227364.873298
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
>>> test['foo'].head()
state date
1 2004-01-01 1985886
2004-02-01 1990082
2004-03-01 1999936
2004-04-01 2009556
2004-05-01 2009573
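Then we add the second time series as a column with dfNew.append(test['foo']). It was suggested that I use ignore_index=True, but since the index labels are correct, I don't think we need it.
However, this crashes my Python instance. Here are the sizes of the data frames: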
>>> len(test['foo'])
6864
>>> len(dfNew)
404394
dfNew: http://pastebin.com/rJjh6ZSc
test: http://pastebin.com/Er70XD9y

Answer 0 (score: 1)
Here are some steps I took. Hopefully they point you toward a solution.
I recreated the MultiIndex data frame and the time series you provided:
In [118]: newdf
Out[118]:
0 1 2
state date
1 2003-01-01 emp 0 3227364.873298
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-02 NaN 0 5834127.649776
2003-01-02 NaN 0 5834127.649776
2003-01-04 emp 2100942000 1506051.861585
2003-01-04 emp 2100942000 1506051.861585
2003-01-04 emp 5412841000 1204191.605090
2003-01-04 emp 5412841000 1204191.605090
2003-01-04 emp 5412841000 1204191.605090
2003-01-05 NaN 0 1765953.711812
2003-01-05 NaN 0 1765953.711812
2003-01-05 emp 0 1434858.212964
2003-01-05 emp 0 1434858.212964
2003-01-05 emp 0 1434858.212964
2003-01-05 emp 0 1811326.258197
2003-01-05 emp 0 1811326.258197
2003-01-05 NaN 0 1908483.149300
2003-01-05 NaN 0 1908483.149300
2003-01-06 NaN 1298934000 4190110.086256
2003-01-07 NaN 0 6241047.457860
2003-01-07 NaN 0 6241047.457860
2003-01-07 NaN 0 6241047.457860
2003-01-07 NaN 0 6241047.457860
2003-01-08 emp 715231400 4614396.137509
2003-01-08 emp 715231400 4614396.137509
2003-01-08 emp 715231400 4614396.137509
2 2013-08-01 emp 0 10571046.129186
2013-08-01 emp 0 10571046.129186
2013-08-01 emp 0 10571046.129186
2013-08-01 emp 0 10571046.129186
2013-08-27 NaN 6804297000 3376822.385266
2013-08-27 NaN 6804297000 3376822.385266
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-10-18 emp 0 14402621.620998
2013-10-18 emp 0 14402621.620998
2013-11-02 unemp 0 7778017.482167
2013-11-02 unemp 0 7778017.482167
2013-11-02 unemp 0 7778017.482167
2013-11-09 NaN 0 2164565.290873
2013-11-09 NaN 0 2164565.290873
2013-11-10 emp 527859500 1759531.507169
2013-11-10 emp 527859500 1759531.507169
2013-11-24 emp 0 3050339.003118
2013-11-24 emp 0 3050339.003118
2013-11-24 emp 0 3050339.003118
2013-11-29 NaN 0 11224606.711441
2013-11-29 NaN 0 11224606.711441
2013-12-12 emp 0 13804339.863606
2013-12-12 emp 0 13804339.863606
2013-12-12 emp 0 13804339.863606
2013-12-12 emp 0 13804339.863606
In [120]: newfoo
Out[120]:
foo
state date
1 2004-01-01 1985886
2004-02-01 1990082
2004-03-01 1999936
2004-04-01 2009556
2004-05-01 2009573
2004-06-01 2013057
2004-07-01 2019963
2004-08-01 2015320
2004-09-01 2015103
2004-10-01 2035705
2004-11-01 2043152
2004-12-01 2041339
2005-01-01 2011219
2005-02-01 2014928
2005-03-01 2028597
2 2013-10-01 340483
2013-11-01 338445
2013-12-01 336903
2014-01-01 334565
2014-02-01 334667
2014-03-01 335922
2014-04-01 337188
2014-05-01 343958
2014-06-01 349122
2014-07-01 354911
2014-08-01 350833
2014-09-01 344849
2014-10-01 341434
2014-11-01 339866
2014-12-01 339203
I flattened the data frame and the time series:
In [147]: flattenednewdf
Out[147]:
state date status shopping TUFNWGTP
0 1 2003-01-01 emp 0 3227364.873298
1 1 2003-01-01 NaN 0 6841114.725821
2 1 2003-01-01 NaN 0 6841114.725821
3 1 2003-01-01 NaN 0 6841114.725821
4 1 2003-01-01 NaN 0 6841114.725821
5 1 2003-01-01 NaN 0 6841114.725821
6 1 2003-01-02 NaN 0 5834127.649776
7 1 2003-01-02 NaN 0 5834127.649776
8 1 2003-01-04 emp 2100942000 1506051.861585
9 1 2003-01-04 emp 2100942000 1506051.861585
10 1 2003-01-04 emp 5412841000 1204191.605090
11 1 2003-01-04 emp 5412841000 1204191.605090
12 1 2003-01-04 emp 5412841000 1204191.605090
13 1 2003-01-05 NaN 0 1765953.711812
14 1 2003-01-05 NaN 0 1765953.711812
15 1 2003-01-05 emp 0 1434858.212964
16 1 2003-01-05 emp 0 1434858.212964
17 1 2003-01-05 emp 0 1434858.212964
18 1 2003-01-05 emp 0 1811326.258197
19 1 2003-01-05 emp 0 1811326.258197
20 1 2003-01-05 NaN 0 1908483.149300
21 1 2003-01-05 NaN 0 1908483.149300
22 1 2003-01-06 NaN 1298934000 4190110.086256
23 1 2003-01-07 NaN 0 6241047.457860
24 1 2003-01-07 NaN 0 6241047.457860
25 1 2003-01-07 NaN 0 6241047.457860
26 1 2003-01-07 NaN 0 6241047.457860
27 1 2003-01-08 emp 715231400 4614396.137509
28 1 2003-01-08 emp 715231400 4614396.137509
29 1 2003-01-08 emp 715231400 4614396.137509
30 2 2013-08-01 emp 0 10571046.129186
31 2 2013-08-01 emp 0 10571046.129186
32 2 2013-08-01 emp 0 10571046.129186
33 2 2013-08-01 emp 0 10571046.129186
34 2 2013-08-27 NaN 6804297000 3376822.385266
35 2 2013-08-27 NaN 6804297000 3376822.385266
36 2 2013-09-28 NaN 0 4645591.067481
37 2 2013-09-28 NaN 0 4645591.067481
38 2 2013-09-28 NaN 0 4645591.067481
39 2 2013-09-28 NaN 0 4645591.067481
40 2 2013-09-28 NaN 0 4645591.067481
41 2 2013-09-28 NaN 0 4645591.067481
42 2 2013-10-18 emp 0 14402621.620998
43 2 2013-10-18 emp 0 14402621.620998
44 2 2013-11-02 unemp 0 7778017.482167
45 2 2013-11-02 unemp 0 7778017.482167
46 2 2013-11-02 unemp 0 7778017.482167
47 2 2013-11-09 NaN 0 2164565.290873
48 2 2013-11-09 NaN 0 2164565.290873
49 2 2013-11-10 emp 527859500 1759531.507169
50 2 2013-11-10 emp 527859500 1759531.507169
51 2 2013-11-24 emp 0 3050339.003118
52 2 2013-11-24 emp 0 3050339.003118
53 2 2013-11-24 emp 0 3050339.003118
54 2 2013-11-29 NaN 0 11224606.711441
55 2 2013-11-29 NaN 0 11224606.711441
56 2 2013-12-12 emp 0 13804339.863606
57 2 2013-12-12 emp 0 13804339.863606
58 2 2013-12-12 emp 0 13804339.863606
59 2 2013-12-12 emp 0 13804339.863606
In [143]: flattenedfoo
Out[143]:
state date foo
0 1 2004-01-01 1985886
1 1 2004-02-01 1990082
2 1 2004-03-01 1999936
3 1 2004-04-01 2009556
4 1 2004-05-01 2009573
5 1 2004-06-01 2013057
6 1 2004-07-01 2019963
7 1 2004-08-01 2015320
8 1 2004-09-01 2015103
9 1 2004-10-01 2035705
10 1 2004-11-01 2043152
11 1 2004-12-01 2041339
12 1 2005-01-01 2011219
13 1 2005-02-01 2014928
14 1 2005-03-01 2028597
15 2 2013-10-01 340483
16 2 2013-11-01 338445
17 2 2013-12-01 336903
18 2 2014-01-01 334565
19 2 2014-02-01 334667
20 2 2014-03-01 335922
21 2 2014-04-01 337188
22 2 2014-05-01 343958
23 2 2014-06-01 349122
24 2 2014-07-01 354911
25 2 2014-08-01 350833
26 2 2014-09-01 344849
27 2 2014-10-01 341434
28 2 2014-11-01 339866
29 2 2014-12-01 339203
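The flattening step can be reproduced with reset_index, which moves the index levels into regular columns. This is a minimal sketch on a three-row toy frame; the values are placeholders, not the real data.

```python
import pandas as pd

# Toy frame with a (state, date) MultiIndex that contains duplicates.
idx = pd.MultiIndex.from_tuples(
    [
        (1, pd.Timestamp("2003-01-01")),
        (1, pd.Timestamp("2003-01-01")),
        (2, pd.Timestamp("2013-08-01")),
    ],
    names=["state", "date"],
)
newdf = pd.DataFrame(
    {"status": ["emp", None, "emp"], "shopping": [0, 0, 0]},
    index=idx,
)

# reset_index turns both index levels into columns and adds a 0..n-1 index.
flattenednewdf = newdf.reset_index()
print(flattenednewdf)
```

Because the result has a plain integer index, duplicated (state, date) pairs no longer get in the way of later operations.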
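I appended the time series to the data frame. I left the row and column counts at the bottom so you can check against the example you provided that this is the right data frame size: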
In [149]: final_df
Out[149]:
TUFNWGTP date foo shopping state status
0 3227364.873298 2003-01-01 NaN 0 1 emp
1 6841114.725821 2003-01-01 NaN 0 1 NaN
2 6841114.725821 2003-01-01 NaN 0 1 NaN
3 6841114.725821 2003-01-01 NaN 0 1 NaN
4 6841114.725821 2003-01-01 NaN 0 1 NaN
5 6841114.725821 2003-01-01 NaN 0 1 NaN
6 5834127.649776 2003-01-02 NaN 0 1 NaN
7 5834127.649776 2003-01-02 NaN 0 1 NaN
8 1506051.861585 2003-01-04 NaN 2100942000 1 emp
9 1506051.861585 2003-01-04 NaN 2100942000 1 emp
10 1204191.605090 2003-01-04 NaN 5412841000 1 emp
11 1204191.605090 2003-01-04 NaN 5412841000 1 emp
12 1204191.605090 2003-01-04 NaN 5412841000 1 emp
13 1765953.711812 2003-01-05 NaN 0 1 NaN
14 1765953.711812 2003-01-05 NaN 0 1 NaN
15 1434858.212964 2003-01-05 NaN 0 1 emp
16 1434858.212964 2003-01-05 NaN 0 1 emp
17 1434858.212964 2003-01-05 NaN 0 1 emp
18 1811326.258197 2003-01-05 NaN 0 1 emp
19 1811326.258197 2003-01-05 NaN 0 1 emp
20 1908483.149300 2003-01-05 NaN 0 1 NaN
21 1908483.149300 2003-01-05 NaN 0 1 NaN
22 4190110.086256 2003-01-06 NaN 1298934000 1 NaN
23 6241047.457860 2003-01-07 NaN 0 1 NaN
24 6241047.457860 2003-01-07 NaN 0 1 NaN
25 6241047.457860 2003-01-07 NaN 0 1 NaN
26 6241047.457860 2003-01-07 NaN 0 1 NaN
27 4614396.137509 2003-01-08 NaN 715231400 1 emp
28 4614396.137509 2003-01-08 NaN 715231400 1 emp
29 4614396.137509 2003-01-08 NaN 715231400 1 emp
.. ... ... ... ... ... ...
0 NaN 2004-01-01 1985886 NaN 1 NaN
1 NaN 2004-02-01 1990082 NaN 1 NaN
2 NaN 2004-03-01 1999936 NaN 1 NaN
3 NaN 2004-04-01 2009556 NaN 1 NaN
4 NaN 2004-05-01 2009573 NaN 1 NaN
5 NaN 2004-06-01 2013057 NaN 1 NaN
6 NaN 2004-07-01 2019963 NaN 1 NaN
7 NaN 2004-08-01 2015320 NaN 1 NaN
8 NaN 2004-09-01 2015103 NaN 1 NaN
9 NaN 2004-10-01 2035705 NaN 1 NaN
10 NaN 2004-11-01 2043152 NaN 1 NaN
11 NaN 2004-12-01 2041339 NaN 1 NaN
12 NaN 2005-01-01 2011219 NaN 1 NaN
13 NaN 2005-02-01 2014928 NaN 1 NaN
14 NaN 2005-03-01 2028597 NaN 1 NaN
15 NaN 2013-10-01 340483 NaN 2 NaN
16 NaN 2013-11-01 338445 NaN 2 NaN
17 NaN 2013-12-01 336903 NaN 2 NaN
18 NaN 2014-01-01 334565 NaN 2 NaN
19 NaN 2014-02-01 334667 NaN 2 NaN
20 NaN 2014-03-01 335922 NaN 2 NaN
21 NaN 2014-04-01 337188 NaN 2 NaN
22 NaN 2014-05-01 343958 NaN 2 NaN
23 NaN 2014-06-01 349122 NaN 2 NaN
24 NaN 2014-07-01 354911 NaN 2 NaN
25 NaN 2014-08-01 350833 NaN 2 NaN
26 NaN 2014-09-01 344849 NaN 2 NaN
27 NaN 2014-10-01 341434 NaN 2 NaN
28 NaN 2014-11-01 339866 NaN 2 NaN
29 NaN 2014-12-01 339203 NaN 2 NaN
[90 rows x 6 columns]
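The appending step stacks rows, so every column that exists in only one of the two frames is filled with NaN in the rows coming from the other. Here is a sketch with toy values; it uses pd.concat because DataFrame.append was removed in pandas 2.0, and the names flat_df and flat_foo are placeholders.

```python
import pandas as pd

# Toy stand-ins for the flattened data frame and the flattened time series.
flat_df = pd.DataFrame({
    "state": [1, 1],
    "date": pd.to_datetime(["2003-01-01", "2003-01-04"]),
    "status": ["emp", None],
    "shopping": [0, 2100942000],
    "TUFNWGTP": [3227364.873298, 1506051.861585],
})
flat_foo = pd.DataFrame({
    "state": [1],
    "date": pd.to_datetime(["2004-01-01"]),
    "foo": [1985886],
})

# sort=True orders the columns alphabetically; row labels are kept as they
# were, so index 0 appears once per input frame.
final_df = pd.concat([flat_df, flat_foo], sort=True)
print(final_df)
```

Passing ignore_index=True would renumber the rows 0..n-1 instead of repeating the original labels.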
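Building time bins is new to me, but to use the method you provided I had to set the index back to the date column. I created a new data frame because much of this process was experimental and I didn't want to rebuild the old one: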
final_df_2 = final_df.set_index(['date'])
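From this point on you should be able to do whatever calculations you want. I ran a few based on your code below, but the problem is that we are grouping selectively, so the results look odd: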
In [187]: doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}
In [188]: aggASS = final_df_2.groupby([pd.TimeGrouper("2AS", label='left')]).agg(doWhat)
In [189]: aggASS
Out[189]:
foo shopping TUFNWGTP
date
2003-01-01 2014889.333333 23885035200 1.139995e+08
2005-01-01 2018248.000000 NaN NaN
2013-01-01 341489.933333 14664313000 2.237165e+08
In [190]: aggASS = final_df_2.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)
In [191]: aggASS
Out[191]:
foo shopping TUFNWGTP
state date status
1 2003-01-01 emp NaN 22586101200 3.162246e+07
2 2013-01-01 emp NaN 1055719000 1.389769e+08
unemp NaN 0 2.333405e+07
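A likely reason foo is NaN in the last result is that the appended foo rows have status NaN, and groupby drops rows whose key is NaN by default. One possible workaround, sketched here on toy data, is to aggregate foo by state and period without status and then merge it back. The sketch uses pd.Grouper in place of pd.TimeGrouper, which newer pandas versions no longer provide, and the "2YS" alias for a 2-year, year-start frequency; by_status and foo_by_state are made-up names.

```python
import numpy as np
import pandas as pd

# Toy version of final_df_2: two survey rows plus one appended foo row.
final_df_2 = pd.DataFrame(
    {
        "state": [1, 1, 1],
        "status": ["emp", "emp", np.nan],
        "shopping": [10.0, 20.0, np.nan],
        "foo": [np.nan, np.nan, 2000000.0],
    },
    index=pd.DatetimeIndex(["2003-01-04", "2004-01-01", "2004-06-01"], name="date"),
)

# Grouping by status drops the foo row (its status is NaN), so foo stays NaN.
naive = final_df_2.groupby(
    ["state", pd.Grouper(freq="2YS", label="left"), "status"]
).agg({"shopping": "sum", "foo": "mean"})

# Aggregate the survey columns by status, and foo by state and period only.
by_status = final_df_2.groupby(
    ["state", pd.Grouper(freq="2YS", label="left"), "status"]
)[["shopping"]].sum()
foo_by_state = final_df_2.groupby(
    ["state", pd.Grouper(freq="2YS", label="left")]
)["foo"].mean()

# Merge on plain columns so that every status row receives its state's foo.
result = by_status.reset_index().merge(
    foo_by_state.reset_index(), on=["state", "date"], how="left"
)
print(result)
```

Whether a plain mean over the monthly foo values is the right 2-year summary depends on what foo measures, so treat the aggregation choice as an assumption too.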
I read another post about grouping with the cut method. You can read it here: Grouping data by value ranges. I think you could use datetime object operations to build the 2-year buckets.
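As a rough illustration of the cut idea, the year of each date can be binned into 2-year buckets with pd.cut. The bin edges below are an assumption chosen to cover the sample dates, and each bucket is labelled by its first year.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2003-01-04", "2004-06-01", "2005-03-01", "2013-11-02"]
))

# Edges 2003, 2005, ..., 2015 define the buckets [2003, 2005), [2005, 2007), ...
edges = list(range(2003, 2017, 2))

# right=False makes each bucket include its left edge and exclude its right edge.
buckets = pd.cut(dates.dt.year, bins=edges, right=False, labels=edges[:-1])
print(buckets)
```

The resulting categorical can be used as a groupby key alongside state and status.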
Answer 1 (score: 1)
Here is the relevant part of @kennes913's answer, as an overview for future visitors:
import pandas as pd
from pandas.testing import assert_frame_equal

# flatten the data frames. For overview, just select one column each
df1flat = df.reset_index()[['state', 'date', 'TUFNWGTP']]
df2flat = df_emp.reset_index()[['state', 'date', 'foo']]
# the "merge" (DataFrame.append was removed in pandas 2.0; pd.concat stacks rows the same way)
X = pd.concat([df1flat, df2flat])
# now, recover the original data frames:
test1 = X.loc[X.foo.notna(), ['state', 'date', 'foo']]
# fix dtype which was lost in the merge
test1['state'] = test1['state'].astype(int)
# the original filtered on X.TUCASEID, which is not among the selected columns
test2 = X.loc[X.TUFNWGTP.notna(), ['state', 'date', 'TUFNWGTP']]
# check that nothing was lost (bar and foo presumably name the flattened originals):
print(assert_frame_equal(bar, test1))  # output: None
print(assert_frame_equal(foo, test2))  # output: None