pandas: outer join on a non-unique index

Time: 2015-03-11 14:23:04

Tags: python pandas

I have a DataFrame with a MultiIndex, as follows:

>>> dfNew.head()
                 status  shopping        TUFNWGTP
state date                                       
6     2003-01-03    emp         0  8155462.672158
      2003-01-03    emp         0  8155462.672158
      2003-01-03    emp         0  8155462.672158
      2003-01-04    emp         0  1735322.527819
      2003-01-04    emp         0  1735322.527819

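You can't see it here, but status can take three values: emp, unemp, or NaN. This is state-day level data. I want to join new state-level data that comes at a different frequency, and then aggregate/group over time.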

>>> test['foo'].head()
state  date      
1      2004-01-01     1985886
2      2004-01-01      301172
4      2004-01-01     2614525
5      2004-01-01     1180409
6      2004-01-01    16098932

Join without how='outer'

Here is what I do:

dfNew = dfNew.join(test['foo'])  # no how argument, so the default how='left' applies
dfNew.reset_index(level=0, inplace=True)
doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}
aggASS = dfNew.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)

  • What I want: join foo from the other data set for each state-date combination, and compute values over 2-year periods.

But what I get is:

>>> aggASS.head()
                                    foo      shopping      TUFNWGTP
state date       status                                            
1     2003-01-01 emp     2007116.941176  2.910812e+12  4.500711e+09
                 unemp              NaN  7.836728e+11  5.590089e+08
      2005-01-01 emp     2062059.100000  2.026485e+12  4.440291e+09
                 unemp   2078869.000000  7.543956e+10  2.638597e+08

Observe how, for the same state and date, the value of foo differs between status=emp and status=unemp.
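
On current pandas releases pd.TimeGrouper is no longer available, and pd.Grouper(freq=...) is its replacement. A minimal sketch of the same grouping, assuming the index has been reset so that state, date and status are ordinary columns. The toy values are made up for illustration, and "2YS" is used as the two-year, year-start frequency alias:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows shaped like dfNew after the join (values are made up).
df = pd.DataFrame({
    "state": [1, 1, 1, 1],
    "date": pd.to_datetime(["2003-01-03", "2004-06-01", "2005-02-01", "2006-03-01"]),
    "status": ["emp", "emp", "emp", "emp"],
    "shopping": [0, 10, 20, 30],
    "TUFNWGTP": [1.0, 2.0, 3.0, 4.0],
    "foo": [np.nan, 100.0, 200.0, 300.0],
})

# Group by state, two-year bins on the date column, and status.
agg = (
    df.groupby(["state", pd.Grouper(key="date", freq="2YS", label="left"), "status"])
      .agg({"shopping": "sum", "TUFNWGTP": "sum", "foo": "mean"})
)
print(agg)
```

Note that mean skips NaN values inside a bin, so a bin only shows NaN for foo when none of its rows received a foo value.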

Join with how='outer'

join defaults to how='left', so that seems to be the problem. However, if I ask for an outer join:

>>> dfNew = dfNew.join(test['foo'], how='outer')
NotImplementedError: Index._join_level on non-unique index is not implemented

Yes, state-date is not unique here. But as far as I can tell, what I want still makes sense (doesn't it?). What is an effective workaround here?
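
One way to sidestep the index-level join is to move the index levels into columns and use pd.merge, which accepts duplicate keys. This is only a sketch on toy data: the frame names left and right are hypothetical, and the second foo value is made up.

```python
import pandas as pd

# Toy stand-in for dfNew: the (state, date) index is deliberately non-unique.
left = pd.DataFrame({
    "state": [6, 6, 6],
    "date": pd.to_datetime(["2003-01-03", "2003-01-03", "2004-01-01"]),
    "status": ["emp", "emp", "unemp"],
}).set_index(["state", "date"])

# Toy stand-in for test['foo'] (second value is made up).
right = pd.Series(
    [16098932, 16100000],
    index=pd.MultiIndex.from_tuples(
        [(6, pd.Timestamp("2004-01-01")), (6, pd.Timestamp("2004-02-01"))],
        names=["state", "date"],
    ),
    name="foo",
)

# Outer merge on columns: keeps unmatched rows from both sides.
merged = pd.merge(
    left.reset_index(), right.reset_index(), on=["state", "date"], how="outer"
)
print(merged)
```

Rows of left without a matching foo get NaN in foo, and foo dates that never occur in left appear as extra rows with NaN in status.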

Suggested solution: append as columns

One suggested solution is to append them as columns:

Align the data frames by sorting the index levels:

>>> dfNew.head()
                 status  shopping        TUFNWGTP
state date                                       
1     2003-01-01    emp         0  3227364.873298
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
>>> test['foo'].head()
state  date      
1      2004-01-01    1985886
       2004-02-01    1990082
       2004-03-01    1999936
       2004-04-01    2009556
       2004-05-01    2009573
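Then we add the second time series as a column with dfNew.append(test['foo']). I was advised to pass ignore_index=True, but since the index labels are correct, I don't think we need it.

However, this crashes my Python instance. Here are the sizes of the data frames: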

>>> len(test['foo'])
6864
>>> len(dfNew)
404394
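
A note on the frequency mismatch: in the samples above, test['foo'] has one value per state and month (dated on the first of the month), while dfNew is daily, so an exact match on date can only succeed for rows that fall on the first of a month. One possible approach, assuming each daily row should take the foo value of its calendar month (an assumption, not something stated above), is to derive a month-start key and merge on that. The frame names daily and monthly and the shopping values are hypothetical:

```python
import pandas as pd

# Toy daily rows (shopping values are made up).
daily = pd.DataFrame({
    "state": [1, 1, 1],
    "date": pd.to_datetime(["2004-01-01", "2004-01-15", "2004-02-03"]),
    "shopping": [0, 5, 7],
})

# Monthly foo values taken from the sample output above.
monthly = pd.DataFrame({
    "state": [1, 1],
    "date": pd.to_datetime(["2004-01-01", "2004-02-01"]),
    "foo": [1985886, 1990082],
})

# Map every daily date to the first day of its month.
daily["month"] = daily["date"].dt.to_period("M").dt.to_timestamp()

# Left merge: every daily row keeps its place and picks up its month's foo.
joined = daily.merge(
    monthly.rename(columns={"date": "month"}), on=["state", "month"], how="left"
)
print(joined)
```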

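2 answers:

Answer 0: (score: 1)

Here are some steps I took. Hopefully they point you toward a solution.

I recreated the multi-index data frame and the time series you provided: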

In [118]: newdf
Out[118]: 
                      0           1                2
state date                                          
1     2003-01-01    emp           0   3227364.873298
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-02    NaN           0   5834127.649776
      2003-01-02    NaN           0   5834127.649776
      2003-01-04    emp  2100942000   1506051.861585
      2003-01-04    emp  2100942000   1506051.861585
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-05    NaN           0   1765953.711812
      2003-01-05    NaN           0   1765953.711812
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1811326.258197
      2003-01-05    emp           0   1811326.258197
      2003-01-05    NaN           0   1908483.149300
      2003-01-05    NaN           0   1908483.149300
      2003-01-06    NaN  1298934000   4190110.086256
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-08    emp   715231400   4614396.137509
      2003-01-08    emp   715231400   4614396.137509
      2003-01-08    emp   715231400   4614396.137509
2     2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-27    NaN  6804297000   3376822.385266
      2013-08-27    NaN  6804297000   3376822.385266
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-10-18    emp           0  14402621.620998
      2013-10-18    emp           0  14402621.620998
      2013-11-02  unemp           0   7778017.482167
      2013-11-02  unemp           0   7778017.482167
      2013-11-02  unemp           0   7778017.482167
      2013-11-09    NaN           0   2164565.290873
      2013-11-09    NaN           0   2164565.290873
      2013-11-10    emp   527859500   1759531.507169
      2013-11-10    emp   527859500   1759531.507169
      2013-11-24    emp           0   3050339.003118
      2013-11-24    emp           0   3050339.003118
      2013-11-24    emp           0   3050339.003118
      2013-11-29    NaN           0  11224606.711441
      2013-11-29    NaN           0  11224606.711441
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606

In [120]: newfoo
Out[120]: 
                      foo
state date               
1     2004-01-01  1985886
      2004-02-01  1990082
      2004-03-01  1999936
      2004-04-01  2009556
      2004-05-01  2009573
      2004-06-01  2013057
      2004-07-01  2019963
      2004-08-01  2015320
      2004-09-01  2015103
      2004-10-01  2035705
      2004-11-01  2043152
      2004-12-01  2041339
      2005-01-01  2011219
      2005-02-01  2014928
      2005-03-01  2028597
2     2013-10-01   340483
      2013-11-01   338445
      2013-12-01   336903
      2014-01-01   334565
      2014-02-01   334667
      2014-03-01   335922
      2014-04-01   337188
      2014-05-01   343958
      2014-06-01   349122
      2014-07-01   354911
      2014-08-01   350833
      2014-09-01   344849
      2014-10-01   341434
      2014-11-01   339866
      2014-12-01   339203

I flattened the data frame and the time series:

In [147]: flattenednewdf
Out[147]: 
    state       date status    shopping         TUFNWGTP
0       1 2003-01-01    emp           0   3227364.873298
1       1 2003-01-01    NaN           0   6841114.725821
2       1 2003-01-01    NaN           0   6841114.725821
3       1 2003-01-01    NaN           0   6841114.725821
4       1 2003-01-01    NaN           0   6841114.725821
5       1 2003-01-01    NaN           0   6841114.725821
6       1 2003-01-02    NaN           0   5834127.649776
7       1 2003-01-02    NaN           0   5834127.649776
8       1 2003-01-04    emp  2100942000   1506051.861585
9       1 2003-01-04    emp  2100942000   1506051.861585
10      1 2003-01-04    emp  5412841000   1204191.605090
11      1 2003-01-04    emp  5412841000   1204191.605090
12      1 2003-01-04    emp  5412841000   1204191.605090
13      1 2003-01-05    NaN           0   1765953.711812
14      1 2003-01-05    NaN           0   1765953.711812
15      1 2003-01-05    emp           0   1434858.212964
16      1 2003-01-05    emp           0   1434858.212964
17      1 2003-01-05    emp           0   1434858.212964
18      1 2003-01-05    emp           0   1811326.258197
19      1 2003-01-05    emp           0   1811326.258197
20      1 2003-01-05    NaN           0   1908483.149300
21      1 2003-01-05    NaN           0   1908483.149300
22      1 2003-01-06    NaN  1298934000   4190110.086256
23      1 2003-01-07    NaN           0   6241047.457860
24      1 2003-01-07    NaN           0   6241047.457860
25      1 2003-01-07    NaN           0   6241047.457860
26      1 2003-01-07    NaN           0   6241047.457860
27      1 2003-01-08    emp   715231400   4614396.137509
28      1 2003-01-08    emp   715231400   4614396.137509
29      1 2003-01-08    emp   715231400   4614396.137509
30      2 2013-08-01    emp           0  10571046.129186
31      2 2013-08-01    emp           0  10571046.129186
32      2 2013-08-01    emp           0  10571046.129186
33      2 2013-08-01    emp           0  10571046.129186
34      2 2013-08-27    NaN  6804297000   3376822.385266
35      2 2013-08-27    NaN  6804297000   3376822.385266
36      2 2013-09-28    NaN           0   4645591.067481
37      2 2013-09-28    NaN           0   4645591.067481
38      2 2013-09-28    NaN           0   4645591.067481
39      2 2013-09-28    NaN           0   4645591.067481
40      2 2013-09-28    NaN           0   4645591.067481
41      2 2013-09-28    NaN           0   4645591.067481
42      2 2013-10-18    emp           0  14402621.620998
43      2 2013-10-18    emp           0  14402621.620998
44      2 2013-11-02  unemp           0   7778017.482167
45      2 2013-11-02  unemp           0   7778017.482167
46      2 2013-11-02  unemp           0   7778017.482167
47      2 2013-11-09    NaN           0   2164565.290873
48      2 2013-11-09    NaN           0   2164565.290873
49      2 2013-11-10    emp   527859500   1759531.507169
50      2 2013-11-10    emp   527859500   1759531.507169
51      2 2013-11-24    emp           0   3050339.003118
52      2 2013-11-24    emp           0   3050339.003118
53      2 2013-11-24    emp           0   3050339.003118
54      2 2013-11-29    NaN           0  11224606.711441
55      2 2013-11-29    NaN           0  11224606.711441
56      2 2013-12-12    emp           0  13804339.863606
57      2 2013-12-12    emp           0  13804339.863606
58      2 2013-12-12    emp           0  13804339.863606
59      2 2013-12-12    emp           0  13804339.863606


In [143]: flattenedfoo
Out[143]: 
    state       date      foo
0       1 2004-01-01  1985886
1       1 2004-02-01  1990082
2       1 2004-03-01  1999936
3       1 2004-04-01  2009556
4       1 2004-05-01  2009573
5       1 2004-06-01  2013057
6       1 2004-07-01  2019963
7       1 2004-08-01  2015320
8       1 2004-09-01  2015103
9       1 2004-10-01  2035705
10      1 2004-11-01  2043152
11      1 2004-12-01  2041339
12      1 2005-01-01  2011219
13      1 2005-02-01  2014928
14      1 2005-03-01  2028597
15      2 2013-10-01   340483
16      2 2013-11-01   338445
17      2 2013-12-01   336903
18      2 2014-01-01   334565
19      2 2014-02-01   334667
20      2 2014-03-01   335922
21      2 2014-04-01   337188
22      2 2014-05-01   343958
23      2 2014-06-01   349122
24      2 2014-07-01   354911
25      2 2014-08-01   350833
26      2 2014-09-01   344849
27      2 2014-10-01   341434
28      2 2014-11-01   339866
29      2 2014-12-01   339203
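
The flattening call itself is not shown above. A minimal sketch, assuming it was done with reset_index (the same call the other answer uses), on three of the foo values from the output:

```python
import pandas as pd

# Rebuild a small MultiIndex series like newfoo.
idx = pd.MultiIndex.from_tuples(
    [
        (1, pd.Timestamp("2004-01-01")),
        (1, pd.Timestamp("2004-02-01")),
        (2, pd.Timestamp("2013-10-01")),
    ],
    names=["state", "date"],
)
foo = pd.Series([1985886, 1990082, 340483], index=idx, name="foo")

# reset_index moves both index levels into ordinary columns
# and leaves a default 0..n-1 integer index.
flattenedfoo = foo.reset_index()
print(flattenedfoo)
```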

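I appended the time series to the data frame. I left the row and column counts at the bottom so you can check against your example that this is the correct data frame size: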

In [149]: final_df
Out[149]: 
          TUFNWGTP       date      foo    shopping  state status
0   3227364.873298 2003-01-01      NaN           0      1    emp
1   6841114.725821 2003-01-01      NaN           0      1    NaN
2   6841114.725821 2003-01-01      NaN           0      1    NaN
3   6841114.725821 2003-01-01      NaN           0      1    NaN
4   6841114.725821 2003-01-01      NaN           0      1    NaN
5   6841114.725821 2003-01-01      NaN           0      1    NaN
6   5834127.649776 2003-01-02      NaN           0      1    NaN
7   5834127.649776 2003-01-02      NaN           0      1    NaN
8   1506051.861585 2003-01-04      NaN  2100942000      1    emp
9   1506051.861585 2003-01-04      NaN  2100942000      1    emp
10  1204191.605090 2003-01-04      NaN  5412841000      1    emp
11  1204191.605090 2003-01-04      NaN  5412841000      1    emp
12  1204191.605090 2003-01-04      NaN  5412841000      1    emp
13  1765953.711812 2003-01-05      NaN           0      1    NaN
14  1765953.711812 2003-01-05      NaN           0      1    NaN
15  1434858.212964 2003-01-05      NaN           0      1    emp
16  1434858.212964 2003-01-05      NaN           0      1    emp
17  1434858.212964 2003-01-05      NaN           0      1    emp
18  1811326.258197 2003-01-05      NaN           0      1    emp
19  1811326.258197 2003-01-05      NaN           0      1    emp
20  1908483.149300 2003-01-05      NaN           0      1    NaN
21  1908483.149300 2003-01-05      NaN           0      1    NaN
22  4190110.086256 2003-01-06      NaN  1298934000      1    NaN
23  6241047.457860 2003-01-07      NaN           0      1    NaN
24  6241047.457860 2003-01-07      NaN           0      1    NaN
25  6241047.457860 2003-01-07      NaN           0      1    NaN
26  6241047.457860 2003-01-07      NaN           0      1    NaN
27  4614396.137509 2003-01-08      NaN   715231400      1    emp
28  4614396.137509 2003-01-08      NaN   715231400      1    emp
29  4614396.137509 2003-01-08      NaN   715231400      1    emp
..             ...        ...      ...         ...    ...    ...
0              NaN 2004-01-01  1985886         NaN      1    NaN
1              NaN 2004-02-01  1990082         NaN      1    NaN
2              NaN 2004-03-01  1999936         NaN      1    NaN
3              NaN 2004-04-01  2009556         NaN      1    NaN
4              NaN 2004-05-01  2009573         NaN      1    NaN
5              NaN 2004-06-01  2013057         NaN      1    NaN
6              NaN 2004-07-01  2019963         NaN      1    NaN
7              NaN 2004-08-01  2015320         NaN      1    NaN
8              NaN 2004-09-01  2015103         NaN      1    NaN
9              NaN 2004-10-01  2035705         NaN      1    NaN
10             NaN 2004-11-01  2043152         NaN      1    NaN
11             NaN 2004-12-01  2041339         NaN      1    NaN
12             NaN 2005-01-01  2011219         NaN      1    NaN
13             NaN 2005-02-01  2014928         NaN      1    NaN
14             NaN 2005-03-01  2028597         NaN      1    NaN
15             NaN 2013-10-01   340483         NaN      2    NaN
16             NaN 2013-11-01   338445         NaN      2    NaN
17             NaN 2013-12-01   336903         NaN      2    NaN
18             NaN 2014-01-01   334565         NaN      2    NaN
19             NaN 2014-02-01   334667         NaN      2    NaN
20             NaN 2014-03-01   335922         NaN      2    NaN
21             NaN 2014-04-01   337188         NaN      2    NaN
22             NaN 2014-05-01   343958         NaN      2    NaN
23             NaN 2014-06-01   349122         NaN      2    NaN
24             NaN 2014-07-01   354911         NaN      2    NaN
25             NaN 2014-08-01   350833         NaN      2    NaN
26             NaN 2014-09-01   344849         NaN      2    NaN
27             NaN 2014-10-01   341434         NaN      2    NaN
28             NaN 2014-11-01   339866         NaN      2    NaN
29             NaN 2014-12-01   339203         NaN      2    NaN

[90 rows x 6 columns]
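
The call that produced final_df is not shown. The restarting row labels are consistent with an append that did not pass ignore_index. Since DataFrame.append was removed in pandas 2.0, a sketch of the equivalent step with pd.concat, on two rows from each frame, could look like this (sort=True is an assumption made to reproduce the alphabetical column order in the output):

```python
import pandas as pd

flattenednewdf = pd.DataFrame({
    "state": [1, 1],
    "date": pd.to_datetime(["2003-01-01", "2003-01-04"]),
    "status": ["emp", "emp"],
    "shopping": [0, 2100942000],
    "TUFNWGTP": [3227364.873298, 1506051.861585],
})
flattenedfoo = pd.DataFrame({
    "state": [1, 1],
    "date": pd.to_datetime(["2004-01-01", "2004-02-01"]),
    "foo": [1985886, 1990082],
})

# Stack the rows; columns missing on one side are filled with NaN.
# Row labels are kept, so they restart at 0 for the second frame.
final_df = pd.concat([flattenednewdf, flattenedfoo], sort=True)
print(final_df)
```

Stacking rows this way never places a foo value on the same row as a status or shopping value, which matters for the grouped results that follow.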

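Building time bins is new to me, but to use the method you provided I had to set the index back to the date column. I created a new data frame because much of this process was experimental and I didn't want to rebuild the old one: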

final_df_2 = final_df.set_index(['date'])

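From this point on, you should be able to run any calculation you want. I ran a few below based on your code, but the problem is that we are grouping selectively, so the results look odd: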

In [187]: doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}

In [188]: aggASS = final_df_2.groupby([pd.TimeGrouper("2AS", label='left')]).agg(doWhat)
In [189]: aggASS
Out[189]: 
                       foo     shopping      TUFNWGTP
date                                                 
2003-01-01  2014889.333333  23885035200  1.139995e+08
2005-01-01  2018248.000000          NaN           NaN
2013-01-01   341489.933333  14664313000  2.237165e+08

In [190]: aggASS = final_df_2.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)

In [191]: aggASS
Out[191]: 
                         foo     shopping      TUFNWGTP
state date       status                                
1     2003-01-01 emp     NaN  22586101200  3.162246e+07
2     2013-01-01 emp     NaN   1055719000  1.389769e+08
                 unemp   NaN            0  2.333405e+07
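
A likely reason foo is all NaN in the three-key grouping: the appended foo rows have NaN in status, and groupby drops rows whose key is NaN by default, so only rows without a foo value remain in each group. A small sketch on made-up rows illustrates the effect:

```python
import numpy as np
import pandas as pd

# Two original rows (no foo) and one appended foo row (no status).
df = pd.DataFrame({
    "state": [1, 1, 1],
    "status": ["emp", "emp", np.nan],
    "shopping": [5.0, 7.0, np.nan],
    "foo": [np.nan, np.nan, 1985886.0],
})

# The row with NaN status is excluded, so foo has nothing to average.
out = df.groupby(["state", "status"]).agg({"shopping": "sum", "foo": "mean"})
print(out)
```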

I read another post about grouping with the cut method. You can read it here - Grouping data by value ranges. I think you could use datetime object operations to build the 2-year buckets.
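
As a rough illustration of the datetime-arithmetic idea, the sketch below maps each date to the first day of its 2-year bucket. The variable start_year is a hypothetical parameter (assumed here to be 2003, the first year in the data), and the sample dates are taken from the frames above:

```python
import pandas as pd

dates = pd.Series(
    pd.to_datetime(["2003-01-03", "2004-12-31", "2005-01-01", "2013-08-27"])
)

start_year = 2003  # assumed first year of the first bucket

# Round each year down to the start of its 2-year bucket.
bucket_year = dates.dt.year - (dates.dt.year - start_year) % 2

# Turn the bucket year back into a timestamp usable as a groupby key.
bucket = pd.to_datetime(bucket_year.astype(str) + "-01-01")
print(bucket)
```

The resulting series can be assigned as a column and used as a plain groupby key alongside state and status.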

Answer 1: (score: 1)

Here is the relevant part of @kennes913's answer, as an overview for future visitors:

import pandas as pd
from pandas.testing import assert_frame_equal

# flatten the data frames. For overview, just select one column each
df1flat = df.reset_index()[['state', 'date', 'TUFNWGTP']]
df2flat = df_emp.reset_index()[['state', 'date', 'foo']]
# the "merge": stack the rows (DataFrame.append was removed in pandas 2.0)
X = pd.concat([df1flat, df2flat])
# now, recover the original data frames:
test1 = X.loc[X.foo.notna(), ['state', 'date', 'foo']]
# fix dtype which was lost in the merge
test1['state'] = test1['state'].astype(int)

# filter on the column that was actually selected above (TUFNWGTP, not TUCASEID)
test2 = X.loc[X.TUFNWGTP.notna(), ['state', 'date', 'TUFNWGTP']]
# check if nothing was lost; assert_frame_equal returns None on success.
# bar and foo are not defined in this excerpt; presumably they are the
# original flattened frames.
assert_frame_equal(bar, test1)
assert_frame_equal(foo, test2)