Question

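How can I work with a multi-level DataFrame indexed by datetime, like the one below? This is downloaded financial data. The hard part is getting inside the frame and accessing non-adjacent rows of a specific inner level without explicitly specifying the outer-level date, because I have thousands of such rows.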

                                ABC  DEF  GHI
Date                STATS
2012-07-19 00:00:00             NaN  NaN  NaN
                    investment    4    9   13
                    price         5    8    1
                    quantity     12    9    8

So the two formulas I am looking for can be summarized as:

X(today row) = quantity(prior row)*price(prior row) 
or                           
X(today row) = quantity(prior row)*price(today)

The difficulty is how to express access to these rows with numpy or pandas when the index is multi-level and the rows are not adjacent.

In the end I would like to arrive at this:

                                         ABC        DEF        GHI    XN
Date                STATS                                            
2012-07-19 00:00:00                    NaN         NaN         NaN   
                    investment          4            9          13    X1
                    price               5            8           1   
                    quantity            12           9           8    

2012-07-18 00:00:00                    NaN         NaN         NaN   
                    investment          1             2          3    X2
                    price               2             3          4   
                    quantity           18             6          7    

X1= (18*2)+(6*3)+(7*4) (quantity_day_2 *price_day_2 data) 
or for the other formula
X1= (18*5)+(6*8)+(7*1) (quantity_day_2 *price_day_1 data)
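
As a quick sanity check of the arithmetic above, here is a minimal sketch that evaluates both formulas with numpy. The array names are illustrative only and the values are copied from the two sample dates.

```python
import numpy as np

# Values copied from the two sample dates in the question.
price_day_1 = np.array([5, 8, 1])      # 2012-07-19 (today)
price_day_2 = np.array([2, 3, 4])      # 2012-07-18 (prior row)
quantity_day_2 = np.array([18, 6, 7])  # 2012-07-18 (prior row)

# Dot product = sum of element-wise products across ABC, DEF, GHI.
x1_prior_price = int(quantity_day_2 @ price_day_2)  # (18*2)+(6*3)+(7*4)
x1_today_price = int(quantity_day_2 @ price_day_1)  # (18*5)+(6*8)+(7*1)
print(x1_prior_price, x1_today_price)  # 82 145
```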

Can I use groupby?

Answer 1

You can use:

import pandas as pd

#sample data with extra dates added for better testing
print (df)
                        ABC  DEF   GHI
Date       STATS                      
2012-07-19              NaN  NaN   NaN
           investment   4.0  9.0  13.0
           price        5.0  8.0   1.0
           quantity    12.0  9.0   8.0
2012-07-18              NaN  NaN   NaN
           investment   1.0  2.0   3.0
           price        2.0  3.0   4.0
           quantity    18.0  6.0   7.0
2012-07-17              NaN  NaN   NaN
           investment   1.0  2.0   3.0
           price        0.0  1.0   4.0
           quantity     5.0  1.0   0.0
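
The answer prints `df` but does not show how it was built. One possible way to reconstruct the sample frame is sketched below. It assumes the blank `STATS` label is an empty string, which the printed output cannot confirm.

```python
import numpy as np
import pandas as pd

# Outer level: dates in the order they are printed; inner level: STATS labels.
# Assumption: the blank STATS label is an empty string.
dates = pd.to_datetime(['2012-07-19', '2012-07-18', '2012-07-17'])
stats = ['', 'investment', 'price', 'quantity']
index = pd.MultiIndex.from_product([dates, stats], names=['Date', 'STATS'])

nan = [np.nan] * 3
data = [
    nan, [4, 9, 13], [5, 8, 1], [12, 9, 8],   # 2012-07-19
    nan, [1, 2, 3], [2, 3, 4], [18, 6, 7],    # 2012-07-18
    nan, [1, 2, 3], [0, 1, 4], [5, 1, 0],     # 2012-07-17
]
df = pd.DataFrame(data, index=index, columns=['ABC', 'DEF', 'GHI'])
print(df)
```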

#sort the MultiIndex (lexsorted), which slicing with IndexSlice requires
df.sort_index(inplace=True)

#select data and remove last level, because:
#1. need shift
#2. easier working
idx = pd.IndexSlice
p = df.loc[idx[:,'price'],:]
p.index = p.index.droplevel(-1)
q = df.loc[idx[:,'quantity'],:]
q.index = q.index.droplevel(-1)
print (p)
            ABC  DEF  GHI
Date                     
2012-07-17  0.0  1.0  4.0
2012-07-18  2.0  3.0  4.0
2012-07-19  5.0  8.0  1.0

print (q)
             ABC  DEF  GHI
Date                      
2012-07-17   5.0  1.0  0.0
2012-07-18  18.0  6.0  7.0
2012-07-19  12.0  9.0  8.0
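
As an alternative to `IndexSlice` followed by `droplevel`, `DataFrame.xs` can select one label from the inner level and drop that level in a single step. This is only a sketch of that variant; the sample frame is rebuilt from the printed output, with the blank `STATS` label assumed to be an empty string.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the printed output (blank STATS label assumed to be '').
dates = pd.to_datetime(['2012-07-17', '2012-07-18', '2012-07-19'])
index = pd.MultiIndex.from_product(
    [dates, ['', 'investment', 'price', 'quantity']], names=['Date', 'STATS'])
nan = [np.nan] * 3
df = pd.DataFrame(
    [nan, [1, 2, 3], [0, 1, 4], [5, 1, 0],
     nan, [1, 2, 3], [2, 3, 4], [18, 6, 7],
     nan, [4, 9, 13], [5, 8, 1], [12, 9, 8]],
    index=index, columns=['ABC', 'DEF', 'GHI'])

# xs selects one label from the STATS level and drops that level by default,
# leaving a plain DatetimeIndex.
p = df.xs('price', level='STATS')
q = df.xs('quantity', level='STATS')
print(p * q)
```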

print (p * q)
             ABC   DEF   GHI
Date                        
2012-07-17   0.0   1.0   0.0
2012-07-18  36.0  18.0  28.0
2012-07-19  60.0  72.0   8.0

print ((p * q).sum(axis=1).to_frame().rename(columns={0:'col1'}))
             col1
Date             
2012-07-17    1.0
2012-07-18   82.0
2012-07-19  140.0

#shift the price dates back by one day, so each date's quantity is multiplied by the next day's price
print (p.shift(-1, freq='D') * q)
             ABC   DEF  GHI
Date                       
2012-07-16   NaN   NaN  NaN
2012-07-17  10.0   3.0  0.0
2012-07-18  90.0  48.0  7.0
2012-07-19   NaN   NaN  NaN

print ((p.shift(-1, freq='D') * q).sum(axis=1).to_frame().rename(columns={0:'col2'}))
             col2
Date             
2012-07-16    0.0
2012-07-17   13.0
2012-07-18  145.0
2012-07-19    0.0
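
The shifted result above is labeled with the prior date (145.0 appears on 2012-07-18). If the value should sit on today's row instead, as X1 does in the question, one possible variant is to shift `quantity` forward by one row. The sketch below assumes the frame is sorted by date in ascending order; the names `prior_q`, `x_prior_price` and `x_today_price` are illustrative. A positional `shift(1)` takes the previous available row, so gaps such as weekends do not matter, and `min_count=1` leaves the first date as NaN rather than 0.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the printed output (blank STATS label assumed to be '').
dates = pd.to_datetime(['2012-07-17', '2012-07-18', '2012-07-19'])
index = pd.MultiIndex.from_product(
    [dates, ['', 'investment', 'price', 'quantity']], names=['Date', 'STATS'])
nan = [np.nan] * 3
df = pd.DataFrame(
    [nan, [1, 2, 3], [0, 1, 4], [5, 1, 0],
     nan, [1, 2, 3], [2, 3, 4], [18, 6, 7],
     nan, [4, 9, 13], [5, 8, 1], [12, 9, 8]],
    index=index, columns=['ABC', 'DEF', 'GHI']).sort_index()

p = df.xs('price', level='STATS')
q = df.xs('quantity', level='STATS')

# Positional shift: each date now holds the quantity of the previous row.
prior_q = q.shift(1)

# X(today) = quantity(prior row) * price(prior row)
x_prior_price = (prior_q * p.shift(1)).sum(axis=1, min_count=1)
# X(today) = quantity(prior row) * price(today)
x_today_price = (prior_q * p).sum(axis=1, min_count=1)

result = pd.DataFrame({'x_prior_price': x_prior_price,
                       'x_today_price': x_today_price})
print(result)
```

For 2012-07-19 this gives 82.0 and 145.0, matching the two X1 values worked out in the question.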

Answer 2

If you need to add the output to the original DataFrame, it is more complicated:

print (df)
                        ABC  DEF   GHI
Date       STATS                      
2012-07-19              NaN  NaN   NaN
           investment   4.0  9.0  13.0
           price        5.0  8.0   1.0
           quantity    12.0  9.0   8.0
2012-07-18              NaN  NaN   NaN
           investment   1.0  2.0   3.0
           price        2.0  3.0   4.0
           quantity    18.0  6.0   7.0
2012-07-17              NaN  NaN   NaN
           investment   1.0  2.0   3.0
           price        0.0  1.0   4.0
           quantity     5.0  1.0   0.0

df.sort_index(inplace=True)

#rename value in level to investment - align data in final concat
idx = pd.IndexSlice
p = df.loc[idx[:,'price'],:].rename(index={'price':'investment'})
q = df.loc[idx[:,'quantity'],:].rename(index={'quantity':'investment'})
print (p)
                       ABC  DEF  GHI
Date       STATS                    
2012-07-17 investment  0.0  1.0  4.0
2012-07-18 investment  2.0  3.0  4.0
2012-07-19 investment  5.0  8.0  1.0

print (q)
                        ABC  DEF  GHI
Date       STATS                     
2012-07-17 investment   5.0  1.0  0.0
2012-07-18 investment  18.0  6.0  7.0
2012-07-19 investment  12.0  9.0  8.0

#multiply and concat to original df
print (p * q)
                        ABC   DEF   GHI
Date       STATS                       
2012-07-17 investment   0.0   1.0   0.0
2012-07-18 investment  36.0  18.0  28.0
2012-07-19 investment  60.0  72.0   8.0

a = (p * q).sum(axis=1).rename('col1')
print (pd.concat([df, a], axis=1))
                        ABC  DEF   GHI   col1
Date       STATS                             
2012-07-17              NaN  NaN   NaN    NaN
           investment   1.0  2.0   3.0    1.0
           price        0.0  1.0   4.0    NaN
           quantity     5.0  1.0   0.0    NaN
2012-07-18              NaN  NaN   NaN    NaN
           investment   1.0  2.0   3.0   82.0
           price        2.0  3.0   4.0    NaN
           quantity    18.0  6.0   7.0    NaN
2012-07-19              NaN  NaN   NaN    NaN
           investment   4.0  9.0  13.0  140.0
           price        5.0  8.0   1.0    NaN
           quantity    12.0  9.0   8.0    NaN
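
A possible alternative to renaming the level values is to compute the per-date sum on a plain date index and then rebuild a two-level index that points at the `investment` rows. Column assignment aligns on the index, so all other rows receive NaN. This is a sketch under the same assumptions about the sample frame (blank `STATS` label taken to be an empty string); `value` and `out` are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the printed output (blank STATS label assumed to be '').
dates = pd.to_datetime(['2012-07-17', '2012-07-18', '2012-07-19'])
index = pd.MultiIndex.from_product(
    [dates, ['', 'investment', 'price', 'quantity']], names=['Date', 'STATS'])
nan = [np.nan] * 3
df = pd.DataFrame(
    [nan, [1, 2, 3], [0, 1, 4], [5, 1, 0],
     nan, [1, 2, 3], [2, 3, 4], [18, 6, 7],
     nan, [4, 9, 13], [5, 8, 1], [12, 9, 8]],
    index=index, columns=['ABC', 'DEF', 'GHI'])

# Per-date sum of price * quantity on a plain DatetimeIndex.
value = (df.xs('price', level='STATS') * df.xs('quantity', level='STATS')).sum(axis=1)

# Re-attach the inner level so every value targets the 'investment' row of its date.
value.index = pd.MultiIndex.from_product(
    [value.index, ['investment']], names=df.index.names)

out = df.copy()
out['col1'] = value  # aligned on the MultiIndex; unmatched rows become NaN
print(out)
```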

#shift with freq on a MultiIndex is not supported yet - first create a DatetimeIndex with unstack,
#then shift, and finally reshape back to the original layout with stack

#multiply and concat to original df
print (p.unstack().shift(-1, freq='D').stack() * q)
                        ABC   DEF  GHI
Date       STATS                      
2012-07-16 investment   NaN   NaN  NaN
2012-07-17 investment  10.0   3.0  0.0
2012-07-18 investment  90.0  48.0  7.0
2012-07-19 investment   NaN   NaN  NaN

b = (p.unstack().shift(-1, freq='D').stack() * q).sum(axis=1).rename('col2')
print (pd.concat([df, b], axis=1))
                        ABC  DEF   GHI   col2
Date       STATS                             
2012-07-16 investment   NaN  NaN   NaN    0.0
2012-07-17              NaN  NaN   NaN    NaN
           investment   1.0  2.0   3.0   13.0
           price        0.0  1.0   4.0    NaN
           quantity     5.0  1.0   0.0    NaN
2012-07-18              NaN  NaN   NaN    NaN
           investment   1.0  2.0   3.0  145.0
           price        2.0  3.0   4.0    NaN
           quantity    18.0  6.0   7.0    NaN
2012-07-19              NaN  NaN   NaN    NaN
           investment   4.0  9.0  13.0    0.0
           price        5.0  8.0   1.0    NaN
           quantity    12.0  9.0   8.0    NaN
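
Note that `pd.concat` takes the union of both indexes, which is why an extra 2012-07-16 row appears in the output above. If that row is unwanted, one option is to assign the series as a column instead, because column assignment keeps only the labels already present in `df`. The sketch below builds `b` with `xs` rather than `unstack`/`stack`, purely to keep the example short; the sample frame is rebuilt under the same assumption that the blank `STATS` label is an empty string.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the printed output (blank STATS label assumed to be '').
dates = pd.to_datetime(['2012-07-17', '2012-07-18', '2012-07-19'])
index = pd.MultiIndex.from_product(
    [dates, ['', 'investment', 'price', 'quantity']], names=['Date', 'STATS'])
nan = [np.nan] * 3
df = pd.DataFrame(
    [nan, [1, 2, 3], [0, 1, 4], [5, 1, 0],
     nan, [1, 2, 3], [2, 3, 4], [18, 6, 7],
     nan, [4, 9, 13], [5, 8, 1], [12, 9, 8]],
    index=index, columns=['ABC', 'DEF', 'GHI'])

p = df.xs('price', level='STATS')
q = df.xs('quantity', level='STATS')

# Same calculation as col2: quantity of a date times the next day's price.
b = (p.shift(-1, freq='D') * q).sum(axis=1).rename('col2')
b.index = pd.MultiIndex.from_product(
    [b.index, ['investment']], names=df.index.names)

out = df.copy()
out['col2'] = b  # labels missing from df (2012-07-16) are dropped on assignment
print(out)
```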

How to access previous rows in a MultiIndex pandas DataFrame
