Unstack a DataFrame keeping only the relevant columns

Time: 2016-05-20 09:01:27

Tags: python pandas dataframe

I have the following DataFrame:

import pandas as pd

data = {'year': [2010, 2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2013],
        'store_number': ['1944', '1945', '1946', '1947', '1948', '1949', '1947', '1948', '1949', '1947'],
        'retailer_name': ['Walmart', 'Walmart', 'CRV', 'CRV', 'CRV', 'Walmart', 'Walmart', 'CRV', 'CRV', 'CRV'],
        'product': ['a', 'b', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'c'],
        'amount': [5, 5, 8, 6, 1, 5, 10, 6, 12, 11],
        'vat': [0.5, 0.5, 0.8, 0.6, 0.1, 0.5, 0.10, 0.6, 0.12, 0.11]}

stores = pd.DataFrame(data, columns=['retailer_name', 'store_number', 'year', 'product', 'amount', 'vat'])
stores.set_index(['retailer_name', 'store_number', 'year', 'product'], inplace=True)

# Sum per (retailer, store, year, product), then pivot 'product' into the columns.
df = stores.groupby(level=[0, 1, 2, 3]).sum().unstack('product')

# Fill missing values in the 'amount' block only; the 'vat' block keeps its NaNs.
mask = pd.IndexSlice['amount', :]
df.loc[:, mask] = df.loc[:, mask].fillna(0)

I get the following output:

                                amount           vat           
product                              a   b   c     a    b     c
retailer_name store_number year                                
CRV           1946         2011      8   0   0  0.80  NaN   NaN
              1947         2012      6   0   0  0.60  NaN   NaN
                           2013      0   0  11   NaN  NaN  0.11
              1948         2011      6   1   0  0.60  0.1   NaN
              1949         2012     12   0   0  0.12  NaN   NaN
Walmart       1944         2010      5   0   0  0.50  NaN   NaN
              1945         2010      0   5   0   NaN  0.5   NaN
              1947         2010      0  10   0   NaN  0.1   NaN
              1949         2012      5   0   0  0.50  NaN   NaN

I don't need the vat columns in the final result. How can I drop them from the unstacked DataFrame?

1 answer:

Answer 0: (score: 1)

This works for me: select the amount block after unstacking, then fill the missing values.

df = stores.groupby(level=[0, 1, 2, 3]).sum().unstack('product')

df = df['amount'].fillna(0)
print(df)
product                             a     b     c
retailer_name store_number year                  
CRV           1946         2011   8.0   0.0   0.0
              1947         2012   6.0   0.0   0.0
                           2013   0.0   0.0  11.0
              1948         2011   6.0   1.0   0.0
              1949         2012  12.0   0.0   0.0
Walmart       1944         2010   5.0   0.0   0.0
              1945         2010   0.0   5.0   0.0
              1947         2010   0.0  10.0   0.0
              1949         2012   5.0   0.0   0.0
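
The selection above works because indexing MultiIndex columns with a top-level key returns only the block under that key and drops that level. The sketch below illustrates this on a small made-up frame; the names `toy`, `selected`, and `dropped` are hypothetical and not part of the question. It also contrasts the selection with `drop(..., level=0)`, which removes `vat` but keeps the `amount` level in the column labels.

```python
import pandas as pd

# Toy frame with two-level columns, mimicking the unstacked result (values are made up).
cols = pd.MultiIndex.from_product([['amount', 'vat'], ['a', 'b']], names=[None, 'product'])
toy = pd.DataFrame([[8, 0, 0.8, None], [6, 1, 0.6, 0.1]], columns=cols)

# Selecting the top-level key returns only that block and drops the outer level.
selected = toy['amount']
print(list(selected.columns))   # flat labels

# Dropping 'vat' on level 0 also removes it, but the 'amount' level stays in the labels.
dropped = toy.drop(columns='vat', level=0)
print(list(dropped.columns))    # two-element tuples
```

If the flat `a`, `b`, `c` header is the goal, the selection is the shorter route; `drop` is useful when other top-level blocks should survive alongside `amount`.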

All together in one statement:

df = stores.groupby(level=[0, 1, 2, 3]).sum().unstack('product')['amount'].fillna(0)
print(df)

product                             a     b     c
retailer_name store_number year                  
CRV           1946         2011   8.0   0.0   0.0
              1947         2012   6.0   0.0   0.0
                           2013   0.0   0.0  11.0
              1948         2011   6.0   1.0   0.0
              1949         2012  12.0   0.0   0.0
Walmart       1944         2010   5.0   0.0   0.0
              1945         2010   0.0   5.0   0.0
              1947         2010   0.0  10.0   0.0
              1949         2012   5.0   0.0   0.0

Another solution is to select the amount column before calling sum, so vat is never aggregated or unstacked:

df = stores.groupby(level=[0, 1, 2, 3])['amount'].sum().unstack('product').fillna(0)
print(df)
product                             a     b     c
retailer_name store_number year                  
CRV           1946         2011   8.0   0.0   0.0
              1947         2012   6.0   0.0   0.0
                           2013   0.0   0.0  11.0
              1948         2011   6.0   1.0   0.0
              1949         2012  12.0   0.0   0.0
Walmart       1944         2010   5.0   0.0   0.0
              1945         2010   0.0   5.0   0.0
              1947         2010   0.0  10.0   0.0
              1949         2012   5.0   0.0   0.0
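
As a quick sanity check, the two routes can be compared directly. The sketch below rebuilds the question's data and asserts that both give the same table. It also shows a variant that is not in the answer: passing `fill_value=0` to `unstack`, which fills the gaps during the reshape instead of with a separate `fillna`. The names `via_select`, `via_column`, and `via_fill` are illustrative, and `pd.testing.assert_frame_equal` assumes a reasonably recent pandas.

```python
import pandas as pd

data = {'year': [2010, 2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2013],
        'store_number': ['1944', '1945', '1946', '1947', '1948', '1949', '1947', '1948', '1949', '1947'],
        'retailer_name': ['Walmart', 'Walmart', 'CRV', 'CRV', 'CRV', 'Walmart', 'Walmart', 'CRV', 'CRV', 'CRV'],
        'product': ['a', 'b', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'c'],
        'amount': [5, 5, 8, 6, 1, 5, 10, 6, 12, 11],
        'vat': [0.5, 0.5, 0.8, 0.6, 0.1, 0.5, 0.10, 0.6, 0.12, 0.11]}
stores = pd.DataFrame(data).set_index(['retailer_name', 'store_number', 'year', 'product'])

# Route 1: aggregate everything, unstack, then keep only the 'amount' block.
via_select = stores.groupby(level=[0, 1, 2, 3]).sum().unstack('product')['amount'].fillna(0)

# Route 2: aggregate only 'amount', then unstack.
via_column = stores.groupby(level=[0, 1, 2, 3])['amount'].sum().unstack('product').fillna(0)

# Both routes should produce the same table.
pd.testing.assert_frame_equal(via_select, via_column)

# Variant (an addition, not from the answer): fill missing cells during the unstack itself.
via_fill = stores.groupby(level=[0, 1, 2, 3])['amount'].sum().unstack('product', fill_value=0)
pd.testing.assert_frame_equal(via_fill, via_column, check_dtype=False)
print(via_fill)
```

Selecting `amount` before `sum` avoids aggregating and reshaping `vat` at all, which is likely the cheaper choice on large frames.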