Calculated columns in a MultiIndex

Time: 2018-06-07 21:08:58

Tags: python pandas

I want to insert 2 columns into a DataFrame.

Original DataFrame

card    auth       month   order_number
Amex     A        2017-11       1234
Visa     A        2017-12       2345
Amex     D        2017-12       3456

I want to break down auth_status by month. I used the following code:

bin_month_df = monthly_df.pivot_table(index='card', columns=['month', 'auth'],values='order_number', aggfunc='count')

DataFrame broken down by month

   month         2017-11      2017-12
    auth         A    D       A    D
    card
    mastercard  10    11     11    10
    amex        19    20     10    11
    visa        50    30     50    1
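
For anyone who wants to reproduce the pivot step, here is a minimal sketch. The rows below are made-up sample data in the same shape as the question's frame, not the asker's real data, and `fill_value=0` is an addition that the original call does not use.

```python
import pandas as pd

# Hypothetical sample rows shaped like the question's original DataFrame.
monthly_df = pd.DataFrame({
    "card": ["Amex", "Visa", "Amex", "Visa", "Amex"],
    "auth": ["A", "A", "D", "D", "A"],
    "month": ["2017-11", "2017-12", "2017-12", "2017-12", "2017-11"],
    "order_number": [1234, 2345, 3456, 4567, 5678],
})

# Count orders per card for every observed (month, auth) pair.
# fill_value=0 replaces the NaN left for cards with no rows in a pair.
bin_month_df = monthly_df.pivot_table(
    index="card",
    columns=["month", "auth"],
    values="order_number",
    aggfunc="count",
    fill_value=0,
)
print(bin_month_df)
```

Note that a (month, auth) pair that never occurs in the data gets no column at all, so the pivot can have fewer columns than the full month x auth grid.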

Desired result

I want to add columns for the subtotal and the auth_rate:

       month                   2017-11                       2017-12
        auth         A    D   total    pct           A    D    total  pct
        card
        mastercard  10    11     21    .47           11    10   21    .52
        amex        19    20     39    .49           10    11   21    .47
        visa        50    30     80    .63           50    1    51    .98

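I am having trouble creating these columns. This link shows subtotals by row, but it does not show me how to turn them into columns or how to compute calculated columns.

Any help is appreciated!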

2 answers:

Answer 0: (score: 1)

Use:

#create sum by first level of MultiIndex
df1 = df.sum(axis=1, level=0)
df1.columns = [df1.columns, ['total'] * len(df1.columns)]
print (df1)
month      2017-11 2017-12
             total   total
card                      
mastercard      21      21
amex            39      21
visa            80      51
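
On recent pandas releases the `level` argument of `DataFrame.sum` is no longer available, so the subtotal step needs a different spelling there. The sketch below is one possible equivalent using a transposed `groupby`. The frame is rebuilt by hand from the question's pivoted counts, and the name `totals` is only illustrative.

```python
import pandas as pd

# Rebuild the pivoted counts from the question by hand.
cols = pd.MultiIndex.from_product(
    [["2017-11", "2017-12"], ["A", "D"]], names=["month", "auth"]
)
df = pd.DataFrame(
    [[10, 11, 11, 10], [19, 20, 10, 11], [50, 30, 50, 1]],
    index=pd.Index(["mastercard", "amex", "visa"], name="card"),
    columns=cols,
)

# Transpose so the column levels become row levels, group on "month",
# sum, and transpose back: one subtotal column per month.
totals = df.T.groupby(level="month").sum().T

# Add a second column level so the result lines up with df's layout.
totals.columns = pd.MultiIndex.from_product(
    [totals.columns, ["total"]], names=df.columns.names
)
print(totals)
```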

#select by second level and divide
df2 = df.xs('A', axis=1, level=1).div(df1.xs('total', axis=1, level=1)).round(2)
df2.columns = [df2.columns, ['pct'] * len(df2.columns)]
print (df2)
month      2017-11 2017-12
               pct     pct
card                      
mastercard    0.48    0.52
amex          0.49    0.48
visa          0.62    0.98
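
The division works because `xs` drops the selected level, leaving both operands with plain month labels that `div` can align on. A small self-contained sketch of that idea follows. It computes the total as A + D instead of reusing `df1`, which is a simplification that assumes only these two auth values exist.

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["2017-11", "2017-12"], ["A", "D"]], names=["month", "auth"]
)
df = pd.DataFrame(
    [[10, 11, 11, 10], [19, 20, 10, 11], [50, 30, 50, 1]],
    index=pd.Index(["mastercard", "amex", "visa"], name="card"),
    columns=cols,
)

# xs removes the "auth" level, so each result has one column per month.
approved = df.xs("A", axis=1, level="auth")
declined = df.xs("D", axis=1, level="auth")

# Both frames share the same month labels, so + and div align column-wise.
total = approved + declined
pct = approved.div(total).round(2)
print(pct)
```

One detail worth knowing: `round(2)` turns visa's 50/80 = 0.625 into 0.62, as in the output above, while the question's target table shows .63.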

#join all together, sort MultiIndex
df3 = pd.concat([df, df1, df2], axis=1).sort_index(axis=1)
print (df3)
month      2017-11                 2017-12                
auth             A   D   pct total       A   D   pct total
card                                                      
mastercard      10  11  0.48    21      11  10  0.52    21
amex            19  20  0.49    39      10  11  0.48    21
visa            50  30  0.62    80      50   1  0.98    51
#for custom order reindex by custom MultiIndex
c = df.columns.levels[1].tolist() + ['total', 'pct']
mux = pd.MultiIndex.from_product([df.columns.levels[0], c], names=df.columns.names)
df4 = df3.reindex(columns=mux)
print(df4)
month      2017-11                 2017-12                
auth             A   D total   pct       A   D total   pct
card                                                      
mastercard      10  11    21  0.48      11  10    21  0.52
amex            19  20    39  0.49      10  11    21  0.48
visa            50  30    80  0.62      50   1    51  0.98
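
As an alternative to building three frames and concatenating them, the same layout can be sketched with a plain loop over the months that assigns new columns by tuple key. This is not the answer's method, only a variation; the names `out` and `order` are illustrative.

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["2017-11", "2017-12"], ["A", "D"]], names=["month", "auth"]
)
df = pd.DataFrame(
    [[10, 11, 11, 10], [19, 20, 10, 11], [50, 30, 50, 1]],
    index=pd.Index(["mastercard", "amex", "visa"], name="card"),
    columns=cols,
)

months = df.columns.get_level_values("month").unique()
out = df.copy()
for month in months:
    # df[month] is the sub-frame holding that month's A and D columns.
    total = df[month].sum(axis=1)
    out[(month, "total")] = total
    out[(month, "pct")] = (df[(month, "A")] / total).round(2)

# New columns are appended at the end, so reindex to group them by month.
order = ["A", "D", "total", "pct"]
out = out.reindex(
    columns=pd.MultiIndex.from_product([months, order], names=df.columns.names)
)
print(out)
```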

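Answer 1: (score: 0)

I just tested this on Pandas 0.17.0 with Python 2.7.5, and I can now see why you asked me earlier about reindex(axis=1) and the '*' in front of df1.columns.levels[1]. This is indeed a Pandas and Python version issue. I modified the code to run on the older versions mentioned above, and fixed a potential bug for the case where several common descriptive statistics need to be computed afterwards in the same pivot table. Going forward, mentioning the software versions in your posts (if they are old) will make things easier and lead to fewer misunderstandings: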

import pandas as pd

data = """card    auth   month   order_number
Amex     A        2017-11       1234
Visa     A        2017-12       2345
Amex     D        2017-12       3416
MC       A        2017-12       3426
Visa     A        2017-11       3436
Amex     D        2017-12       3446
Visa     A        2017-11       3466
Amex     D        2017-12       3476
Visa     D        2017-11       3486
"""

# create dataframe from the above sample data
# (named `data` to avoid shadowing the built-in `str`)
df = pd.read_table(pd.io.common.StringIO(data), sep=r'\s+')

# create the pivot_table using the method OP supplied
df1 = df.pivot_table(index='card', columns=['month', 'auth'], values='order_number', aggfunc='count')
print(df1)
# month 2017-11      2017-12     
# auth        A    D       A    D
# card                           
# Amex      1.0  NaN     NaN  3.0
# MC        NaN  NaN     1.0  NaN
# Visa      2.0  1.0     1.0  NaN

# create an empty dataframe with the same index/column layout as df1
# except the level-1 in columns
idx = pd.MultiIndex.from_product([df1.columns.levels[0], ['total', 'avg', 'std', 'pct']], names=df1.columns.names)
df2 = pd.DataFrame(columns=idx, index=df1.index).sort_index(axis=1)

print(df2)
# month 2017-11                 2017-12                
# auth      avg  pct  std total     avg  pct  std total
# card                                                 
# Amex      NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN
# MC        NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN
# Visa      NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN

# Calculate the common stats:
df2.loc[:,(slice(None),'total')] = df1.groupby(level=0, axis=1).sum().values
df2.loc[:,(slice(None),'avg')]   = df1.groupby(level=0, axis=1).mean().values
df2.loc[:,(slice(None),'std')]   = df1.groupby(level=0, axis=1).std().values

# join df2 with df1 and assign the result to df3 (can also overwrite df1): 
df3 = df1.join(df2).sort_index(axis=1)

# calculate `pct` which needs both a calculated field and an original field
# auth-rate = A / total
df3.loc[:,(slice(None),'pct')] = df3.groupby(level=0, axis=1)\
                                    .apply(lambda x: x.loc[:,(slice(None),'A')].values/x.loc[:,(slice(None),'total')].values) \
                                    .values

print(df3)
# month 2017-11                                    2017-12                      
# auth        A   D  avg       pct       std total       A   D avg pct std total
# card                                                                          
# Amex        1 NaN  1.0  1.000000       NaN     1     NaN   3   3 NaN NaN     3
# MC        NaN NaN  NaN       NaN       NaN   NaN       1 NaN   1   1 NaN     1
# Visa        2   1  1.5  0.666667  0.707107     3       1 NaN   1   1 NaN     1

# rounding if needed:
df3.loc[:,(slice(None),'pct')] = df3.loc[:,(slice(None),'pct')].round(decimals=2)
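
The `(slice(None), 'pct')` tuples above select "every month, this level-1 label". `pd.IndexSlice` is a more readable way to write the same selection. The sketch below uses a small hand-made frame with hypothetical values and assumes the columns are lexsorted, which this kind of slicing requires.

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["2017-11", "2017-12"], ["A", "total"]], names=["month", "auth"]
)
df = pd.DataFrame(
    [[1.0, 1.0, 0.0, 3.0], [2.0, 3.0, 1.0, 1.0]],
    index=pd.Index(["Amex", "Visa"], name="card"),
    columns=cols,
)

idx = pd.IndexSlice

# Both spellings pick the "A" column of every month.
approved = df.loc[:, idx[:, "A"]]
same = df.loc[:, (slice(None), "A")]

# Drop the "auth" level on both sides so the division aligns by month.
total = df.loc[:, idx[:, "total"]]
pct = approved.droplevel("auth", axis=1) / total.droplevel("auth", axis=1)
print(pct)
```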

If you want the level-1 columns in a specific order, you can use reindex().

# create an ordered list of level-1 labels for the columns
column_level_1 = list(df1.columns.levels[1]) + ['total', 'avg', 'std', 'pct']
# create MultiIndex for columns and reindex_axis accordingly
midx = pd.MultiIndex.from_product([df1.columns.levels[0], column_level_1], names=df1.columns.names)
df3 = df3.reindex_axis(midx, axis=1)
print(df3)

# month 2017-11                                    2017-12                      
# auth        A   D total  avg       std       pct       A   D total avg std pct
# card                                                                          
# Amex        1 NaN     1  1.0       NaN  1.000000     NaN   3     3   3 NaN NaN
# MC        NaN NaN   NaN  NaN       NaN       NaN       1 NaN     1   1 NaN   1
# Visa        2   1     3  1.5  0.707107  0.666667       1 NaN     1   1 NaN   1
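
`reindex_axis` only exists in old pandas releases such as the 0.17.0 used here. On current versions the same reordering can be sketched with `reindex(columns=...)`. The one-row frame below is a hand-made stand-in for `df3`, and the name `wanted` is illustrative.

```python
import pandas as pd

# Stand-in for df3 after sort_index: level-1 labels in alphabetical order.
cols = pd.MultiIndex.from_product(
    [["2017-11", "2017-12"], ["A", "D", "pct", "total"]],
    names=["month", "auth"],
)
df3 = pd.DataFrame(
    [[10, 11, 0.48, 21, 11, 10, 0.52, 21]],
    index=pd.Index(["mastercard"], name="card"),
    columns=cols,
)

# Desired level-1 order within each month.
wanted = ["A", "D", "total", "pct"]
midx = pd.MultiIndex.from_product([cols.levels[0], wanted], names=cols.names)

# reindex returns a new frame whose columns follow midx exactly.
df4 = df3.reindex(columns=midx)
print(df4)
```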