Pandas correlation with groupby

Time: 2019-09-19 05:43:37

Tags: python pandas

I have the following dataframe in pandas

 code    tank     var     nozzle_1    nozzle_2     nozzle_3    nozzle_tank
 123     1        23.34   12.23       54.56        12.22       11 
 123     1        22.32   11.32       7.89         3.45        12 
 123     1        21.22   19.93       5.54         5.66        12
 123     1        21.34   12.23       54.56        22.22       14
 123     1        32.32   13.32       4.89         32.45       34
 123     1        32.22   29.93       23.54        23.66       33
 123     2        23.34   12.23       54.56        12.22       21
 123     2        22.32   11.32       7.89         3.45        22
 123     2        21.22   19.93       5.54         5.66        21
 123     2        21.34   12.23       54.56        22.22       21
 123     2        32.32   13.32       4.89         32.45       22
 123     2        32.22   29.93       23.54        23.66       21  

I want to compute the correlation of nozzle_1, nozzle_2, nozzle_3 and nozzle_4 with the var column, separately for each tank

My desired dataframe is

 code   tank    nozzle_1    nozzle_2    nozzle_3    nozzle_4    
 123    1       0.08        0.01        0.02        0.01
 123    2       0.07        0.01        0.02        0.02

This is what I am trying in pandas

cols= df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)] 
cols= np.array(cols)
var_col = 'var'
tank = 'tank'
def corrVar(df, cols, var_col, tank):
        final_df = pd.DataFrame()
        for col in nozzles_to_scale:
            corrs = (df[[col, tank]].groupby(tank).corrwith(df.var_col ).reset_index())
            final_df = final_df.join(corrs)
        return final_df

But it does not seem to work. How can this be done in pandas?

    test =  corrVar(df, cols, var, tank)

1 Answer:

Answer 0: (score: 1)

You can use:

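import pandas as pd

# Keep only columns named nozzle_<number>; nozzle_tank does not match the pattern
cols = df.columns[df.columns.str.contains(pat=r'nozzle_\d+$', regex=True)]
var_col = 'var'
tank = 'tank'

def corrVar(df, cols, var_col, tank):
    # One single-column result per nozzle: its correlation with var_col inside each tank
    final_df = [df[[col, tank]].groupby(tank).corrwith(df[var_col]) for col in cols]
    return pd.concat(final_df, axis=1)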

print (corrVar(df, cols, var_col, tank))
      nozzle_1  nozzle_2  nozzle_3
tank                              
1     0.501164 -0.309435  0.761017
2     0.501164 -0.309435  0.761017
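
As a possible alternative, the per-tank correlations can also be computed in one pass by calling DataFrame.corrwith inside each group. This is only a sketch under the assumption that the data is exactly the sample from the question; the names nozzle_cols and result are illustrative and do not come from the answer above.

```python
import pandas as pd

# Sample data from the question: the six rows of tank 1 are repeated for tank 2
# (nozzle_tank is omitted because it is not used in the correlation).
block = {
    "var":      [23.34, 22.32, 21.22, 21.34, 32.32, 32.22],
    "nozzle_1": [12.23, 11.32, 19.93, 12.23, 13.32, 29.93],
    "nozzle_2": [54.56, 7.89, 5.54, 54.56, 4.89, 23.54],
    "nozzle_3": [12.22, 3.45, 5.66, 22.22, 32.45, 23.66],
}
df = pd.DataFrame({name: values * 2 for name, values in block.items()})
df.insert(0, "tank", [1] * 6 + [2] * 6)
df.insert(0, "code", 123)

# Illustrative name: the nozzle_<number> columns to correlate with var
nozzle_cols = ["nozzle_1", "nozzle_2", "nozzle_3"]

# Within each tank, correlate every nozzle column with var (Pearson by default)
result = (
    df.groupby("tank")[nozzle_cols + ["var"]]
      .apply(lambda part: part[nozzle_cols].corrwith(part["var"]))
)
print(result)
```

Selecting the columns before apply keeps the grouping column out of each sub-frame, so the result has one row per tank and one column per nozzle, the same layout as the output above.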

EDIT: Solution for the correlation over every N rows within each group:

N = 3
# Number the rows inside each tank, then integer-divide to label blocks of N rows
g = df.groupby('tank').cumcount() // N

cols = df.columns[df.columns.str.contains(pat=r'nozzle_\d+$', regex=True)]
var_col = 'var'
tank = 'tank'
code = 'code'

def corrVar(df, cols, var_col, tank, g):
    # https://stackoverflow.com/a/48570300
    # For each nozzle column, correlate it with var_col inside every (block, tank) group
    final_df = [df.groupby([g, tank]).apply(lambda x: x[col].corr(x[var_col]))
                for col in cols]
    return pd.concat(final_df, axis=1, keys=cols)

print (corrVar(df, cols, var_col, tank, g))
        nozzle_1  nozzle_2  nozzle_3
  tank                              
0 1    -0.826376  0.876202  0.703793
  2    -0.826376  0.876202  0.703793
1 1     0.540176 -0.931286  0.614626
  2     0.540176 -0.931286  0.614626

Checking the groups:

print (df.assign(groups=g))
    code  tank    var  nozzle_1  nozzle_2  nozzle_3  nozzle_tank  groups
0    123     1  23.34     12.23     54.56     12.22           11       0
1    123     1  22.32     11.32      7.89      3.45           12       0
2    123     1  21.22     19.93      5.54      5.66           12       0
3    123     1  21.34     12.23     54.56     22.22           14       1
4    123     1  32.32     13.32      4.89     32.45           34       1
5    123     1  32.22     29.93     23.54     23.66           33       1
6    123     2  23.34     12.23     54.56     12.22           21       0
7    123     2  22.32     11.32      7.89      3.45           22       0
8    123     2  21.22     19.93      5.54      5.66           21       0
9    123     2  21.34     12.23     54.56     22.22           21       1
10   123     2  32.32     13.32      4.89     32.45           22       1
11   123     2  32.22     29.93     23.54     23.66           21       1
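
The block labels in the groups column come from integer-dividing a per-tank row counter by N. A small sketch of that step on its own, assuming only a tank column with six rows per tank (the frame demo and the name chunk are made up for illustration):

```python
import pandas as pd

# Hypothetical minimal frame: only the grouping column matters for this step
demo = pd.DataFrame({"tank": [1] * 6 + [2] * 6})

N = 3
# cumcount() numbers rows 0, 1, 2, ... within each tank;
# floor division by N turns that counter into a block label
chunk = demo.groupby("tank").cumcount() // N
print(chunk.tolist())  # [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```

Because the counter restarts for every tank, block 0 of tank 1 and block 0 of tank 2 share the same label, which is why the correlation groups by both the block label and tank.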

EDIT:

The function can be reduced to a single statement:

def corrVar(df, cols, var_col, tank, g):
    return pd.concat([df.groupby([g, tank]).apply(lambda x: x[col].corr(x[var_col]))
                      for col in cols], axis=1, keys=cols)
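
For a version that runs on its own, here is a rough end-to-end sketch of the blockwise correlation. It assumes the sample data from the question and swaps the per-column loop for DataFrame.corrwith inside each group; nozzle_cols, chunk and result are illustrative names, not part of the original code.

```python
import pandas as pd

# Sample data from the question: the six rows of tank 1 are repeated for tank 2
block = {
    "var":      [23.34, 22.32, 21.22, 21.34, 32.32, 32.22],
    "nozzle_1": [12.23, 11.32, 19.93, 12.23, 13.32, 29.93],
    "nozzle_2": [54.56, 7.89, 5.54, 54.56, 4.89, 23.54],
    "nozzle_3": [12.22, 3.45, 5.66, 22.22, 32.45, 23.66],
}
df = pd.DataFrame({name: values * 2 for name, values in block.items()})
df.insert(0, "tank", [1] * 6 + [2] * 6)

N = 3
nozzle_cols = ["nozzle_1", "nozzle_2", "nozzle_3"]

# Label consecutive blocks of N rows inside each tank
chunk = (df.groupby("tank").cumcount() // N).rename("chunk")

# Correlate every nozzle column with var inside each (chunk, tank) group
result = (
    df.groupby([chunk, "tank"])[nozzle_cols + ["var"]]
      .apply(lambda part: part[nozzle_cols].corrwith(part["var"]))
)
print(result)
```

With only N = 3 rows per block, each correlation rests on three points, so the values swing widely between blocks; a larger N gives steadier estimates.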