如何处理Pandas中的2列并使用新列名创建新数据框

时间:2018-03-01 03:42:56

标签: python pandas

我想在my_list中计算每列相互之间的协方差。该公式位于函数def covariance_formula(...):

我的代码如下:

#!/usr/bin/python3

import pandas as pd
import numpy as np

my_list = ['A', 'B', 'C', 'D', 'E']

def create_df():
    return pd.DataFrame(np.random.randint(0,100,size=(5, 5)).astype(float), columns=my_list)


def iterate_list(df):
    for i in range(len(my_list)):
        for j in range(i + 1, len(my_list)):
            column_one = my_list[i]
            column_two = my_list[j]
            col_name = column_one + " vs." + column_two

            column_1_value = df[df.columns[df.columns.str.startswith(column_one)]]
            column_2_value = df[df.columns[df.columns.str.startswith(column_two)]]
            column_1_mean = df[df.columns[df.columns.str.startswith(column_one)]].mean(axis=0)
            column_2_mean = df[df.columns[df.columns.str.startswith(column_two)]].mean(axis=0)
            df2[col_name] = covariance_formula(column_1_value, column_2_value, column_1_mean, column_2_mean)

    return df2


def covariance_formula(a, b, mean_a, mean_b):
    covar = (a - mean_a) * (b - mean_b)
    return covar


def main():
    df = create_df()
    # print(df)               ## see OUTPUT A 
    df2 = iterate_list(df)    ## <<< THIS IS WHERE I AM HAVING MY PROBLEM
    # print(df2)              ## see EXPECTED OUTPUT B
    print(df2)


if __name__ == "__main__":
    main()

问题:

如何创建一个新的df df2,其输出将在 EXPECTED OUTPUT B 中?有没有更快的方法呢?

当前问题:

我面临的当前问题是我似乎无法摆脱这个:

  

NameError:未定义名称'df2'

我尝试过的事情:

输出A

      A     B     C     D     E
0  87.0  92.0  66.0   8.0  67.0
1  84.0  18.0   9.0  80.0  41.0
2  38.0  24.0  53.0  25.0  14.0
3  87.0  25.0  19.0   5.0   0.0
4  91.0  69.0  55.0  14.0  90.0

预期输出B

    A vs.B  A vs.C  A vs.D  A vs.E  B vs.C   B vs.D  B vs.E  C vs.D C vs.E  D vs.E
0    445.4   245.8  -176.6   236.2  1187.8   -853.8  1141.4  -471.0  629.8  -452.6
1   -182.2  -207.2   353.8    -9.2   866.6  -1479.4    38.6 -1683.0   44.0   -75.0
2    851.0  -496.4    55.2  1119.0  -272.2     30.2   613.4   -17.6 -357.8    39.8
3   -197.8  -205.4  -205.4  -407.0   440.8    440.8   873.4   458.0  907.4   907.4 
4    318.2   198.6  -168.6   647.4   341.6   -290.2  1113.8  -181.0  695.0  -590.2

1 个答案:

答案 0 :(得分:2)

如果您使用itertools.combinations()dict comprehension来构建列,则可以更轻松地执行此操作:

代码:

def build_covars(covar_df):
    columns = {i + " vs." + j: covariance_formula(covar_df[i], covar_df[j])
               for i, j in it.combinations(covar_df.columns, 2)}
    return pd.concat(columns, axis=1)

测试代码:

import itertools as it
import pandas as pd

def build_covars(covar_df):
    columns = {i + " vs." + j: covariance_formula(covar_df[i], covar_df[j])
               for i, j in it.combinations(covar_df.columns, 2)}
    return pd.concat(columns, axis=1)

def covariance_formula(a, b):
    return (a - a.mean()) * (b - b.mean())

my_list = ['A', 'B', 'C', 'D', 'E']

def create_df():
    return pd.DataFrame(
        np.random.randint(0, 100, size=(5, 5)).astype(float),
        columns=my_list)

df = create_df()
print(build_covars(df))

结果:

    A vs.B  A vs.C  A vs.D   A vs.E  B vs.C  B vs.D  B vs.E  C vs.D  C vs.E  \
0    52.48   49.92  -43.52   323.84   63.96  -55.76  414.92  -53.04  394.68   
1   127.68  123.12  184.68    18.24  120.96  181.44   17.92  174.96   17.28   
2   175.48  124.12  -17.12    98.44   47.56   -6.56   37.72   -4.64   26.68   
3    10.08 -127.68  -57.12  -280.56  -18.24   -8.16  -40.08  103.36  507.68   
4  1370.88  437.92   85.68  1113.84  264.96   51.84  673.92   16.56  215.28   

   D vs.E  
0 -344.08  
1   25.92  
2   -3.68  
3  227.12  
4   42.12