Finding the correlation in a dataset with multi-level columns

Asked: 2019-01-03 13:52:02

Tags: python-3.x pandas

This is part of my dataset.

Country              Australia                        Belgium
                                      gdp       wage                 gdp       wage
2006-01-01 00:00:00    745,522,000,000.00  23,826.64  409,813,000,000.00  20,228.74
2007-01-01 00:00:00    851,963,000,000.00  24,616.84  471,821,000,000.00  20,486.16
2008-01-01 00:00:00  1,052,580,000,000.00  24,185.70  518,626,000,000.00  20,588.93
2009-01-01 00:00:00    926,448,000,000.00  24,496.84  484,553,000,000.00  21,284.21
2010-01-01 00:00:00  1,144,260,000,000.00  24,373.76  483,548,000,000.00  20,967.05

I want to find the correlation between the "gdp" column and the "wage" column for each of the two countries.

I tried using

df.corr()

but the output is empty.

The expected output would look something like this:

  Country    Correlation
Australia            1.0
  Belgium           0.98

(The correlation values are not accurate; they are shown only for illustration.)

What code can I run to get this result?

Edit: Running the line

print(df.columns)

produces this output:

MultiIndex(levels=[['Australia', 'Belgium', 'Brazil', 'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Czech Republic', 'Estonia', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Israel', 'Japan', 'Korea', 'Latvia', 'Lithuania', 'Luxembourg', 'Mexico', 'Netherlands', 'New Zealand', 'Poland', 'Portugal', 'Russian Federation', 'Slovak Republic', 'Slovenia', 'Spain', 'Turkey', 'United Kingdom', 'United States'], ['gdp', 'wage']],
       labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 28, 29, 29, 30, 30, 31, 31], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
       names=['Country', None])

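1 answer:

Answer 0 (score: 2)

The values are strings containing thousands separators, so the columns are not numeric, which is presumably why df.corr() comes back empty. First use replace to strip the commas and convert the result to float, then select each second-level column with DataFrame.xs and correlate the two selections with DataFrame.corrwith: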

# If you create the DataFrame from a file, the thousands separators can be parsed on read:
# df = pd.read_csv(file, header=[0, 1], thousands=',')

df = df.replace(',','', regex=True).astype(float)
s = df.xs('gdp', axis=1, level=1).corrwith(df.xs('wage', axis=1, level=1))
print (s)
Australia    0.325915
Belgium      0.521564
dtype: float64
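
For anyone who wants to reproduce this without the original file, here is a minimal sketch that rebuilds the five sample rows from the question by hand and then applies the same replace / xs / corrwith steps. The way the frame is constructed (MultiIndex.from_product, string values with commas) is an assumption about how the data is stored, and the variable names numeric, gdp and wage are only illustrative.

```python
import pandas as pd

# Rebuild the sample from the question. The values are kept as strings with
# thousands separators, which is assumed to match the original frame.
columns = pd.MultiIndex.from_product(
    [["Australia", "Belgium"], ["gdp", "wage"]], names=["Country", None]
)
index = pd.to_datetime(
    ["2006-01-01", "2007-01-01", "2008-01-01", "2009-01-01", "2010-01-01"]
)
data = [
    ["745,522,000,000.00", "23,826.64", "409,813,000,000.00", "20,228.74"],
    ["851,963,000,000.00", "24,616.84", "471,821,000,000.00", "20,486.16"],
    ["1,052,580,000,000.00", "24,185.70", "518,626,000,000.00", "20,588.93"],
    ["926,448,000,000.00", "24,496.84", "484,553,000,000.00", "21,284.21"],
    ["1,144,260,000,000.00", "24,373.76", "483,548,000,000.00", "20,967.05"],
]
df = pd.DataFrame(data, index=index, columns=columns)

# Strip the commas, then convert every column to float.
numeric = df.replace(",", "", regex=True).astype(float)

# Take the 'gdp' and 'wage' slices of the second column level; each slice has
# one column per country, so corrwith pairs them up country by country.
gdp = numeric.xs("gdp", axis=1, level=1)
wage = numeric.xs("wage", axis=1, level=1)
s = gdp.corrwith(wage)
print(s)
```

The printed Series should match the two values shown above.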

Finally, add reset_index to get a DataFrame:

df1 = s.reset_index()
df1.columns = ['Country','Correlation']
print (df1)
     Country  Correlation
0  Australia     0.325915
1    Belgium     0.521564
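
As a possible shortcut, reset_index on a Series accepts a name argument for the value column, so both column labels can be set in one chain once the index has a name. The sketch below sets the index name explicitly with rename_axis; the Series values are simply copied from the output above rather than recomputed.

```python
import pandas as pd

# Correlations copied from the answer's output, used only as sample input.
s = pd.Series({"Australia": 0.325915, "Belgium": 0.521564})

# Name the index 'Country', then turn it into a column and name the
# value column 'Correlation' in the same call.
df1 = s.rename_axis("Country").reset_index(name="Correlation")
print(df1)
```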

Details:

print (df.xs('gdp', axis=1, level=1))
                        Australia       Belgium
2006-01-01 00:00:00  7.455220e+11  4.098130e+11
2007-01-01 00:00:00  8.519630e+11  4.718210e+11
2008-01-01 00:00:00  1.052580e+12  5.186260e+11
2009-01-01 00:00:00  9.264480e+11  4.845530e+11
2010-01-01 00:00:00  1.144260e+12  4.835480e+11
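
One thing to watch for on the full dataset: the print(df.columns) output in the question suggests that at least one country (Korea) has only a wage column. In that case the gdp slice and the wage slice have different columns, and corrwith reports NaN for the unmatched country unless drop=True is passed. The sketch below illustrates this with made-up numbers and hypothetical country labels "A" and "B"; it is not taken from the real data.

```python
import pandas as pd

# Hypothetical frame: country 'A' has both columns, country 'B' has only 'wage'.
columns = pd.MultiIndex.from_tuples(
    [("A", "gdp"), ("A", "wage"), ("B", "wage")], names=["Country", None]
)
df = pd.DataFrame(
    [[1.0, 2.0, 5.0], [2.0, 4.0, 3.0], [3.0, 6.0, 4.0], [4.0, 8.0, 1.0]],
    columns=columns,
)

gdp = df.xs("gdp", axis=1, level=1)    # columns: A
wage = df.xs("wage", axis=1, level=1)  # columns: A, B

# Default behaviour: countries present on only one side get NaN.
with_nan = gdp.corrwith(wage)
print(with_nan)

# drop=True leaves out the countries that cannot be paired.
paired_only = gdp.corrwith(wage, drop=True)
print(paired_only)
```

Alternatively, calling dropna() on the result would remove the NaN rows afterwards.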