这是我数据集的一部分。
Country Australia Belgium
gdp wage gdp wage
2006-01-01 00:00:00 745,522,000,000.00 23,826.64 409,813,000,000.00 20,228.74
2007-01-01 00:00:00 851,963,000,000.00 24,616.84 471,821,000,000.00 20,486.16
2008-01-01 00:00:00 1,052,580,000,000.00 24,185.70 518,626,000,000.00 20,588.93
2009-01-01 00:00:00 926,448,000,000.00 24,496.84 484,553,000,000.00 21,284.21
2010-01-01 00:00:00 1,144,260,000,000.00 24,373.76 483,548,000,000.00 20,967.05
我想找到两个国家的“ gdp”列和“ wage”列的相关性。
我尝试使用,
df.corr()
但是输出结果为空。
预期输出可以是这样的:
Country Correlation
Australia 1.0
Belgium 0.98
(相关性的值不准确。显示此仅供参考。)
我可以运行哪些代码来实现此结果?
编辑: 执行生产线
print(df.columns)
产生这样的输出
MultiIndex(levels=[['Australia', 'Belgium', 'Brazil', 'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Czech Republic', 'Estonia', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Israel', 'Japan', 'Korea', 'Latvia', 'Lithuania', 'Luxembourg', 'Mexico', 'Netherlands', 'New Zealand', 'Poland', 'Portugal', 'Russian Federation', 'Slovak Republic', 'Slovenia', 'Spain', 'Turkey', 'United Kingdom', 'United States'], ['gdp', 'wage']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 28, 29, 29, 30, 30, 31, 31], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['Country', None])
答案 0 :(得分:2)
首先使用replace
将列转换为数字,然后转换为float
,然后使用DataFrame.xs
用DataFrame.corrwith
选择级别进行关联:
#if create DataFrame from file
#df = pd.read_csv(file, header=[0,1], thousands=',')
df = df.replace(',','', regex=True).astype(float)
s = df.xs('gdp', axis=1, level=1).corrwith(df.xs('wage', axis=1, level=1))
print (s)
Australia 0.325915
Belgium 0.521564
dtype: float64
为DataFrame添加最后一个reset_index
:
df1 = s.reset_index()
df1.columns = ['Country','Correlation']
print (df1)
Country Correlation
0 Australia 0.325915
1 Belgium 0.521564
详细信息:
print (df.xs('gdp', axis=1, level=1))
Australia Belgium
2006-01-01 00:00:00 7.455220e+11 4.098130e+11
2007-01-01 00:00:00 8.519630e+11 4.718210e+11
2008-01-01 00:00:00 1.052580e+12 5.186260e+11
2009-01-01 00:00:00 9.264480e+11 4.845530e+11
2010-01-01 00:00:00 1.144260e+12 4.835480e+11