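I want to determine the correlation between cases and rate for each country, excluding every row where either value (cases or rate) is 0, because those rows are outliers and not relevant.
I have a loop in which I run country_df.corr(), and its output contains what I want. If I could grab the specific value I'm after, I could store it in a list along with the country name; that is what I'm trying to do. I just don't know how to extract a specific value from the correlation matrix.
I would then pick the entries from that list whose value is greater than 0.5 or, possibly, less than -0.5. The relationship is expected to be inverse: as the vaccination rate goes up, we expect measles cases to go down.
Here is the loop code: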
df = df2.unstack().fillna(0)
for country in df.columns.get_level_values(0).unique():
    # Keep only the columns that belong to this country
    country_df = df[[c for c in df.columns if c[0] == country]]
    # Drop every row where Cases or Rate is 0
    for c in [c for c in country_df.columns if c[1] in ['Cases', 'Rate']]:
        country_df = country_df[country_df[c] > 0]
    # Instead of printing the whole correlation matrix here, I just want to
    # store the country name and the cases/rate correlation
    print(country_df.corr())
Any help is appreciated.
Here is the code that creates the data frame:
df2 = pd.DataFrame({u'Afghanistan': {(2000L, 'Cases'): 6532.0,
(2000L, 'Pop'): 19702000.0,
(2000L, 'Rate'): 27.0,
(2001L, 'Cases'): 8762.0,
(2001L, 'Pop'): 20641600.0,
(2001L, 'Rate'): 37.0,
(2002L, 'Cases'): 2486.0,
(2002L, 'Pop'): 21581200.0,
(2002L, 'Rate'): 35.0,
(2003L, 'Cases'): 798.0,
(2003L, 'Pop'): 22520800.0,
(2003L, 'Rate'): 39.0,
(2004L, 'Cases'): 466.0,
(2004L, 'Pop'): 23460400.0,
(2004L, 'Rate'): 48.0,
(2005L, 'Cases'): 1296.0,
(2005L, 'Pop'): 24400000.0,
(2005L, 'Rate'): 50.0},
u'Albania': {(2000L, 'Cases'): 662.0,
(2000L, 'Pop'): 3122000.0,
(2000L, 'Rate'): 95.0,
(2001L, 'Cases'): 18.0,
(2001L, 'Pop'): 3114000.0,
(2001L, 'Rate'): 95.0,
(2002L, 'Cases'): 16.0,
(2002L, 'Pop'): 3106000.0,
(2002L, 'Rate'): 96.0,
(2003L, 'Cases'): 8.0,
(2003L, 'Pop'): 3098000.0,
(2003L, 'Rate'): 93.0,
(2004L, 'Cases'): 7.0,
(2004L, 'Pop'): 3090000.0,
(2004L, 'Rate'): 96.0,
(2005L, 'Cases'): 6.0,
(2005L, 'Pop'): 3082000.0,
(2005L, 'Rate'): 97.0},
u'Algeria': {(2000L, 'Cases'): 0.0,
(2000L, 'Pop'): 31184000.0,
(2000L, 'Rate'): 80.0,
(2001L, 'Cases'): 2686.0,
(2001L, 'Pop'): 31600800.0,
(2001L, 'Rate'): 83.0,
(2002L, 'Cases'): 5862.0,
(2002L, 'Pop'): 32017600.0,
(2002L, 'Rate'): 81.0,
(2003L, 'Cases'): 15374.0,
(2003L, 'Pop'): 32434400.0,
(2003L, 'Rate'): 84.0,
(2004L, 'Cases'): 3289.0,
(2004L, 'Pop'): 32851200.0,
(2004L, 'Rate'): 81.0,
(2005L, 'Cases'): 2302.0,
(2005L, 'Pop'): 33268000.0,
(2005L, 'Rate'): 83.0},
u'Andorra': {(2000L, 'Cases'): 2.0,
(2000L, 'Pop'): 65000.0,
(2000L, 'Rate'): 97.0,
(2001L, 'Cases'): 5.0,
(2001L, 'Pop'): 68200.0,
(2001L, 'Rate'): 97.0,
(2002L, 'Cases'): 1.0,
(2002L, 'Pop'): 71400.0,
(2002L, 'Rate'): 98.0,
(2003L, 'Cases'): 0.0,
(2003L, 'Pop'): 74600.0,
(2003L, 'Rate'): 96.0,
(2004L, 'Cases'): 0.0,
(2004L, 'Pop'): 77800.0,
(2004L, 'Rate'): 98.0,
(2005L, 'Cases'): 0.0,
(2005L, 'Pop'): 81000.0,
(2005L, 'Rate'): 94.0},
u'Angola': {(2000L, 'Cases'): 2219.0,
(2000L, 'Pop'): 15059000.0,
(2000L, 'Rate'): 36.0,
(2001L, 'Cases'): 9046.0,
(2001L, 'Pop'): 15629800.0,
(2001L, 'Rate'): 65.0,
(2002L, 'Cases'): 11945.0,
(2002L, 'Pop'): 16200600.0,
(2002L, 'Rate'): 66.0,
(2003L, 'Cases'): 1196.0,
(2003L, 'Pop'): 16771400.0,
(2003L, 'Rate'): 52.0,
(2004L, 'Cases'): 29.0,
(2004L, 'Pop'): 17342200.0,
(2004L, 'Rate'): 52.0,
(2005L, 'Cases'): 258.0,
(2005L, 'Pop'): 17913000.0,
(2005L, 'Rate'): 32.0}})
Answer 0 (score: 1)
For the data you provided, the correlation matrix always has the same ordering, so it looks like this:
                Angola
                 Cases       Pop      Rate
Angola Cases  1.000000 -0.500364  0.779077
       Pop   -0.500364  1.000000 -0.274885
       Rate   0.779077 -0.274885  1.000000
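So you can use .iloc[] to select the value you want: row 0, column 2 is the Cases/Rate correlation. Just create a dictionary (or a list, or whatever you prefer) before the loop and store each country with its value in it.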
corr_dict = {}
df = df2.unstack().fillna(0)
for country in df.columns.get_level_values(0).unique():
    country_df = df[[c for c in df.columns if c[0] == country]]
    # Drop every row where Cases or Rate is 0
    for c in [c for c in country_df.columns if c[1] in ['Cases', 'Rate']]:
        country_df = country_df[country_df[c] > 0]
    # Row 0 is Cases, column 2 is Rate
    corr_dict[country] = country_df.corr().iloc[0, 2]
corr_dict
#{'Afghanistan': -0.6404117984998553,
# 'Albania': -0.12115398350489878,
# 'Algeria': 0.5031318694416725,
# 'Andorra': -0.6933752452815364,
# 'Angola': 0.779077493398456}
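The positional lookup relies on the columns always being ordered Cases, Pop, Rate. As a possible alternative, here is a minimal sketch that looks the two columns up by name and then applies the 0.5 threshold mentioned in the question. It is written for Python 3, the helper name `cases_rate_corr` is made up for illustration, and only the Cases and Rate values of three countries are copied from the question's data, so treat it as a sketch under those assumptions rather than a drop-in replacement.

```python
import pandas as pd

YEARS = [2000, 2001, 2002, 2003, 2004, 2005]

# Cases and Rate for three countries, copied from the question's data.
# Pop is left out because it is not needed for this correlation.
countries = {
    'Albania': pd.DataFrame({'Cases': [662.0, 18.0, 16.0, 8.0, 7.0, 6.0],
                             'Rate': [95.0, 95.0, 96.0, 93.0, 96.0, 97.0]},
                            index=YEARS),
    'Andorra': pd.DataFrame({'Cases': [2.0, 5.0, 1.0, 0.0, 0.0, 0.0],
                             'Rate': [97.0, 97.0, 98.0, 96.0, 98.0, 94.0]},
                            index=YEARS),
    'Angola': pd.DataFrame({'Cases': [2219.0, 9046.0, 11945.0, 1196.0, 29.0, 258.0],
                            'Rate': [36.0, 65.0, 66.0, 52.0, 52.0, 32.0]},
                           index=YEARS),
}


def cases_rate_corr(country_df):
    """Hypothetical helper: Pearson correlation of Cases vs Rate, skipping zero rows."""
    # Keep only the rows where both Cases and Rate are strictly positive
    nonzero = country_df[(country_df['Cases'] > 0) & (country_df['Rate'] > 0)]
    # Series.corr returns a single number, so no matrix indexing is needed
    return nonzero['Cases'].corr(nonzero['Rate'])


corr_dict = {name: cases_rate_corr(frame) for name, frame in countries.items()}

# Countries whose correlation is stronger than 0.5 in either direction
strong = {name: r for name, r in corr_dict.items() if abs(r) > 0.5}

# Countries showing the expected inverse relationship
inverse = {name: r for name, r in strong.items() if r < -0.5}
```

With the unstacked `df` from the loop, `df[country]` should return a sub-frame whose columns are Cases, Pop and Rate (assuming the columns form a two-level MultiIndex), so `cases_rate_corr(df[country])` could replace the `.iloc[0, 2]` lookup without depending on column order.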