如何在Pandas的相关矩阵中找到顶级相关性?关于如何使用R(Show correlations as an ordered list, not as a large matrix或Efficient way to get highly correlated pairs from large data set in Python or R)有很多答案,但我想知道如何用熊猫做到这一点?在我的情况下,矩阵是4460x4460,因此无法直观地进行。
答案 0 :(得分:63)
您可以使用DataFrame.values
获取数据的numpy数组,然后使用NumPy函数(例如argsort()
)来获取最相关的对。
但是如果你想在熊猫中这样做,你可以unstack
和order
数据框:
import pandas as pd
import numpy as np
shape = (50, 4460)
data = np.random.normal(size=shape)
data[:, 1000] += data[:, 2000]
df = pd.DataFrame(data)
c = df.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
print so[-4470:-4460]
这是输出:
2192 1522 0.636198
1522 2192 0.636198
3677 2027 0.641817
2027 3677 0.641817
242 130 0.646760
130 242 0.646760
1171 2733 0.670048
2733 1171 0.670048
1000 2000 0.742340
2000 1000 0.742340
dtype: float64
答案 1 :(得分:26)
@ HYRY的答案是完美的。只需通过添加更多逻辑来避免重复和自我关联以及正确排序,从而建立答案:
import pandas as pd
d = {'x1': [1, 4, 4, 5, 6],
'x2': [0, 0, 8, 2, 4],
'x3': [2, 8, 8, 10, 12],
'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()
print("Correlation Matrix")
print(df.corr())
print()
def get_redundant_pairs(df):
'''Get diagonal and lower triangular pairs of correlation matrix'''
pairs_to_drop = set()
cols = df.columns
for i in range(0, df.shape[1]):
for j in range(0, i+1):
pairs_to_drop.add((cols[i], cols[j]))
return pairs_to_drop
def get_top_abs_correlations(df, n=5):
au_corr = df.corr().abs().unstack()
labels_to_drop = get_redundant_pairs(df)
au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
return au_corr[0:n]
print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))
这给出了以下输出:
Data Frame
x1 x2 x3 x4
0 1 0 2 -1
1 4 0 8 -4
2 4 8 8 -4
3 5 2 10 -4
4 6 4 12 -5
Correlation Matrix
x1 x2 x3 x4
x1 1.000000 0.399298 1.000000 -0.969248
x2 0.399298 1.000000 0.399298 -0.472866
x3 1.000000 0.399298 1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248 1.000000
Top Absolute Correlations
x1 x3 1.000000
x3 x4 0.969248
x1 x4 0.969248
dtype: float64
答案 2 :(得分:15)
没有冗余变量对的几行解决方案:
corr_matrix = df.corr().abs()
#the matrix is symmetric so we need to extract upper triangle matrix without diagonal (k = 1)
sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
.stack()
.sort_values(ascending=False))
#first element of sol series is the pair with the bigest correlation
答案 3 :(得分:7)
结合@HYRY和@arun的某些功能,可以使用以下命令在一行中打印数据帧df
的最高相关性:
df.corr().unstack().sort_values().drop_duplicates()
注意:一个缺点是,如果您具有1.0个相关性,而它们本身不是一个变量,则drop_duplicates()
加法将删除它们
答案 4 :(得分:4)
您可以通过替换数据来根据此简单代码以图形方式进行操作。
corr = df.corr()
kot = corr[corr>=.9]
plt.figure(figsize=(12,8))
sns.heatmap(kot, cmap="Greens")
答案 5 :(得分:2)
使用itertools.combinations
从pandas拥有的相关矩阵.corr()
中获取所有唯一关联,生成列表列表并将其反馈到DataFrame中,以便使用'.sort_values'。设置ascending = True
以在顶部显示最低相关性
corrank
将DataFrame作为参数,因为它需要.corr()
。
def corrank(X):
import itertools
df = pd.DataFrame([[(i,j),X.corr().loc[i,j]] for i,j in list(itertools.combinations(X.corr(), 2))],columns=['pairs','corr'])
print(df.sort_values(by='corr',ascending=False))
corrank(X) # prints a descending list of correlation pair (Max on top)
答案 6 :(得分:2)
使用以下代码以降序查看相关性。
# See the correlations in descending order
corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)
答案 7 :(得分:2)
我最喜欢Addison Klinke的帖子,因为它是最简单的,但我使用Wojciech Moszczyzysk的建议进行过滤和制图,但是扩展了该过滤器以避免绝对值,因此给定一个大的相关矩阵,对其进行过滤,制图,然后压平:
创建,过滤和绘制图表
dfCorr = df.corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr !=1.000)]
plt.figure(figsize=(30,10))
sn.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()
功能
最后,我创建了一个小函数来创建相关矩阵,对其进行过滤,然后对其进行展平。可以很容易地扩展它,例如不对称的上下限等。
def corrFilter(x: pd.DataFrame, bound: float):
xCorr = x.corr()
xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
return xFlattened
corrFilter(df, .7)
答案 8 :(得分:2)
将上面的大部分答案组合成一个简短的片段:
def top_entries(df):
mat = df.corr().abs()
# Remove duplicate and identity entries
mat.loc[:,:] = np.tril(mat.values, k=-1)
mat = mat[mat>0]
# Unstack, sort ascending, and reset the index, so features are in columns
# instead of indexes (allowing e.g. a pretty print in Jupyter).
# Also rename these it for good measure.
return (mat.unstack()
.sort_values(ascending=False)
.reset_index()
.rename(columns={
"level_0": "feature_a",
"level_1": "feature_b",
0: "correlation"
}))
答案 9 :(得分:1)
这里有很多好的答案。我找到的最简单的方法是上面的一些答案的组合。
corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(np.bool))
corr = corr.unstack().transpose()\
.sort_values(by='column', ascending=False)\
.dropna()
答案 10 :(得分:0)
我不想unstack
或使这个问题复杂化,因为在功能选择阶段,我只是想删除一些高度相关的功能。
所以我得到了以下简化的解决方案:
# map features to their absolute correlation values
corr = features.corr().abs()
# set equality (self correlation) as zero
corr[corr == 1] = 0
# of each feature, find the max correlation
# and sort the resulting array in ascending order
corr_cols = corr.max().sort_values(ascending=False)
# display the highly correlated features
display(corr_cols[corr_cols > 0.8])
在这种情况下,如果要删除相关特征,则可以映射经过过滤的corr_cols
数组,并删除奇数索引(或偶数索引)的特征。
答案 11 :(得分:0)
我在这里尝试一些解决方案,但后来我想出了自己的解决方案。我希望这可能对下一个有用,所以我在这里分享:
def sort_correlation_matrix(correlation_matrix):
cor = correlation_matrix.abs()
top_col = cor[cor.columns[0]][1:]
top_col = top_col.sort_values(ascending=False)
ordered_columns = [cor.columns[0]] + top_col.index.tolist()
return correlation_matrix[ordered_columns].reindex(ordered_columns)
答案 12 :(得分:0)
这是@MiFi的改进代码。该顺序为abs,但不排除负值。
def top_correlation (df,n):
corr_matrix = df.corr()
correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
.stack()
.sort_values(ascending=False))
correlation = pd.DataFrame(correlation).reset_index()
correlation.columns=["Variable_1","Variable_2","Correlacion"]
correlation = correlation.reindex(correlation.Correlacion.abs().sort_values(ascending=False).index).reset_index().drop(["index"],axis=1)
return correlation.head(n)
top_correlation(ANYDATA,10)
答案 13 :(得分:0)
以下功能可以解决问题。此实现
,它也是可配置的,因此您既可以保留自相关,也可以保留重复。您还可以根据需要报告尽可能多的功能对。
def get_feature_correlation(df, top_n=None, corr_method='spearman',
remove_duplicates=True, remove_self_correlations=True):
"""
Compute the feature correlation and sort feature pairs based on their correlation
:param df: The dataframe with the predictor variables
:type df: pandas.core.frame.DataFrame
:param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
:param corr_method: Correlation compuation method
:type corr_method: str
:param remove_duplicates: Indicates whether duplicate features must be removed
:type remove_duplicates: bool
:param remove_self_correlations: Indicates whether self correlations will be removed
:type remove_self_correlations: bool
:return: pandas.core.frame.DataFrame
"""
corr_matrix_abs = df.corr(method=corr_method).abs()
corr_matrix_abs_us = corr_matrix_abs.unstack()
sorted_correlated_features = corr_matrix_abs_us \
.sort_values(kind="quicksort", ascending=False) \
.reset_index()
# Remove comparisons of the same feature
if remove_self_correlations:
sorted_correlated_features = sorted_correlated_features[
(sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
]
# Remove duplicates
if remove_duplicates:
sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]
# Create meaningful names for the columns
sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']
if top_n:
return sorted_correlated_features[:top_n]
return sorted_correlated_features