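I loaded a CSV dataset into a DataFrame. I want to show the strongest correlations between columns (the top 10 negative and the top 10 positive).
I found code on this site that I thought would help: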
    def get_redundant_pairs(df):
        '''Get diagonal and lower triangular pairs of correlation matrix'''
        pairs_to_drop = set()
        cols = df.columns
        for i in range(0, df.shape[1]):
            for j in range(0, i+1):
                pairs_to_drop.add((cols[i], cols[j]))
        return pairs_to_drop

    def get_top_abs_correlations(df, n=5):
        au_corr = df.corr().abs().unstack()
        labels_to_drop = get_redundant_pairs(df)
        au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
        return au_corr[0:n]
I call this function on my DataFrame:

    import pandas as pd

    train = pd.read_csv('/content/drive/My Drive/DSF_HW3_Datasets/train.csv')
    get_top_abs_correlations(train.loc[:, train.columns != 'Id'], 10)

I get a KeyError:
    KeyError: 'Foundation'

    During handling of the above exception, another exception occurred:
    ....
    /usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
       2404
       2405         if keylen == self.nlevels and self.is_unique:
    -> 2406             return self._engine.get_loc(key)
       2407
       2408         # -- partial selection or non-unique index

    pandas/_libs/index.pyx in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc()

    KeyError: ('Foundation', 'OverallQual')
How do I fix this error? The train.csv file: https://pastebin.com/vTh6md5W
Answer 0 (score: 0)
You want to mask the correlation matrix and then take the largest values:
    import numpy as np

    # get the correlation matrix (df is the DataFrame loaded from the CSV)
    corr = df.corr()
    # boolean mask that is True only strictly above the diagonal
    mask = np.triu(np.ones_like(corr), 1) == 1
    # keep the upper triangle (excluding the diagonal) and stack it into a Series
    corr = corr.where(mask).stack()
    # 10 largest by absolute value
    max10 = corr.abs().nlargest(10)
Output (max10):

    GarageCars    GarageArea      0.882475
    YearBuilt     GarageYrBlt     0.825667
    GrLivArea     TotRmsAbvGrd    0.825489
    TotalBsmtSF   1stFlrSF        0.819530
    OverallQual   SalePrice       0.790982
    GrLivArea     SalePrice       0.708624
    2ndFlrSF      GrLivArea       0.687501
    BedroomAbvGr  TotRmsAbvGrd    0.676620
    BsmtFinSF1    BsmtFullBath    0.649212
    YearRemodAdd  GarageYrBlt     0.642277
    dtype: float64
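
The KeyError in the question most likely arises because the correlation matrix only covers numeric columns (older pandas versions silently drop the others), while `get_redundant_pairs` builds its labels from every column in `df.columns`. `drop` is then asked to remove pairs such as `('Foundation', 'OverallQual')` that never existed in the index. The masking approach avoids this because it works only with the labels of the correlation matrix itself. Below is a minimal, self-contained sketch of that idea; the function name `upper_triangle_correlations` and the toy data are hypothetical, and selecting numeric columns explicitly is an assumption added so that the sketch also runs on newer pandas versions.

```python
import numpy as np
import pandas as pd

def upper_triangle_correlations(df):
    """Return each column pair's correlation once, indexed by (col_a, col_b)."""
    # Keep numeric columns only, so the labels match the correlation matrix
    corr = df.select_dtypes(include="number").corr()
    # True strictly above the diagonal, False on and below it
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Masked-out cells become NaN; dropna() removes them after stacking
    return corr.where(mask).stack().dropna()

# Hypothetical toy data with one non-numeric column
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 0],
    "kind": ["x", "y", "x", "y", "x"],
})

pairs = upper_triangle_correlations(toy)
print(pairs.abs().nlargest(2))
```

The non-numeric column `kind` never appears in the result, so there is no label mismatch to trigger a KeyError.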
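To get the original (signed) correlations:

    corr.loc[max10.index]

These happen to be identical to the absolute values, since all ten of the strongest correlations are positive.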
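
Since the question asks for the top positive and the top negative correlations separately, one option is to skip `abs()` and call `nlargest` and `nsmallest` on the signed values. The following is a sketch under that assumption; `top_signed_correlations` and the toy data are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def top_signed_correlations(df, n=10):
    """Return (n most positive, n most negative) pairwise correlations."""
    # Numeric columns only, so non-numeric columns cannot cause label mismatches
    corr = df.select_dtypes(include="number").corr()
    # Keep each pair once: strictly upper triangle of the matrix
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Signed values: largest are the most positive, smallest the most negative
    return pairs.nlargest(n), pairs.nsmallest(n)

# Hypothetical toy data: a and b rise together, c falls as they rise
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 0],
})

top_pos, top_neg = top_signed_correlations(toy, n=1)
print(top_pos)
print(top_neg)
```

On the real data this would be called as `top_signed_correlations(train.loc[:, train.columns != 'Id'], 10)`. Note that `nsmallest` simply returns the lowest values, so if fewer than `n` pairs are negatively correlated, the second result will also contain weakly positive pairs.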