Why does numpy.corrcoef() return NaN values?
I am working with high-dimensional data, so I cannot go through every datum to check the values.
# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
# Standardise
X_std = StandardScaler().fit_transform(df.values)
print(X_std.dtype) # Returns "float64"
# Correlation
cor_mat1 = np.corrcoef(X_std.T)
cor_mat1.max() # Returns nan
So cor_mat1.max() returns nan.
When computing cor_mat1 = np.corrcoef(X_std.T) I get the following warning:
/Users/kimrants/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3183: RuntimeWarning: invalid value encountered in true_divide
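To try to fix it myself, I started by removing all-zero columns and any columns that contain NaN values. I thought this would solve the problem, but it did not. Am I missing something? I do not understand why it still returns NaN values.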
My end goal is to compute the eigenvalues and eigenvectors.
Answer 0 (score: 1)
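If a column has the same value in every row, the variance of that column is 0. np.corrcoef() normalises each covariance by the product of the two standard deviations, so for such a column it divides by 0. With the default numpy settings this does not raise an error, only the warning invalid value encountered in true_divide. The correlation coefficients for these columns are set to nan: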
import numpy as np
print(np.divide(0,0))
C:\Users\anaconda3\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in true_divide
"""Entry point for launching an IPython kernel.
nan
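The same effect can be reproduced with np.corrcoef() itself. The following is a minimal sketch on a made-up 3x3 array (the data and the variable names X, corr and nan_features are illustrative assumptions, not taken from the question); it also shows one way to find which features are responsible without inspecting every value by hand:

```python
import numpy as np

# Hypothetical toy data: three observations of three features.
# The middle column is constant, so its variance is 0.
X = np.array([[1.0, 5.0, 2.0],
              [2.0, 5.0, 1.0],
              [3.0, 5.0, 4.0]])

# Silence the RuntimeWarning locally; the NaN entries still appear.
with np.errstate(invalid="ignore", divide="ignore"):
    corr = np.corrcoef(X.T)

# A zero-variance feature has NaN on the diagonal of the correlation matrix.
nan_features = np.flatnonzero(np.isnan(np.diag(corr)))
print(nan_features)   # indices of the offending features
print(corr.max())     # NaN propagates through max()
```

Checking the diagonal is enough here because a feature with non-zero variance always has a correlation of 1 with itself.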
Dropping all columns for which Series.nunique() == 1 should solve your problem.
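As a rough sketch of that suggestion, assuming the data is in a pandas DataFrame with no NaN values left (the toy frame and the names df_var, eig_vals and eig_vecs are hypothetical), the constant columns can be filtered with DataFrame.nunique() before computing the correlation matrix. Since the stated goal is the eigen-decomposition, the sketch ends with np.linalg.eigh, which is suited to symmetric matrices such as a correlation matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is constant.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [7.0, 7.0, 7.0, 7.0],
    "c": [2.0, 1.0, 4.0, 3.0],
})

# Keep only the columns that have more than one distinct value.
df_var = df.loc[:, df.nunique() > 1]

# Correlation matrix of the remaining features (one row per feature after .T).
corr = np.corrcoef(df_var.values.T)

# The correlation matrix is symmetric, so eigh can be used;
# it returns the eigenvalues in ascending order.
eig_vals, eig_vecs = np.linalg.eigh(corr)
print(list(df_var.columns))
print(eig_vals)
```

Note that nunique() ignores NaN by default, so this filter is meant to run after the NaN columns have been dropped, as in the question.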
Answer 1 (score: 0)
For reasons I cannot explain, the following solved the problem:
# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
# Keep track of index / columns to reproduce dataframe
cols = df.columns
index = df.index
# Standardise
X_std = StandardScaler().fit_transform(df.values)
X_std = StandardScaler().fit_transform(X_std)
print(X_std.dtype) # Returns "float64"
# Turn to dataFrame again to drop values easier
df = pd.DataFrame(data=X_std, columns= cols, index=index)
# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
Standardising twice in a row works, but it is strange.