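I merged two DataFrames that share some common columns but also each have columns the other lacks. I want to apply singular value decomposition (SVD) to the combined DataFrame. However, the way the NaN values are filled affects the result, and filling them with zeros is also wrong, because some columns contain genuine zero values. Here is an example. Is there a way to solve this?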
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(np.random.rand(6, 4), columns=['A', 'B', 'C', 'D'])
>>> df1
          A         B         C         D
0  0.763144  0.752176  0.601228  0.290276
1  0.632144  0.202513  0.111766  0.317838
2  0.494587  0.318276  0.951354  0.051253
3  0.184826  0.429469  0.280297  0.014895
4  0.236955  0.560095  0.357246  0.302688
5  0.729145  0.293810  0.525223  0.744513
>>> df2 = pd.DataFrame(np.random.rand(6, 4), columns=['A', 'B', 'C', 'E'])
>>> df2
          A         B         C         E
0  0.969758  0.650887  0.821926  0.884600
1  0.657851  0.158992  0.731678  0.841507
2  0.923716  0.524547  0.783581  0.268123
3  0.935014  0.219135  0.152794  0.433324
4  0.327104  0.581433  0.474131  0.521481
5  0.366469  0.709115  0.462106  0.416601
>>> df3 = pd.concat([df1,df2], axis=0)
>>> df3
          A         B         C         D         E
0  0.763144  0.752176  0.601228  0.290276       NaN
1  0.632144  0.202513  0.111766  0.317838       NaN
2  0.494587  0.318276  0.951354  0.051253       NaN
3  0.184826  0.429469  0.280297  0.014895       NaN
4  0.236955  0.560095  0.357246  0.302688       NaN
5  0.729145  0.293810  0.525223  0.744513       NaN
0  0.969758  0.650887  0.821926       NaN  0.884600
1  0.657851  0.158992  0.731678       NaN  0.841507
2  0.923716  0.524547  0.783581       NaN  0.268123
3  0.935014  0.219135  0.152794       NaN  0.433324
4  0.327104  0.581433  0.474131       NaN  0.521481
5  0.366469  0.709115  0.462106       NaN  0.416601
>>> U, s, V = np.linalg.svd(df3.values, full_matrices=True)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy-1.11.0b3-py3.4-macosx-10.6-intel.egg/numpy/linalg/linalg.py", line 1359, in svd
    u, s, vt = gufunc(a, signature=signature, extobj=extobj)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy-1.11.0b3-py3.4-macosx-10.6-intel.egg/numpy/linalg/linalg.py", line 99, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge
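Note: I cannot apply interpolation, because I want to preserve the fact that some records have no information for certain columns while other records do.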
Answer 0 (score: 5)
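The SVD of a matrix with missing values can be approximated with an iterative procedure: fill in the missing values with a rough estimate (for example, the column means), compute the SVD of the filled-in matrix, use the reconstruction from that SVD to update the estimates of the missing values, and repeat the last two steps until convergence.
This is a form of the expectation-maximization (EM) algorithm, in which the E step updates the estimates of the missing values from the SVD, and the M step computes the SVD of the updated estimate of the data matrix.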
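Written out as equations, one reading of this procedure is the following. The notation is introduced here for illustration and is not taken from the original answer: $\Omega$ is the set of observed entries, $n$ the number of rows, $k$ the number of singular values kept, $t$ the iteration index, and $\mu^{(0)}_j$ the mean of the observed entries in column $j$.

```latex
\begin{aligned}
\hat{Y}^{(0)}_{ij} &=
  \begin{cases}
    Y_{ij}, & (i,j) \in \Omega \\
    \mu^{(0)}_j, & (i,j) \notin \Omega
  \end{cases} \\
U^{(t)} S^{(t)} V^{(t)\top} &\approx \hat{Y}^{(t)} - \mathbf{1}\,\mu^{(t)\top}
  \quad \text{(rank-}k\text{ SVD, M step)} \\
\hat{Y}^{(t+1)}_{ij} &=
  \begin{cases}
    Y_{ij}, & (i,j) \in \Omega \\
    \bigl(U^{(t)} S^{(t)} V^{(t)\top}\bigr)_{ij} + \mu^{(t)}_j, & (i,j) \notin \Omega
  \end{cases}
  \quad \text{(E step)} \\
\mu^{(t+1)}_j &= \frac{1}{n} \sum_{i=1}^{n} \hat{Y}^{(t+1)}_{ij}
\end{aligned}
```

The iteration stops once the relative change in the trace norm (the sum of the singular values in $S^{(t)}$) between successive steps falls below a tolerance.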
import numpy as np
from scipy.sparse.linalg import svds
from functools import partial


def emsvd(Y, k=None, tol=1E-3, maxiter=None):
    """
    Approximate SVD on data with missing values via expectation-maximization

    Inputs:
    -----------
    Y:       (nobs, ndim) data matrix, missing values denoted by NaN/Inf
    k:       number of singular values/vectors to find (default: k=ndim)
    tol:     convergence tolerance on relative change in trace norm
    maxiter: maximum number of EM steps to perform (default: no limit)

    Returns:
    -----------
    Y_hat:      (nobs, ndim) reconstructed data matrix
    mu_hat:     (1, ndim) estimated column means for reconstructed data
    U, s, Vt:   singular values and vectors (see np.linalg.svd and
                scipy.sparse.linalg.svds for details)
    """
    Y = np.asarray(Y, dtype=float)

    if k is None:
        # full SVD: the reconstruction is exact, so the imputed values
        # stay at the column means; pass k < ndim for low-rank imputation
        svdmethod = partial(np.linalg.svd, full_matrices=False)
    else:
        svdmethod = partial(svds, k=k)
    if maxiter is None:
        maxiter = np.inf

    # initialize the missing values to their respective column means,
    # computed over the finite entries only (Inf also counts as missing)
    valid = np.isfinite(Y)
    mu_hat = np.nanmean(np.where(valid, Y, np.nan), axis=0, keepdims=True)
    Y_hat = np.where(valid, Y, mu_hat)

    halt = False
    ii = 1
    v_prev = None

    while not halt:
        # SVD on filled-in data
        U, s, Vt = svdmethod(Y_hat - mu_hat)
        # impute missing values
        Y_hat[~valid] = (U.dot(np.diag(s)).dot(Vt) + mu_hat)[~valid]
        # update bias parameter
        mu_hat = Y_hat.mean(axis=0, keepdims=True)
        # test convergence using relative change in trace norm; the absolute
        # value keeps a decrease in the norm from halting the loop early
        v = s.sum()
        converged = v_prev is not None and abs(v - v_prev) <= tol * v_prev
        if ii >= maxiter or converged:
            halt = True
        ii += 1
        v_prev = v

    return Y_hat, mu_hat, U, s, Vt
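One way to sanity-check this kind of imputation is to run it on a matrix whose true low-rank structure is known and which is missing the same blocks as `df3` in the question. The sketch below is a condensed stand-in, not the `emsvd` routine itself: `impute_low_rank` is a hypothetical helper that skips the mean-centering and the convergence test and simply runs a fixed number of rank-`k` reconstruction steps, and the rank-1 test matrix is made up for illustration.

```python
import numpy as np


def impute_low_rank(Y, k, n_iter=500):
    """Hypothetical helper: fill NaNs by repeated rank-k SVD reconstruction."""
    Y = np.asarray(Y, dtype=float)
    missing = np.isnan(Y)
    # start from the column means of the observed entries
    col_means = np.nanmean(Y, axis=0, keepdims=True)
    filled = np.where(missing, col_means, Y)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # keep only the k largest singular values/vectors
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k]
        # overwrite only the missing entries; observed data is never changed
        filled[missing] = low_rank[missing]
    return filled


# Illustrative rank-1 matrix with five columns standing in for A..E
Y_true = np.outer(np.linspace(1.0, 2.0, 12), [1.0, 2.0, 0.5, 1.5, 1.0])

# Same missing pattern as df3: "E" absent in the first six rows,
# "D" absent in the last six
Y = Y_true.copy()
Y[:6, 4] = np.nan
Y[6:, 3] = np.nan

filled = impute_low_rank(Y, k=1)
print("max abs error on imputed entries:", np.abs(filled - Y_true).max())
```

Here `k` is set below the number of columns on purpose: if every singular value is kept, the reconstruction reproduces the filled-in matrix exactly and the imputed entries never move away from their initial column means. Random data such as `df3` in the question has no exact low-rank structure, so there the imputed values are only as good as the low-rank assumption.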