How can I fill the NaN values in a numeric array so that I can apply SVD?

Time: 2016-02-23 12:26:17

Tags: python python-3.x numpy svd

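I combined two data frames that share some columns but also each have columns the other lacks. I want to apply singular value decomposition (SVD) to the combined data frame. However, the way the NaN values are filled affects the result, and filling them with zeros is wrong because zero is a legitimate value in some columns. Here is an example. Is there a way to solve this?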

>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(np.random.rand(6, 4), columns=['A', 'B', 'C', 'D'])
>>> df1
          A         B         C         D
0  0.763144  0.752176  0.601228  0.290276
1  0.632144  0.202513  0.111766  0.317838
2  0.494587  0.318276  0.951354  0.051253
3  0.184826  0.429469  0.280297  0.014895
4  0.236955  0.560095  0.357246  0.302688
5  0.729145  0.293810  0.525223  0.744513
>>> df2 = pd.DataFrame(np.random.rand(6, 4), columns=['A', 'B', 'C', 'E'])
>>> df2
          A         B         C         E
0  0.969758  0.650887  0.821926  0.884600
1  0.657851  0.158992  0.731678  0.841507
2  0.923716  0.524547  0.783581  0.268123
3  0.935014  0.219135  0.152794  0.433324
4  0.327104  0.581433  0.474131  0.521481
5  0.366469  0.709115  0.462106  0.416601
>>> df3 = pd.concat([df1,df2], axis=0)
>>> df3
          A         B         C         D         E
0  0.763144  0.752176  0.601228  0.290276       NaN
1  0.632144  0.202513  0.111766  0.317838       NaN
2  0.494587  0.318276  0.951354  0.051253       NaN
3  0.184826  0.429469  0.280297  0.014895       NaN
4  0.236955  0.560095  0.357246  0.302688       NaN
5  0.729145  0.293810  0.525223  0.744513       NaN
0  0.969758  0.650887  0.821926       NaN  0.884600
1  0.657851  0.158992  0.731678       NaN  0.841507
2  0.923716  0.524547  0.783581       NaN  0.268123
3  0.935014  0.219135  0.152794       NaN  0.433324
4  0.327104  0.581433  0.474131       NaN  0.521481
5  0.366469  0.709115  0.462106       NaN  0.416601
>>> U, s, V = np.linalg.svd(df3.values, full_matrices=True)

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy-1.11.0b3-py3.4-macosx-10.6-intel.egg/numpy/linalg/linalg.py", line 1359, in svd
    u, s, vt = gufunc(a, signature=signature, extobj=extobj)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy-1.11.0b3-py3.4-macosx-10.6-intel.egg/numpy/linalg/linalg.py", line 99, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge

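Note: I cannot apply interpolation, because I want to preserve the fact that some records have no information for a column while other records do.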

1 answer:

Answer 0 (score: 5)

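The SVD of a matrix with missing values can be approximated with an iterative procedure:

  1. Fill in the missing values with a rough approximation (for example, replace them with the column means)
  2. Perform SVD on the filled-in matrix
  3. Reconstruct the data matrix from the SVD to get a better approximation of the missing values
  4. Repeat steps 2-3 until convergence

This is a form of the expectation-maximization (EM) algorithm, where the E step updates the estimates of the missing values from the SVD, and the M step computes the SVD of the updated estimate of the data matrix.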
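
To make steps 1-3 concrete, here is a minimal sketch that traces them on a toy matrix. The data are hypothetical: a rank-1 matrix with one hidden entry whose true value is 6.0. As simplifying assumptions, the sketch fixes the rank at 1, skips mean-centering, and runs a fixed number of passes instead of testing for convergence.

```python
import numpy as np

# Hypothetical toy data: every row is a multiple of [1, 2, 3], so the
# matrix has rank 1. The hidden entry at [1, 2] would be 6.0.
Y = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, np.nan],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])
valid = np.isfinite(Y)

# Step 1: rough fill using the column means of the observed entries.
col_means = np.nanmean(Y, axis=0, keepdims=True)
Y_hat = np.where(valid, Y, col_means)
initial_guess = Y_hat[1, 2]  # mean of 3, 9 and 12

# Steps 2-3, repeated a fixed number of times: take the SVD, rebuild a
# rank-1 approximation, and overwrite only the missing entries with it.
for _ in range(50):
    U, s, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    low_rank = s[0] * np.outer(U[:, 0], Vt[0])
    Y_hat[~valid] = low_rank[~valid]

print("initial guess:", initial_guess)
print("after iterating:", Y_hat[1, 2])
```

The observed entries are never modified; only the estimate of the missing entry moves, starting from the column mean and drifting toward a value consistent with the low-rank structure of the rest of the matrix.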

    import numpy as np
    from scipy.sparse.linalg import svds
    from functools import partial
    
    
    def emsvd(Y, k=None, tol=1E-3, maxiter=None):
        """
        Approximate SVD on data with missing values via expectation-maximization
    
        Inputs:
        -----------
        Y:          (nobs, ndim) data matrix, missing values denoted by NaN/Inf
        k:          number of singular values/vectors to find (default: k=ndim)
        tol:        convergence tolerance on change in trace norm
        maxiter:    maximum number of EM steps to perform (default: no limit)
    
        Returns:
        -----------
        Y_hat:      (nobs, ndim) reconstructed data matrix
        mu_hat:     (ndim,) estimated column means for reconstructed data
        U, s, Vt:   singular values and vectors (see np.linalg.svd and 
                    scipy.sparse.linalg.svds for details)
        """
    
        if k is None:
            svdmethod = partial(np.linalg.svd, full_matrices=False)
        else:
            svdmethod = partial(svds, k=k)
        if maxiter is None:
            maxiter = np.inf
    
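        # initialize the missing values to their respective column means,
        # treating NaN and +/-Inf alike (np.nanmean on its own skips only NaN)
        valid = np.isfinite(Y)
        mu_hat = np.nanmean(np.where(valid, Y, np.nan), axis=0, keepdims=True)
        Y_hat = np.where(valid, Y, mu_hat)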
    
        halt = False
        ii = 1
        v_prev = 0
    
        while not halt:
    
            # SVD on filled-in data
            U, s, Vt = svdmethod(Y_hat - mu_hat)
    
            # impute missing values
            Y_hat[~valid] = (U.dot(np.diag(s)).dot(Vt) + mu_hat)[~valid]
    
            # update bias parameter
            mu_hat = Y_hat.mean(axis=0, keepdims=True)
    
            # test convergence using relative change in trace norm
            # (skipped on the first pass, while v_prev is still 0)
            v = s.sum()
            if ii >= maxiter or (v_prev > 0 and abs(v - v_prev) / v_prev < tol):
                halt = True
            ii += 1
            v_prev = v
    
        return Y_hat, mu_hat, U, s, Vt
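
One point worth checking before relying on the default `k=None`: a full-rank SVD reproduces the filled-in matrix exactly (up to floating-point error), so the reconstruction step writes back the same column means it started from. Only a truncated, lower-rank reconstruction changes the imputed entries. The sketch below illustrates this on hypothetical random data; the rank `k = 2` is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np

# Hypothetical data: a random 6x4 matrix with two entries marked missing.
rng = np.random.default_rng(0)
Y = rng.random((6, 4))
Y[0, 1] = np.nan
Y[3, 2] = np.nan

valid = np.isfinite(Y)
mu = np.nanmean(Y, axis=0, keepdims=True)
Y_fill = np.where(valid, Y, mu)  # mean-filled starting point

U, s, Vt = np.linalg.svd(Y_fill - mu, full_matrices=False)

# Full-rank reconstruction: identical to the filled matrix, so the
# missing entries would stay at their column means.
full = (U * s) @ Vt + mu
print("full rank reproduces the fill:", np.allclose(full, Y_fill))

# Truncated reconstruction (k chosen arbitrarily): the missing entries
# now receive estimates that differ from the column means.
k = 2
trunc = (U[:, :k] * s[:k]) @ Vt[:k] + mu
print("change at missing entries:", (trunc - Y_fill)[~valid])
```

In terms of the function above, this suggests passing an explicit rank, for example `emsvd(df3.values, k=2)`. The value 2 is only illustrative; a suitable rank depends on the data, and with its default solver `svds` requires `k` to be strictly smaller than the smaller matrix dimension.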