Question

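I'm fairly new to statistics and sklearn, so forgive me if this is a very basic question. I have an m×n matrix (thousands of rows, hundreds of columns), and I'm trying to find the columns that are linear combinations of other columns in the matrix so that I can flag and remove them.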

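In R, I was able to use the lm() function and look for variables with NA coefficients to find the variables that are linear combinations of other variables, but sklearn's linear regression is based on a different implementation that returns different results.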

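So far I have tried using QR decomposition to find the linearly independent and dependent columns, but the results are not correct (in the example below, I ran np.linalg.qr() on a matrix containing a column of all 1s, and it did not flag column d as a "bad" column).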

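I'm doing this to determine whether a given matrix is rank deficient and which columns can be removed. Does anyone know what is wrong with my QR decomposition, or whether there is another way to determine if a matrix is rank deficient and which columns are linear combinations of other columns?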
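For the first half of that goal (is the matrix rank deficient at all?), a minimal sketch using `np.linalg.matrix_rank` could look like the following. The helper name `is_rank_deficient` and the small example matrix are made up for illustration; they are not part of the original data.

```python
import numpy as np

def is_rank_deficient(X):
    """Return (deficient, rank): deficient is True when the numerical
    column rank of X is smaller than its number of columns."""
    X = np.asarray(X, dtype=float)
    # matrix_rank counts singular values above a default tolerance
    rank = int(np.linalg.matrix_rank(X))
    return rank < X.shape[1], rank

# Hypothetical example: the third column is the sum of the first two
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0],
              [2.0, 0.0, 2.0]])

deficient, rank = is_rank_deficient(X)
print(deficient, rank)
```

This only says whether some dependency exists, not which columns are involved.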

A small QR decomposition attempt (found HERE):

import pandas as pd
from sklearn import linear_model
import numpy as np

a = [546, 42, 68, 15, 47, 12, 154, 45]
b = [4, 6, 34, 2, 8, 24, 35, 93]
c = [44, 55, 66, 77, 88, 12, 41, 32]
d = [1, 1, 1, 1, 1, 1, 1, 1]
e = [1, 1, 1, 1, 1, 1, 1, 1.1]
f = [1, 1, 1, 1, 1, 1, 1, 1.01]
g = [1, 1, 1, 1, 1, 1, 1, 1.001]


df = pd.DataFrame({'a': a,
                 'b': b,
                 'c': c,
                 'd': d,
                 'e': e,
                 'f': f,
                 'g': g })


# get R matrix from QR decomposition
R = np.linalg.qr(df)[1] 
R = abs(R)
sums = R.sum(axis=1)

columns = df.columns.tolist()

# rows of R whose absolute sum is near 0 are treated as linearly
# dependent columns in the original matrix
for name, row_sum in zip(columns, sums):
    if row_sum > 1.e-10:
        print("{} is a good column!".format(name))
    else:
        print("{} is a bad column!".format(name))

Output:

a is a good column!
b is a good column!
c is a good column!
d is a good column!
e is a good column!
f is a bad column!
g is a bad column!

In R, the output for this dataset is the same, except that column d is flagged as bad (as it should be, since it is a column of all 1s).
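One likely explanation for the difference is that R's lm() adds an intercept term by default, so a column of all 1s duplicates the intercept, whereas the QR attempt above has no intercept column. The sketch below is not R's algorithm; it is a simple greedy alternative that walks the columns left to right and flags any column that does not raise the numerical rank. The function name `dependent_columns`, the `add_intercept` switch, and the relative tolerance `rtol=1e-10` are assumptions chosen for illustration.

```python
import numpy as np

def dependent_columns(X, names, rtol=1e-10, add_intercept=False):
    """Return the names of columns that are (numerically) linear
    combinations of columns to their left."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    if add_intercept:
        # Assumption: mimic a regression that fits an intercept
        X = np.column_stack([np.ones(X.shape[0]), X])
        names = ["(intercept)"] + names
    kept, dropped = [], []
    for j in range(X.shape[1]):
        # Singular values of the kept columns plus candidate column j
        s = np.linalg.svd(X[:, kept + [j]], compute_uv=False)
        rank = int(np.sum(s > rtol * s[0]))
        if rank == len(kept) + 1:
            kept.append(j)            # column j adds a new direction
        else:
            dropped.append(names[j])  # column j is redundant
    return dropped

# Same data as in the question
a = [546, 42, 68, 15, 47, 12, 154, 45]
b = [4, 6, 34, 2, 8, 24, 35, 93]
c = [44, 55, 66, 77, 88, 12, 41, 32]
d = [1, 1, 1, 1, 1, 1, 1, 1]
e = [1, 1, 1, 1, 1, 1, 1, 1.1]
f = [1, 1, 1, 1, 1, 1, 1, 1.01]
g = [1, 1, 1, 1, 1, 1, 1, 1.001]
X = np.column_stack([a, b, c, d, e, f, g])
names = list("abcdefg")

without_intercept = dependent_columns(X, names)
with_intercept = dependent_columns(X, names, add_intercept=True)
print(without_intercept)
print(with_intercept)
```

On this data the version without an intercept should flag f and g, and the version with an intercept should additionally flag d, which matches the R behaviour described above. Recomputing an SVD per column is wasteful for hundreds of columns; a column-pivoted QR (for example `scipy.linalg.qr` with `pivoting=True`) is a common faster route, at the cost of not preserving left-to-right column order.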

Finding columns that are linear combinations of other columns

0 answers: