Question

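I have a data frame in which I know that some columns are exact linear formulas of some of the other columns, but I don't know which columns they are.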

       A     B    C      D    E    G
1  -8453   319 3363 -16382 8290 2683
2   2269 -5687 5810   6626 5857 1283
3   8381  5725 1099  -6145 8507 1393
4  -2248  3936 5394 -10503 1803 7910
5   9579  4210 4027   4049 5235  112
6   7351  3717 2357  -1357 5458 1890
7  -8323 -9181 7914  -2417 2252 8937
8    731 -5936 5948  -4190 7621 9184
9  -7419  5345  218 -20339 7139  654
10 -9353  4583  444 -22751 6108 3151

DT <- structure(list(A = c(-6381L, 6029L, 171L, 6451L, -8843L, -4651L, 
-4142L, -9292L, -5857L, 3378L), B = c(-9170L, 6601L, -4307L, 
8391L, -5360L, 3783L, 4481L, 3990L, 5308L, -8744L), C = c(7899L, 
1031L, 8288L, 2034L, 2146L, 2862L, 4911L, 1808L, 4351L, 287L), 
    D = c(4772L, -12577L, 7358L, -10506L, -15314L, -17401L, -7939L, 
    -29133L, -17846L, 5631L), E = c(15L, 5708L, 5272L, 5651L, 
    8126L, 8805L, 20L, 9129L, 3786L, 5498L), G = c(5901L, 7328L, 
    136L, 4949L, 5851L, 3024L, 4207L, 8530L, 7246L, 1280L)), class = "data.frame", row.names = c(NA, 
-10L), .Names = c("A", "B", "C", "D", "E", "G"))

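My initial reaction was to loop through the columns of DT and run lm on the remaining columns, searching for r.squared == 1, but I'm wondering whether there is a function for this specific task.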
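The loop described above could be sketched roughly as follows. The data is synthetic and made up for illustration (the names df, x1, x2, x3, y, z and the tolerance 1e-8 are assumptions, not part of the question), and the comparison uses a tolerance rather than r.squared == 1, since a floating-point fit is not guaranteed to hit exactly 1.

```r
# Synthetic data, assumed for illustration only: y is an exact linear
# combination of x1, x2 and x3, while z is unrelated noise.
set.seed(1)
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
df$y <- df$x1 - df$x2 + df$x3
df$z <- rnorm(10)

# Regress each column on all the others and keep the R-squared.
r2 <- sapply(names(df), function(col) {
  fit <- lm(as.formula(paste(col, "~ .")), data = df)
  # summary.lm warns about an "essentially perfect fit"; expected here
  suppressWarnings(summary(fit)$r.squared)
})

# Compare against 1 with a tolerance instead of using ==
exact <- names(r2)[abs(r2 - 1) < 1e-8]
print(exact)
```

Every column that takes part in the dependency is flagged, not only the one that was constructed from the others, because any of them can be rewritten in terms of the rest.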

Answer 1

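I would challenge your claim (or at least what I initially took to be your claim). The first tool I reached for to investigate it was Hmisc::rcorr, which calculates all of the pairwise correlation coefficients. If any one column were a linear function of a single other column, the corresponding correlation coefficient would be 1.0 in absolute value: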

> rcorr(data.matrix(DT))
      A     B     C     D     E     G
A  1.00  0.22 -0.28  0.40 -0.05 -0.35
B  0.22  1.00 -0.32 -0.67  0.18  0.44
C -0.28 -0.32  1.00  0.49 -0.58 -0.27
D  0.40 -0.67  0.49  1.00 -0.55 -0.72
E -0.05  0.18 -0.58 -0.55  1.00  0.07
G -0.35  0.44 -0.27 -0.72  0.07  1.00
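A small sketch of why pairwise correlations can miss this kind of relationship, using base R's cor on made-up data (the names below are illustrative assumptions): a column built from three others is exactly determined by them jointly, yet it is not perfectly correlated with any single one of them.

```r
set.seed(1)
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
df$y <- df$x1 - df$x2 + df$x3   # exact linear combination of three columns

cm <- cor(df)
# Largest absolute correlation between two distinct columns
max_offdiag <- max(abs(cm[upper.tri(cm)]))
print(round(cm, 2))
print(max_offdiag < 1)
```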

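It turns out that the linear dependence requires all 6 columns, since removing any one column leaves a submatrix of full column rank (rankMatrix is from the Matrix package):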

sapply(1:6,  function(i) rankMatrix(as.matrix(DT[-i]))  )
[1] 5 5 5 5 5 5
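The same drop-one-column check can be sketched with base R's qr()$rank in place of rankMatrix. This is an illustration on synthetic data, not the answer's own code; the matrix M and the sign vector are assumptions, and qr uses its own default tolerance, which may differ from rankMatrix's.

```r
set.seed(42)
X <- matrix(rnorm(40), nrow = 10)       # four independent columns
M <- cbind(X %*% c(1, -1, 1, 1), X)     # prepend a signed sum of all four

full_rank <- qr(M)$rank                 # rank of the full 10 x 5 matrix
# Rank after dropping each column in turn
drop_rank <- sapply(seq_len(ncol(M)), function(i) qr(M[, -i])$rank)

print(full_rank)
print(drop_rank)
```

If the full rank is one less than the number of columns and every drop-one rank equals the full rank, the single dependency involves every column, which is the situation reported above.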

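Working from Roland's comment to see which coefficients give the exact linear dependence (the row labels come from the first fit, so they are accurate only for column A; in every other column the rows are that model's own predictors in order):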

sapply(LETTERS[1:5], function(col) round( lm(as.formula(paste0(col, " ~ .")), data = DT)$coef,4)  )
             A  B  C  D  E
(Intercept)  0  0  0  0  0
B            1  1 -1  1  1
C           -1  1  1 -1 -1
D            1 -1  1  1  1
E            1 -1  1 -1 -1
G            1 -1  1 -1 -1
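To make the coefficient-recovery step concrete, here is a minimal sketch on synthetic data where the formula is known in advance. The data frame and the coefficients 2, -3 and 0.5 are assumptions chosen for illustration; rounding the fitted coefficients strips the floating-point noise and returns the formula.

```r
set.seed(7)
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
df$y <- 2 * df$x1 - 3 * df$x2 + 0.5 * df$x3   # assumed known formula

fit <- lm(y ~ ., data = df)
cf  <- round(coef(fit), 4)   # rounding removes floating-point noise
print(cf)
```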

@Hugh: Be sure to cite StackOverflow in your homework write-up ;-)

Here is one way to make a similar matrix:

res <- replicate(5, sample((-10000):10000, 10) )
res2 <- res %*% sample(c(-1,1) , 5, repl=TRUE)
res3 <- cbind(res2, res)

Then, checking a couple of them with Dason's linfinder:

> linfinder(data.matrix(res3))
[1] "Column_6 = -1*Column_1 + -1*Column_2 + -1*Column_3 + -1*Column_4 + -1*Column_5"
> res2 <- res %*% sample(c(-1,1) , 5, repl=TRUE)
> res3 <- cbind(res2, res)
> linfinder(data.matrix(res3))
[1] "Column_6 = -1*Column_1 + -0.999999999999999*Column_2 + 0.999999999999999*Column_3 + 0.999999999999999*Column_4 + 0.999999999999999*Column_5"
>
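The source of linfinder is not shown here, so the following is only a sketch of one way such a function could work, not Dason's implementation. The helper name find_dependency, its arguments and the tolerance are hypothetical. It reads the dependency off the null space given by the singular value decomposition, and it assumes a single dependency, no intercept term, and at least as many rows as columns.

```r
# Hypothetical helper, not Dason's linfinder: express one column of M in
# terms of the others using the null space found by the SVD.
find_dependency <- function(M, target = 1, tol = 1e-8) {
  M <- as.matrix(M)
  s <- svd(M)
  # A (near-)zero smallest singular value signals a linear dependency
  if (min(s$d) > tol * max(s$d)) return(NULL)
  v <- s$v[, which.min(s$d)]               # M %*% v is approximately 0
  if (abs(v[target]) < tol) return(NULL)   # target column is not involved
  coefs <- -v[-target] / v[target]         # solve M %*% v = 0 for the target
  names(coefs) <- paste0("Column_", seq_len(ncol(M))[-target])
  round(coefs, 6)
}

set.seed(123)
res   <- replicate(5, sample((-10000):10000, 10))
signs <- c(1, -1, 1, 1, -1)
M     <- cbind(res %*% signs, res)   # Column_1 is a signed sum of Columns 2..6
print(find_dependency(M))
```

Rounding the result hides the 0.999999999999999-style noise seen in the output above.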

Answer 2

My first guess ended up working well:

❥ output <- lm(A ~ C + D + E + G + B, data = DT)
❥ summary(output)

Call:
lm(formula = A ~ C + D + E + G + B, data = DT)

Residuals:
        1         2         3         4         5         6         7         8
-4.80e-12  1.59e-12  3.61e-12 -2.82e-12  2.79e-12 -5.58e-12  1.49e-12 -8.34e-14
        9        10
 3.40e-12  4.10e-13

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)
(Intercept)  5.75e-13   8.62e-12  7.00e-02     0.95
C           -1.00e+00   7.90e-16 -1.27e+15   <2e-16 ***
D            1.00e+00   3.94e-16  2.54e+15   <2e-16 ***
E            1.00e+00   9.46e-16  1.06e+15   <2e-16 ***
G            1.00e+00   1.17e-15  8.51e+14   <2e-16 ***
B            1.00e+00   3.85e-16  2.60e+15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.99e-12 on 4 degrees of freedom
Multiple R-squared:     1,  Adjusted R-squared:     1
F-statistic: 2.53e+30 on 5 and 4 DF,  p-value: <2e-16

Warning message:
In summary.lm(output) : essentially perfect fit: summary may be unreliable

Determining an exact formula
