I'm trying to improve on a method of comparing regression and PCA, inspired by the blog Cerebral Mastication, which has also been discussed from a different angle on SO. Before I forget, many thanks to JD Long and Josh Ulrich for much of the core of this. I'll be using it in a course next semester. Sorry this is long!
UPDATE: I found an approach that almost works (please fix it if you can!). I posted it at the bottom. It's much smarter and shorter than anything I came up with!
I basically followed the earlier schemes up to a point: generate random data, find the line of best fit, and draw the residuals. That is shown in the second code chunk below. But I also dug around and wrote some functions to draw lines normal to a line through a random point (the data points in this case). I believe these work correctly; they are shown in the First Code Chunk, along with proof that they work.
Now, the Second Code Chunk shows the whole thing in action, following the same flow as @JDLong, and I'm adding an image of the resulting plot. Data in black, the regression in red with its residuals in pink, the 1st PC in blue, and the light blue segments should be the normals, but obviously they aren't. The functions in the First Code Chunk that draw the normals seem fine, but something is wrong in the demonstration: I think I must be misunderstanding something or passing the wrong values. My normals come out horizontal, which seems like a useful clue (but so far, not to me). Can anyone see what's wrong here?
Thanks; this has been vexing me for a while...
First Code Chunk (functions to draw the normals, and proof that they work):
##### The functions below are based very loosely on the citation at the end
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
    # Px, Py is the point to test, can be a vector.
    # slope, intercept is the line to check distance.
    Ax <- Px - 10 * diff(range(Px))
    Bx <- Px + 10 * diff(range(Px))
    Ay <- Ax * slope + intercept
    By <- Bx * slope + intercept
    pointOnLine(Px, Py, Ax, Ay, Bx, By)
}

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
    # This approach based upon comingstorm's answer on
    # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
    # Vectorized by Bryan
    PB <- as.matrix(data.frame(x = Px - Bx, y = Py - By))
    AB <- as.matrix(data.frame(x = Ax - Bx, y = Ay - By))
    k_raw <- k <- c()
    for (n in 1:nrow(PB)) {
        k_raw[n] <- (PB[n, ] %*% AB[n, ]) / (AB[n, ] %*% AB[n, ])
        if (k_raw[n] < 0) {
            k[n] <- 0
        } else if (k_raw[n] > 1) {
            k[n] <- 1
        } else {
            k[n] <- k_raw[n]
        }
    }
    x <- k * Ax + (1 - k) * Bx
    y <- k * Ay + (1 - k) * By
    data.frame(x, y)
}
# The following proves that pointOnLineNearPoint
# and pointOnLine work properly and accept vectors
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
# and right angles don't appear as right angles
m <- runif(1, -5, 5)
b <- runif(1, -20, 20)
plot(-20:20, -20:20, type = "n", xlab = "x values", ylab = "y values")
abline(b, m)
Px <- rnorm(10, 0, 4)
Py <- rnorm(10, 0, 4)
res <- pointOnLineNearPoint(Px, Py, m, b)
points(Px, Py, col = "red")
segments(Px, Py, res[,1], res[,2], col = "blue")
##========================================================
##
## Credits:
## Theory by Paul Bourke http://local.wasp.uwa.edu.au/~pbourke/geometry/pointline/
## Based in part on C code by Damian Coventry Tuesday, 16 July 2002
## Based on VBA code by Brandon Crosby 9-6-05 (2 dimensions)
## With grateful thanks for answering our needs!
## This is an R (http://www.r-project.org) implementation by Gregoire Thomas 7/11/08
##
##========================================================
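As an aside, the row-by-row loop in pointOnLine can be replaced by plain vectorized arithmetic. Here is a minimal sketch of a drop-in alternative (my own variant; it is not used in the plots below):

pointOnLineVec <- function(Px, Py, Ax, Ay, Bx, By) {
    # Componentwise dot products: project P - B onto A - B
    k_raw <- ((Px - Bx) * (Ax - Bx) + (Py - By) * (Ay - By)) /
             ((Ax - Bx)^2 + (Ay - By)^2)
    k <- pmin(pmax(k_raw, 0), 1)  # clamp to the segment, as in the loop
    data.frame(x = k * Ax + (1 - k) * Bx,
               y = k * Ay + (1 - k) * By)
}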
Second Code Chunk (plots the demonstration):
set.seed(55)
np <- 10 # number of data points
x <- 1:np
e <- rnorm(np, 0, 60)
y <- 12 + 5 * x + e
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals")
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# pca "by hand"
xyNorm <- cbind(x = x - mean(x), y = y - mean(y)) # mean centers
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors
# Add the first PC by denormalizing back to original coords
# (since x is 1:np here, xyNorm[x] is just the first column, xyNorm[, 1])
new.y <- (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y)
lines(x, new.y, col = "blue", lwd = 2)
# Now add the normals
yx2.lm <- lm(new.y ~ x) # zero residuals: already a line
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
points(res[,1], res[,2], col = "blue", pch = 20) # segments should end here
segments(x, y, res[,1], res[,2], col = "lightblue1") # the normals
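A quick numeric check at this point (a sketch to separate the computation from the display, using only objects defined above): if pointOnLineNearPoint really returns perpendicular feet in data coordinates, then the slope of each light blue segment times the slope of the line should be -1, and any "horizontal" appearance would just be the unequal axis scaling (the plot above does not use asp = 1).

# Diagnostic sketch: check perpendicularity in data coordinates
seg.slope <- (res[, 2] - y) / (res[, 1] - x)
seg.slope * yx2.lm$coef[2]  # values near -1 => the math is fine and the
                            # axis scaling is distorting the picture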
############ UPDATE
At the end of Vincent Zoonekynd's Page I found almost exactly what I wanted. However, it doesn't quite work (though apparently it used to). Here is a code excerpt from that site which plots normals to the first PC reflected through a vertical axis:
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
plot(y~x, asp = 1)
r <- lm(y~x)
abline(r, col='red')
r <- princomp(cbind(x,y))
b <- r$loadings[2,1] / r$loadings[1,1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")
title(main='Appears to use the reflection of PC1')
u <- r$loadings
# Projection onto the first axis
p <- matrix( c(1,0,0,0), nrow=2 )
X <- rbind(x,y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments( x, y, X[1,], X[2,] , col = "lightblue1")
Here is the result:
Answer 0 (score: 7):
Well, I'll have to answer my own question! After further reading and comparing the methods people have posted on the internet, I have solved the problem. I'm not sure I can clearly state what I "fixed", because I went through quite a few iterations. Anyway, here are the plot and the code (MWE). The helper functions are at the end, for clarity.
# Comparison of Linear Regression & PCA
# Generate sample data
set.seed(39) # gives a decent-looking example
np <- 10 # number of data points
x <- -np:np
e <- rnorm(length(x), 0, 10)
y <- rnorm(1, 0, 2) * x + 3*rnorm(1, 0, 2) + e
# Plot the main data & residuals
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals", asp = 1)
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# Now the PCA using built-in functions
# rotation = loadings = eigenvectors
r <- prcomp(cbind(x,y), retx = TRUE)
b <- r$rotation[2,1] / r$rotation[1,1] # gets slope of loading/eigenvector 1
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue") # Plot 1st PC
# Plot normals to 1st PC
X <- pointOnLineNearPoint(x, y, b, a)
segments( x, y, X[,1], X[,2], col = "lightblue1")
###### Needed Functions
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
    # Px, Py is the point to test, can be a vector.
    # slope, intercept is the line to check distance.
    Ax <- Px - 10 * diff(range(Px))
    Bx <- Px + 10 * diff(range(Px))
    Ay <- Ax * slope + intercept
    By <- Bx * slope + intercept
    pointOnLine(Px, Py, Ax, Ay, Bx, By)
}

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
    # This approach based upon comingstorm's answer on
    # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
    # Vectorized by Bryan
    PB <- as.matrix(data.frame(x = Px - Bx, y = Py - By))
    AB <- as.matrix(data.frame(x = Ax - Bx, y = Ay - By))
    k_raw <- k <- c()
    for (n in 1:nrow(PB)) {
        k_raw[n] <- (PB[n, ] %*% AB[n, ]) / (AB[n, ] %*% AB[n, ])
        if (k_raw[n] < 0) {
            k[n] <- 0
        } else if (k_raw[n] > 1) {
            k[n] <- 1
        } else {
            k[n] <- k_raw[n]
        }
    }
    x <- k * Ax + (1 - k) * Bx
    y <- k * Ay + (1 - k) * By
    data.frame(x, y)
}
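As a sanity check on the plot above (a sketch, not part of the MWE proper), the normals can be verified numerically:

# Each light blue segment should be perpendicular to the 1st PC, so the
# product of its slope and the PC slope should be -1 (up to rounding).
seg.slope <- (X[, 2] - y) / (X[, 1] - x)
range(seg.slope * b)  # expect values essentially equal to -1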
Answer 1 (score: 1):
Try changing this line of your code:
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
to
res <- pointOnLineNearPoint(x, new.y, yx2.lm$coef[2], yx2.lm$coef[1])
so that you call the correct y values.
Answer 2 (score: 1):
In Vincent Zoonekynd's code, change the line u <- r$loadings to u <- solve(r$loadings). In the second instance of solve() in that code, the predicted component scores along the first principal axis (that is, the matrix of predicted scores with the second component's scores set to zero) need to be multiplied by the inverse of the loadings/eigenvectors. Multiplying the data by the loadings gives predicted scores; dividing the predicted scores by the loadings gives back the data. Hope that helps.
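For reference, here is the excerpt again with that single change applied (a sketch of the fix just described, using only the code already shown above):

set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
plot(y ~ x, asp = 1)
abline(lm(y ~ x), col = "red")
r <- princomp(cbind(x, y))
b <- r$loadings[2, 1] / r$loadings[1, 1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")            # the 1st PC
u <- solve(r$loadings)                # the one-line fix: invert the loadings
p <- matrix(c(1, 0, 0, 0), nrow = 2)  # keep only the 1st component's scores
X <- rbind(x, y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments(x, y, X[1, ], X[2, ], col = "lightblue1")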