Question

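I ran a regression:

CopierDataRegression <- lm(V1~V2, data=CopierData1)

My task was to obtain

a 90% confidence interval for the mean response given V2=6, and

a 90% prediction interval when V2=6.

I used the following code:

X6 <- data.frame(V2=6)
predict(CopierDataRegression, X6, se.fit=TRUE, interval="confidence", level=0.90)
predict(CopierDataRegression, X6, se.fit=TRUE, interval="prediction", level=0.90)

and I got (87.3, 91.9) and (74.5, 104.8), which seems to be correct since the PI should be wider.

The output for both also included se.fit = 1.39, which was the same in both cases. I don't understand what this standard error is. Shouldn't the standard error be larger for the PI than for the CI? How do I find these two different standard errors in R?

Data:

CopierData1 <- structure(list(V1 = c(20L, 60L, 46L, 41L, 12L, 137L, 68L, 89L, 4L, 32L, 144L, 156L, 93L, 36L, 72L, 100L, 105L, 131L, 127L, 57L, 66L, 101L, 109L, 74L, 134L, 112L, 18L, 73L, 111L, 96L, 123L, 90L, 20L, 28L, 3L, 57L, 86L, 132L, 112L, 27L, 131L, 34L, 27L, 61L, 77L),
    V2 = c(2L, 4L, 3L, 2L, 1L, 10L, 5L, 5L, 1L, 2L, 9L, 10L, 6L, 3L, 4L, 8L, 7L, 8L, 10L, 4L, 5L, 7L, 7L, 5L, 9L, 7L, 2L, 5L, 7L, 6L, 8L, 5L, 2L, 2L, 1L, 4L, 5L, 9L, 7L, 1L, 9L, 2L, 2L, 4L, 5L)),
    .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA, -45L))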

Answer 1

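When the interval and level arguments are specified, predict.lm can return a confidence interval (CI) or a prediction interval (PI). This answer shows how to obtain the CI and PI without setting these arguments. There are two ways:

use intermediate results from predict.lm;
do everything from scratch.

Knowing how to work both ways gives you a thorough understanding of the prediction procedure.

Note that we only cover the type = "response" (default) case of predict.lm. Discussion of type = "terms" is beyond the scope of this answer.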

Setup

I gather your code here so that other readers can copy, paste and run it. I have also changed the variable names so that they have clearer meanings. In addition, I have expanded newdat to include more than one row, to show that our computations are "vectorized".

dat <- structure(list(V1 = c(20L, 60L, 46L, 41L, 12L, 137L, 68L, 89L, 4L, 32L, 144L, 156L, 93L, 36L, 72L, 100L, 105L, 131L, 127L, 57L, 66L, 101L, 109L, 74L, 134L, 112L, 18L, 73L, 111L, 96L, 123L, 90L, 20L, 28L, 3L, 57L, 86L, 132L, 112L, 27L, 131L, 34L, 27L, 61L, 77L),
    V2 = c(2L, 4L, 3L, 2L, 1L, 10L, 5L, 5L, 1L, 2L, 9L, 10L, 6L, 3L, 4L, 8L, 7L, 8L, 10L, 4L, 5L, 7L, 7L, 5L, 9L, 7L, 2L, 5L, 7L, 6L, 8L, 5L, 2L, 2L, 1L, 4L, 5L, 9L, 7L, 1L, 9L, 2L, 2L, 4L, 5L)),
    .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA, -45L))

lmObject <- lm(V1 ~ V2, data = dat)

newdat <- data.frame(V2 = c(6, 7))

The following is the output of predict.lm, to be compared with our manual computations later.

predict(lmObject, newdat, se.fit = TRUE, interval = "confidence", level = 0.90)
#$fit
#        fit       lwr      upr
#1  89.63133  87.28387  91.9788
#2 104.66658 101.95686 107.3763
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

predict(lmObject, newdat, se.fit = TRUE, interval = "prediction", level = 0.90)
#$fit
#        fit      lwr      upr
#1  89.63133 74.46433 104.7983
#2 104.66658 89.43930 119.8939
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

Using intermediate results from predict.lm

## use `se.fit = TRUE`
z <- predict(lmObject, newdat, se.fit = TRUE)
#$fit
#        1         2 
# 89.63133 104.66658 
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

What is se.fit?

z$se.fit is the standard error of the predicted mean z$fit, used to construct the CI for z$fit. We also need quantiles of the t-distribution with z$df degrees of freedom.

alpha <- 0.90  ## 90%
Qt <- c(-1, 1) * qt((1 - alpha) / 2, z$df, lower.tail = FALSE)
#[1] -1.681071  1.681071

## 90% confidence interval
CI <- z$fit + outer(z$se.fit, Qt)
colnames(CI) <- c("lwr", "upr")
CI
#        lwr      upr
#1  87.28387  91.9788
#2 101.95686 107.3763

We see that this agrees with predict.lm(, interval = "confidence").

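What is the standard error for the PI?

The PI is wider than the CI because it accounts for the residual variance:

variance_of_PI = variance_of_CI + variance_of_residual

Note that this is defined point-wise. For a non-weighted linear regression (as in your example), the residual variance is equal everywhere (known as homoscedasticity), and it is z$residual.scale ^ 2. Thus the standard error for the PI is

se.PI <- sqrt(z$se.fit ^ 2 + z$residual.scale ^ 2)
#       1        2 
#9.022228 9.058082

and the PI is constructed as

PI <- z$fit + outer(se.PI, Qt)
colnames(PI) <- c("lwr", "upr")
PI
#       lwr      upr
#1 74.46433 104.7983
#2 89.43930 119.8939

We see that this agrees with predict.lm(, interval = "prediction").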
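Written as formulas, the two intervals differ only in the standard error that multiplies the t quantile. The block below restates this and redoes the arithmetic for V2 = 6 with the numbers printed above. The symbols are notation introduced here for illustration only: \hat{y}_0 stands for z$fit, \mathrm{se}_{\mathrm{fit}} for z$se.fit, \hat{\sigma} for z$residual.scale and \nu for z$df.

```latex
\begin{aligned}
\text{CI} &= \hat{y}_0 \pm t_{0.95,\,\nu}\,\mathrm{se}_{\mathrm{fit}} \\
\text{PI} &= \hat{y}_0 \pm t_{0.95,\,\nu}\,\mathrm{se}_{\mathrm{PI}},
  \qquad \mathrm{se}_{\mathrm{PI}} = \sqrt{\mathrm{se}_{\mathrm{fit}}^{2} + \hat{\sigma}^{2}} \\
\mathrm{se}_{\mathrm{PI}} &= \sqrt{1.396411^{2} + 8.913508^{2}}
  = \sqrt{1.949963 + 79.45063} \approx 9.022228 \\
\text{PI} &= 89.63133 \pm 1.681071 \times 9.022228 \approx (74.46433,\ 104.7983)
\end{aligned}
```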

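Remark

Things are more complicated for a weighted linear regression, where the residual variance is not equal everywhere, so that z$residual.scale ^ 2 should be weighted. It is easier to construct the PI for fitted values (that is, you do not set newdata when using interval = "prediction" in predict.lm), because the weights are known (you must have provided them via the weights argument when calling lm). For out-of-sample prediction (that is, you pass a newdata to predict.lm), predict.lm expects you to tell it how the residual variance should be weighted. You need to use either the pred.var or the weights argument of predict.lm; otherwise you get a warning from predict.lm complaining about insufficient information for constructing the PI. The following is quoted from ?predict.lm:

The prediction intervals are for a single observation at each case in ‘newdata’ (or by default, the data used for the fit) with error variance(s) ‘pred.var’. This can be a multiple of ‘res.var’, the estimated value of sigma^2: the default is to assume that future observations have the same error variance as those used for fitting. If ‘weights’ is supplied, the inverse of this is used as a scale factor. For a weighted fit, if the prediction is for the original data frame, ‘weights’ defaults to the weights used for the model fit, with a warning since it might not be the intended result. If the fit was weighted and ‘newdata’ is given, the default is to assume constant prediction variance, with a warning.

Note that the construction of the CI is not affected by the type of regression.

Do everything from scratch

Basically, we want to know how to obtain fit, se.fit, df and residual.scale in z.

The predicted mean can be computed by the matrix-vector multiplication Xp %*% b, where Xp is the linear predictor matrix and b is the regression coefficient vector.

Xp <- model.matrix(delete.response(terms(lmObject)), newdat)
b <- coef(lmObject)
yh <- c(Xp %*% b)  ## c() reshapes the single-column matrix into a vector
#[1]  89.63133 104.66658

We see that this agrees with z$fit. The variance-covariance matrix of yh is Xp %*% V %*% t(Xp), where V is the variance-covariance matrix of b, which can be computed by

V <- vcov(lmObject)  ## use the `vcov` function in R
#            (Intercept)         V2
#(Intercept)    7.862086 -1.1927966
#V2            -1.192797  0.2333733

The full variance-covariance matrix of yh is not needed to compute a point-wise CI or PI. We only need its main diagonal. So instead of computing diag(Xp %*% V %*% t(Xp)), we can do it more efficiently via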

var.fit <- rowSums((Xp %*% V) * Xp)  ## point-wise variance for predicted mean
#       1        2 
#1.949963 2.598222 

sqrt(var.fit)  ## this agrees with `z$se.fit`
#       1        2 
#1.396411 1.611900
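The rowSums shortcut works because the i-th diagonal entry of Xp %*% V %*% t(Xp) is the inner product of row i of Xp %*% V with row i of Xp. A minimal check is sketched below. It uses R's built-in cars data rather than the data above (an assumption made only to keep the example self-contained), and the names fit, new_speed, full and fast are illustrative.

```r
## Illustration on the built-in `cars` data (not the data from the question)
fit <- lm(dist ~ speed, data = cars)
new_speed <- data.frame(speed = c(10, 15, 20))

Xp <- model.matrix(delete.response(terms(fit)), new_speed)
V  <- vcov(fit)

full <- diag(Xp %*% V %*% t(Xp))   ## forms the whole 3 x 3 matrix first
fast <- rowSums((Xp %*% V) * Xp)   ## computes only the diagonal

all.equal(unname(full), unname(fast))

## the square root should also match what predict.lm reports as se.fit
se_ref <- predict(fit, new_speed, se.fit = TRUE)$se.fit
all.equal(unname(sqrt(fast)), unname(se_ref))
```

With only a few prediction points the saving is negligible; it matters when newdata has many rows, since the full matrix grows quadratically with the number of rows.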

The residual degrees of freedom are readily available in the fitted model:

dof <- df.residual(lmObject)
#[1] 43

Finally, to compute the residual variance, use the Pearson estimator:

sig2 <- c(crossprod(lmObject$residuals)) / dof
# [1] 79.45063

sqrt(sig2)  ## this agrees with `z$residual.scale`
#[1] 8.913508
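The four ingredients (predicted mean, its point-wise variance, the residual degrees of freedom and the residual variance) can be collected into one small helper. The function below is a sketch for non-weighted fits only. Its name manual_interval and its arguments are hypothetical, it is not the lm_predict function from the linked Q&A, and it is demonstrated on the built-in cars data as an assumption to stay self-contained.

```r
## Hypothetical helper: CI or PI from scratch for a non-weighted lm fit
manual_interval <- function(fit, newdata, level = 0.90,
                            interval = c("confidence", "prediction")) {
  interval <- match.arg(interval)
  Xp <- model.matrix(delete.response(terms(fit)), newdata)
  yh <- c(Xp %*% coef(fit))                     ## predicted mean
  var.fit <- rowSums((Xp %*% vcov(fit)) * Xp)   ## point-wise variance of the mean
  dof  <- df.residual(fit)
  sig2 <- c(crossprod(residuals(fit))) / dof    ## residual variance
  se <- if (interval == "confidence") sqrt(var.fit) else sqrt(var.fit + sig2)
  Qt <- c(-1, 1) * qt((1 - level) / 2, dof, lower.tail = FALSE)
  out <- cbind(yh, yh + outer(se, Qt))
  colnames(out) <- c("fit", "lwr", "upr")
  out
}

## Illustration on the built-in `cars` data (not the data from the question)
fit <- lm(dist ~ speed, data = cars)
nd  <- data.frame(speed = c(10, 20))

manual_interval(fit, nd, interval = "prediction")
predict(fit, nd, interval = "prediction", level = 0.90)  ## for comparison
```

The only difference between the two interval types is the single line that defines se, which is the point of the whole answer.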

Remark

Note that in the case of weighted regression, sig2 should be computed as

sig2 <- c(crossprod(sqrt(lmObject$weights) * lmObject$residuals)) / dof
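To make the weighted case concrete, the sketch below simulates a heteroscedastic data set (simulated data, invented only for this illustration), fits a weighted model, and checks two things: that the weighted formula for sig2 reproduces residual.scale ^ 2, and that a PI for new data can be rebuilt by hand once prediction weights are passed through the weights argument of predict.lm. The variable names and the choice of an error variance proportional to x are assumptions.

```r
set.seed(1)
x <- runif(60, 1, 10)
y <- 2 + 3 * x + rnorm(60, sd = sqrt(x))  ## error variance grows with x
w <- 1 / x                                ## weights: inverse variance, up to a constant
wfit <- lm(y ~ x, weights = w)

dof  <- df.residual(wfit)
sig2 <- c(crossprod(sqrt(wfit$weights) * wfit$residuals)) / dof

newx  <- data.frame(x = c(6, 7))
new_w <- 1 / newx$x                       ## prediction weights for the new points
z <- predict(wfit, newx, se.fit = TRUE)
all.equal(sig2, z$residual.scale ^ 2)

## PI by hand: the residual variance at each new point is sig2 / weight
Qt    <- c(-1, 1) * qt(0.05, dof, lower.tail = FALSE)
se.PI <- sqrt(z$se.fit ^ 2 + sig2 / new_w)
PI    <- z$fit + outer(se.PI, Qt)

PI_ref <- predict(wfit, newx, interval = "prediction", level = 0.90,
                  weights = new_w)
all.equal(unname(PI), unname(PI_ref[, c("lwr", "upr")]))
```

Leaving out weights = new_w in the last predict call makes predict.lm fall back to a constant prediction variance and issue the warning described in the remark above.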

Appendix: a self-written function that mimics predict.lm

The code in "Do everything from scratch" has been cleanly organized into a function lm_predict in this Q&A: linear model with lm: how to get prediction variance of sum of predicted values.

Answer 2

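I don't know whether there is a quick way to extract the standard error for the prediction interval, but you can always back-solve the intervals for the SE (even though it is not a super elegant approach):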

m <- lm(V1 ~ V2, data = d)  # d is the data frame holding V1 and V2

newdat <- data.frame(V2=6)
tcrit <- qt(0.95, m$df.residual)  # t quantile for a two-sided 90% interval

a <- predict(m, newdat, interval="confidence", level=0.90)
cat("CI SE", (a[1, "upr"] - a[1, "fit"]) / tcrit, "\n")

b <- predict(m, newdat, interval="prediction", level=0.90)
cat("PI SE", (b[1, "upr"] - b[1, "fit"]) / tcrit, "\n")

Note that the CI SE is the same value as se.fit.
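The back-solved PI standard error can be cross-checked against the additive rule from the first answer, sqrt(se.fit ^ 2 + residual.scale ^ 2). A short sketch follows, on the built-in cars data (an assumption made for self-containment; the variable names are illustrative).

```r
## Illustration on the built-in `cars` data (not the data from the question)
m  <- lm(dist ~ speed, data = cars)
nd <- data.frame(speed = 15)
tcrit <- qt(0.95, m$df.residual)

## back-solve the PI standard error from the interval half-width
p <- predict(m, nd, interval = "prediction", level = 0.90)
se_backsolved <- unname((p[1, "upr"] - p[1, "fit"]) / tcrit)

## build it directly from se.fit and the residual standard error
z <- predict(m, nd, se.fit = TRUE)
se_additive <- unname(sqrt(z$se.fit ^ 2 + z$residual.scale ^ 2))

all.equal(se_backsolved, se_additive)
```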

How does predict.lm() compute confidence intervals and prediction intervals?
