How does AICc handle nominal versus numeric variables in linear models?

Asked: 2018-07-17 10:44:37

Tags: r statistics linear-regression

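Today I have a question about how numeric and nominal variables are handled in linear models. My goal is to compare models using the second-order Akaike information criterion (AICc, package 'MuMIn'), which is intended for small n.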

Here is some made-up data and the setup code:

library(MASS)
library(MuMIn)

set.seed(123)
treatments <- c(rep(paste0('t', 1:6), each = 3)) # nominal variable
x <- abs(rnorm(mean = 9500,n = 18,sd = 20000)) # observation
var3 <- runif(n=18, min = 100, max=1000)
var2 <- rnorm(n = 18, mean = 50)
var1 <- c(runif(n=3, min = 80, max=100), # numerical dummy variable for t1
      runif(n=3, min = 65, max=85),  # t2
      runif(n=3, min = 75, max=90), # t3
      runif(n=3, min = 15, max=50), # t4
      runif(n=3, min = 0, max=20), # t5
      runif(n=3, min = 30, max=60)) #t6
boxplot(var1~treatments) # well-separated for each treatment: use as dummy
dat <- data.frame(x, var1, var2, var3, treatments)

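Explanation: we have an observation (x) and want to know the effect of treatments 1-6. The data contain a nominal variable for the different treatments, and by chance we also have a numeric variable (var1) that can serve as a dummy/proxy for the individual treatments.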
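
To make the difference between the two encodings concrete, here is a minimal sketch that inspects the design matrices R would build for each kind of predictor. The `var1` used here is a simplified stand-in (an assumption for illustration, not the grouped values from the setup code). The six-level factor expands to an intercept plus five indicator columns, while the numeric proxy contributes an intercept plus a single slope, so the two kinds of model differ in the number of estimated parameters.

```r
# Minimal sketch: how many columns each encoding contributes to the design matrix.
set.seed(123)
treatments <- factor(rep(paste0("t", 1:6), each = 3))  # nominal variable, 6 levels
var1 <- runif(18, min = 0, max = 100)                   # hypothetical numeric proxy

X.nominal <- model.matrix(~ treatments)  # intercept + 5 treatment-contrast dummies
X.numeric <- model.matrix(~ var1)        # intercept + 1 slope

dim(X.nominal)       # 18 rows, 6 columns
dim(X.numeric)       # 18 rows, 2 columns
colnames(X.nominal)  # "(Intercept)", "treatmentst2", ..., "treatmentst6"
```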

Here is the linear modelling:

lm.nominal.1 <- lm(formula = x~treatments, data = dat)
qqnorm(rstudent(lm.nominal.1)); qqline(rstudent(lm.nominal.1)) # does not look too well
plot(rstudent(lm.nominal.1)~fitted(lm.nominal.1)) ; abline(h=0, col='red') # same here

# so let's log-transform:
lm.nominal.1.log <- lm(formula = log(x)~treatments, data = dat)
qqnorm(rstudent(lm.nominal.1.log)); qqline(rstudent(lm.nominal.1.log)) # much better
plot(rstudent(lm.nominal.1.log)~fitted(lm.nominal.1.log)) ; abline(h=0, col='red') # same here

# ... and likewise for the remaining models
lm.nominal.2.log <- lm(formula = log(x)~treatments+var2, data = dat)
lm.nominal.3.log <- lm(formula = log(x)~treatments+var2+var3, data = dat)

lm.numeric.1.log <- lm(formula = log(x)~var1, data = dat) 
lm.numeric.2.log <- lm(formula = log(x)~var1+var2, data = dat)
lm.numeric.3.log <- lm(formula = log(x)~var1+var2+var3, data = dat)

Here is the Akaike criterion:

AICc.nominals <- AICc(lm.nominal.1.log, lm.nominal.2.log, lm.nominal.3.log)
AICc.nominals

AICc.numerics <- AICc(lm.numeric.1.log, lm.numeric.2.log, lm.numeric.3.log)
AICc.numerics

AICc.all <- AICc(lm.nominal.1.log, lm.nominal.2.log, lm.nominal.3.log,
                 lm.numeric.1.log, lm.numeric.2.log, lm.numeric.3.log)
# Now further model / likelihood analysis:
AICc.all$Deltai <- AICc.all$AICc - min(AICc.all$AICc)
AICc.all$Weights <- Weights(AICc(lm.nominal.1.log, lm.nominal.2.log, lm.nominal.3.log,
                                 lm.numeric.1.log, lm.numeric.2.log, lm.numeric.3.log))
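
For reference, the quantities computed above follow the standard small-sample definitions, written here with $k$ for the number of estimated parameters (the `df` column, which for `lm` fits includes the residual variance), $n$ for the number of observations, and $\hat{L}$ for the maximized likelihood. This is the textbook form; that 'MuMIn' applies exactly this expression to `lm` fits is my assumption rather than something stated in the question.

```latex
\begin{align}
\mathrm{AIC}  &= -2\ln\hat{L} + 2k \\
\mathrm{AICc} &= \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} \\
\Delta_i      &= \mathrm{AICc}_i - \min_j \mathrm{AICc}_j \\
w_i           &= \frac{\exp(-\Delta_i/2)}{\sum_j \exp(-\Delta_j/2)}
\end{align}
```

With n = 18, the largest nominal model (intercept, five dummies, var2, var3 and the residual variance, so k = 9) receives a correction of 2·9·10/8 = 22.5, whereas the smallest numeric model (intercept, var1 and the residual variance, so k = 3) receives 2·3·4/14 ≈ 1.71.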

Now, let me rephrase my question:

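Is it valid to compare a linear model containing a numeric dummy variable with a linear model containing a nominal variable? Or is that like comparing apples and oranges?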

1 Answer:

Answer 0: (score: 1)

lm does the dummy coding internally. If you do it by hand, you get exactly the same result:

fit1 <- lm(Sepal.Length ~ Species, iris)
fit2 <- lm(Sepal.Length ~ model.matrix(fit1), iris)
AIC(fit1, fit2)
#  df     AIC
#fit1  4 231.452
#fit2  4 231.452

So yes, that's fine.
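
The same check can be carried through to AICc with base R only. The following is a minimal sketch: `aicc()` is a hypothetical helper written for this example (not a package function), and the indicator columns `versicolor` and `virginica` are built by hand for illustration. The third fit is included only to show that a single numeric predictor gives a smaller model fitted to the same response, which is what the small-sample correction then penalizes differently.

```r
# Minimal sketch (base R only): factor coding vs. hand-built dummy columns.
# aicc() is a hypothetical helper for this example, using the textbook correction.
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")  # estimated parameters, incl. residual variance
  n <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

fit.factor <- lm(Sepal.Length ~ Species, data = iris)

# Build the indicator columns by hand; setosa is the reference level.
iris2 <- transform(iris,
                   versicolor = as.numeric(Species == "versicolor"),
                   virginica  = as.numeric(Species == "virginica"))
fit.dummy <- lm(Sepal.Length ~ versicolor + virginica, data = iris2)

# A single numeric predictor: a different model with k = 3 instead of k = 4.
fit.numeric <- lm(Sepal.Length ~ Petal.Length, data = iris)

c(factor = aicc(fit.factor), dummy = aicc(fit.dummy), numeric = aicc(fit.numeric))
all.equal(aicc(fit.factor), aicc(fit.dummy))  # TRUE: same model, same AICc
```

All three models are fitted to the same response on the same rows, which is the condition under which their information criteria sit on a common scale.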