Question

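I am transitioning from Stata to R. In Stata, if I label the levels of a factor (say, 0 and 1) as (M and F), the 0 and 1 stay as they are. Moreover, this 0/1 coding is what dummy-variable linear regression requires in most software, including Excel and SPSS.

However, I have noticed that R defaults factor levels to 1, 2 instead of 0, 1. I don't know why R does this, even though the regression internally (and correctly) assumes 0 and 1 for the factor variable. I would appreciate any help.

Here is what I did:

Try #1: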

sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(1,0),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 2 1 2 1 1

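It seems the factor levels have now been reset to 1 and 2. I believe 1 and 2 are references to the factor levels here. However, I have lost the original values, i.e., 0 and 1.

Try #2: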

sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(0,1),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 1 2 1 2 2

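Same as above. My 0s and 1s are now 1s and 2s. Quite surprising. Why is this happening?

Try #3: Now I wanted to see whether the 1s and 2s have any adverse effect on regression. So here is what I did.

Here is my data: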

> head(data.frame(sassign$total_,sassign$gender))
  sassign.total_ sassign.gender
1            357              M
2            138              M
3            172              F
4            272              F
5            149              F
6            113              F

myfit<-lm(sassign$total_ ~ sassign$gender)

myfit$coefficients
    (Intercept) sassign$genderM 
      200.63522        23.00606

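So it turns out the means are correct. When running the regression, R does use 0 and 1 values for the dummy variable.

I did check other threads on SO, but they mostly discuss how R codes factor variables without telling me why. Stata and SPSS generally require the base category to be "0", which is why I thought of this question.

I would appreciate any thoughts.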

Answer 1

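R is not Stata, and you will need to unlearn a lot of what you were taught about constructing dummy variables. R does that work behind the scenes for you. You cannot make R behave exactly like Stata. True, R does have 0s and 1s in the model-matrix column for the non-reference level; the integer codes of the factor (1 and 2 in this case) only index its levels. Contrasts are always about differences anyway, and the difference between (0, 1) is the same as the difference between (1, 2).

Example data: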

dput(dat)
structure(list(total = c(357L, 138L, 172L, 272L, 149L, 113L), 
    gender = structure(c(2L, 2L, 1L, 1L, 1L, 1L), .Label = c("F", 
    "M"), class = "factor")), .Names = c("total", "gender"), row.names = c("1", 
"2", "3", "4", "5", "6"), class = "data.frame")

These two regression models have different model matrices (the model matrix is how R constructs its "dummy variables").

> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderM 
      176.5        71.0 
> dat$gender=factor(dat$gender, levels=c("M","F") )
> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderF 
      247.5       -71.0 
> model.matrix(myfit)
  (Intercept) genderF
1           1       0
2           1       0
3           1       1
4           1       1
5           1       1
6           1       1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"

> dat$gender=factor(dat$gender, levels=c("F","M") )
> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderM 
      176.5        71.0 
> model.matrix(myfit)
  (Intercept) genderM
1           1       1
2           1       1
3           1       0
4           1       0
5           1       0
6           1       0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"
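
As a quick check of the point about differences, here is a minimal sketch that refits the same six rows with a hand-made 0/1 numeric dummy and compares it with the factor fit. The data frame is rebuilt so the block runs on its own, and the column name is_male is a hypothetical name introduced only for illustration.

```r
# Rebuild the six-row example so this block is self-contained
dat <- data.frame(
  total  = c(357L, 138L, 172L, 272L, 149L, 113L),
  gender = factor(c("M", "M", "F", "F", "F", "F"), levels = c("F", "M"))
)

# Hand-made 0/1 dummy, as one might build it in Stata
# (is_male is a hypothetical column name)
dat$is_male <- as.integer(dat$gender == "M")

fit_factor <- lm(total ~ gender,  data = dat)  # R builds the dummy itself
fit_dummy  <- lm(total ~ is_male, data = dat)  # explicit 0/1 dummy

# Same intercept and slope; only the coefficient names differ
coef_factor <- unname(coef(fit_factor))
coef_dummy  <- unname(coef(fit_dummy))
print(all.equal(coef_factor, coef_dummy))

# Intercept = mean of the reference level, slope = difference in group means
group_means <- tapply(dat$total, dat$gender, mean)
print(group_means)
```

The integer codes 1 and 2 stored inside the factor never enter the fit; only the 0/1 column derived from the levels does.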

Answer 2

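In short, you are simply conflating two different concepts. I will clarify them one by one below.

The meaning of the integers you see in str()

What you see from str() is the internal representation of a factor variable. A factor is stored internally as an integer vector, where each number gives the position of that element's level within the vector of levels. For example: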

x <- gl(3, 2, labels = letters[1:3])
#[1] a a b b c c
#Levels: a b c

storage.mode(x)  ## or `typeof(x)`
#[1] "integer"

str(x)
# Factor w/ 3 levels "a","b","c": 1 1 2 2 3 3

as.integer(x)
#[1] 1 1 2 2 3 3

levels(x)
#[1] "a" "b" "c"

A common use of these positions is to perform as.character(x) in the most efficient way:

levels(x)[x]
#[1] "a" "a" "b" "b" "c" "c"
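
The same indexing idiom speaks to the worry about "losing" the original 0/1 values: the factor does not store them, but they can be recomputed from the levels or from the integer codes. A minimal sketch, assuming the factor was built from 0/1 data as in the question; the variable names are illustrative only.

```r
sex_raw <- c(0, 1, 0, 1, 1)

# Without labels, the levels keep the original values as the strings "0" and "1"
sex_plain <- factor(sex_raw, levels = c(0, 1))
recovered_from_levels <- as.numeric(levels(sex_plain))[sex_plain]

# With labels, the levels are "F" and "M", but level 1 stands for 0 and
# level 2 for 1, so the integer codes minus one give the values back
sex_labelled <- factor(sex_raw, levels = c(0, 1), labels = c("F", "M"))
recovered_from_codes <- as.integer(sex_labelled) - 1L

print(recovered_from_levels)
print(recovered_from_codes)
```

The second approach only works because the levels were declared in the order c(0, 1); with levels = c(1, 0), as in Try #1, the mapping would be reversed.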

Your misunderstanding of the model matrix

It seems to me that you think the model matrix is obtained by

cbind(1L, as.integer(x))
#     [,1] [,2]
#[1,]    1    1
#[2,]    1    1
#[3,]    1    2
#[4,]    1    2
#[5,]    1    3
#[6,]    1    3

This is not true. Done this way, you are merely treating the factor variable as a numeric variable.

The model matrix is constructed this way:

xlevels <- levels(x)
cbind(1L, match(x, xlevels[2], nomatch=0), match(x, xlevels[3], nomatch=0))
#     [,1] [,2] [,3]
#[1,]    1    0    0
#[2,]    1    0    0
#[3,]    1    1    0
#[4,]    1    1    0
#[5,]    1    0    1
#[6,]    1    0    1

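1 and 0 denote "match" / "occurrence" and "no match" / "no occurrence", respectively.

The R routine model.matrix does this for you efficiently, with easy-to-read column names and row names:

model.matrix(~x)
#  (Intercept) xb xc
#1           1  0  0
#2           1  0  0
#3           1  1  0
#4           1  1  0
#5           1  0  1
#6           1  0  1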
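
The 0/1 pattern comes from the contrast matrix attached to the factor. A minimal sketch using base R's contrasts(), assuming the default treatment contrasts (contr.treatment) are in effect for unordered factors:

```r
x <- gl(3, 2, labels = letters[1:3])

# Coding matrix: one row per level, one column per non-reference level
coding <- contrasts(x)
print(coding)

# Indexing the coding matrix by the factor's integer codes reproduces
# the dummy columns of the model matrix (everything except the intercept)
dummies_by_hand <- coding[as.integer(x), , drop = FALSE]
dummies_from_mm <- model.matrix(~ x)[, -1, drop = FALSE]

print(unname(dummies_by_hand))
print(unname(dummies_from_mm))
```

This is where the integer codes 1, 2, 3 are actually used: as row indices into the coding matrix, not as regressor values.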

Writing an R function to produce the model matrix ourselves

We can write a nominal routine mm to produce a model matrix. Although it is far less efficient than model.matrix, it may help you digest the concept.

mm <- function (x, contrast = TRUE) {
  xlevels <- levels(x)
  lst <- lapply(xlevels, function (z) match(x, z, nomatch = 0L))
  if (contrast) do.call("cbind", c(list(1L), lst[-1]))
  else do.call("cbind", lst)
}

For example, if we have a factor y with 5 levels:

set.seed(1); y <- factor(sample(1:5, 10, replace=TRUE), labels = letters[1:5])
y
# [1] b b c e b e e d d a
#Levels: a b c d e

str(y)
#Factor w/ 5 levels "a","b","c","d",..: 2 2 3 5 2 5 5 4 4 1

Its model matrices with and without contrast treatment are, respectively:

mm(y, TRUE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    0    0    0
# [2,]    1    1    0    0    0
# [3,]    1    0    1    0    0
# [4,]    1    0    0    0    1
# [5,]    1    1    0    0    0
# [6,]    1    0    0    0    1
# [7,]    1    0    0    0    1
# [8,]    1    0    0    1    0
# [9,]    1    0    0    1    0
#[10,]    1    0    0    0    0

mm(y, FALSE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    1    0    0    0
# [2,]    0    1    0    0    0
# [3,]    0    0    1    0    0
# [4,]    0    0    0    0    1
# [5,]    0    1    0    0    0
# [6,]    0    0    0    0    1
# [7,]    0    0    0    0    1
# [8,]    0    0    0    1    0
# [9,]    0    0    0    1    0
#[10,]    1    0    0    0    0

The corresponding model.matrix calls would be, respectively:

model.matrix(~ y)
model.matrix(~ y - 1)
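
A quick consistency check, sketched under the assumption that mm is defined as above (it is repeated here so the block runs on its own) and that the default treatment contrasts are in use. The factor is typed in explicitly rather than sampled, because the values drawn by sample() after set.seed(1) depend on the R version.

```r
mm <- function(x, contrast = TRUE) {
  xlevels <- levels(x)
  lst <- lapply(xlevels, function(z) match(x, z, nomatch = 0L))
  if (contrast) do.call("cbind", c(list(1L), lst[-1])) else do.call("cbind", lst)
}

# Same values as the printed y above, entered directly
y <- factor(c("b", "b", "c", "e", "b", "e", "e", "d", "d", "a"),
            levels = letters[1:5])

# Compare values only; model.matrix also carries dimnames and attributes
same_with_contrast    <- all(mm(y, TRUE)  == unname(model.matrix(~ y)))
same_without_contrast <- all(mm(y, FALSE) == unname(model.matrix(~ y - 1)))

print(same_with_contrast)
print(same_without_contrast)
```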
