在使用正则表达式之前,我已经为此制定了一个非常冗长的解决方案,但我希望有更原生的方法来实现它。
给定一个模型,或许像
data(tips, package="reshape2")
mod <- lm(tip ~ total_bill*sex + sex*day, tips)
mod$coefficients
我想确定哪些系数与公式中的哪个变量一致。像这样:
| Coefficient | Variable |
|:-------------------|:---------------|
| total_bill | total_bill |
| sexMale | sex |
| daySat | day |
| daySun | day |
| dayThur | day |
| total_bill:sexMale | total_bill,sex |
| sexMale:daySat | sex,day |
| sexMale:daySun | sex,day |
| sexMale:dayThur | sex,day |
我已经检查了model.matrix
和model.formula
,但那些让我看到了这行代码
.Internal(model.matrix(t, data))
我潜入C代码,但我认为必须有一个更简单的方法。有吗?
为了回应DWin的好回答,我构建了一个复杂的例子,正则表达式可能会失败。这是正则表达式让我害怕的边缘案例之一。
data.frame
是使用变量名称和值构建的,很容易混淆,这种情况很多。
baseball <- data.frame(Bat=sample(1:100, 20, replace=T), Batter=sample(c("David", "Batley", "Bob", "Ace"), 20, replace=T), Hits=sample(1:20, 20, replace=T))
baseball
bMod <- lm(Hits ~ Bat*Batter, baseball)
bMod$coefficients
col.matx <- sapply(colnames(model.matrix(bMod)), function(cols) sapply(labels(bMod), function(trm) grep(patt=trm, x=cols, value=TRUE)))
这会将连续变量Bat
与factor
Batter
的所有系数相匹配。
是的,这是一个愚蠢的例子,但很容易发生。
答案 0 :(得分:0)
这不是确切的答案,但应该能够找到一种方法来重新排列它以适合您的目的。第一步确定每个labels
列名称中model.matrix
中的哪一个。
col.matx <- sapply(colnames(model.matrix(mod)), function(cols)
sapply(labels(mod), function(trm) grep(patt=trm, x=cols, value=TRUE)))
#---------------------------------------
(Intercept) total_bill sexMale daySat daySun dayThur
total_bill Character,0 "total_bill" Character,0 Character,0 Character,0 Character,0
sex Character,0 Character,0 "sexMale" Character,0 Character,0 Character,0
day Character,0 Character,0 Character,0 "daySat" "daySun" "dayThur"
total_bill:sex Character,0 Character,0 Character,0 Character,0 Character,0 Character,0
sex:day Character,0 Character,0 Character,0 Character,0 Character,0 Character,0
total_bill:sexMale sexMale:daySat sexMale:daySun sexMale:dayThur
total_bill "total_bill:sexMale" Character,0 Character,0 Character,0
sex "total_bill:sexMale" "sexMale:daySat" "sexMale:daySun" "sexMale:dayThur"
day Character,0 "sexMale:daySat" "sexMale:daySun" "sexMale:dayThur"
total_bill:sex "total_bill:sexMale" Character,0 Character,0 Character,0
sex:day Character,0 Character,0 Character,0 Character,0
当值不是零长度项时,结果的rownames需要与矩阵值关联:
> which(sapply(col.matx, length) != 0 , arr.ind=TRUE)
[1] 6 12 18 23 28 31 32 34 37 38 42 43 47 48
因此,这将从上面的矩阵中挑选出项目名称,并将它们与term.labels相关联 使用模运算:
data.frame(coef = unlist(col.matx[
which(sapply(col.matx, length) != 0 , arr.ind=TRUE)] ),
term.label =rownames(col.matx)[
which(sapply(col.matx, length) != 0 , arr.ind=TRUE) %% 5 ])
coef term.label
1 total_bill total_bill
2 sexMale sex
3 daySat day
4 daySun day
5 dayThur day
6 total_bill:sexMale total_bill
7 total_bill:sexMale sex
8 total_bill:sexMale total_bill:sex
9 sexMale:daySat sex
10 sexMale:daySat day
11 sexMale:daySun sex
12 sexMale:daySun day
13 sexMale:dayThur sex
14 sexMale:dayThur day
折叠值方法是对SO的频繁请求。有一个在过去24小时内得到了解答。
答案 1 :(得分:0)
好的,使用完全包含在模型中的信息为lm模型找到了解决方案。
require(plyr) # for join function
require(reshape2) # for melt function
matchCoefs <- function(model)
{
# get the terms
theTerms <- model$terms
# get the assignment position
thePos <- model$assign
# get intercept indicator
inter <- attr(theTerms, "intercept")
# get coef names
coefNames <- names(coef(model))
# get pred names
predNames <- attr(theTerms, "term.labels")
# expand out pred names to match coefficient names
predNames <- predNames[thePos]
# if there's an intercept term add it to the pred names
if(inter == 1)
{
predNames <- c("(Intercept)", predNames)
}
# build data.frame linking term to coefficient name
matching <- data.frame(Term=predNames, Coefficient=coefNames)
## now match individual predictor to term
# get matrix as data.frame
factorMat <- as.data.frame(attr(theTerms, "factor"))
# add column from rownames as identifier
factorMat$.Pred <- rownames(factorMat)
# melt it down for comparison
factorMelt <- melt(factorMat, id.vars=".Pred", variable.name="Term", )
# only keep rows where there's a match
factorMelt <- factorMelt[factorMelt$value == 1, ]
# again, bring in coefficient if needed
if(inter == 1)
{
factorMelt <- rbind(data.frame(.Pred="(Intercept)", Term="(Intercept)", value=1), factorMelt)
}
# join into the matching data.frame
matching <- join(matching, factorMelt, by="Term")
return(matching)
}
# fit some models with different terms
mod1 <- lm(tip ~ total_bill * sex + day, tips)
mod2 <- lm(tip ~ total_bill * sex + day - 1, tips)
mod3 <- lm(tip ~ (total_bill + sex + day)^3, tips)
mod4 <- lm(tip ~ total_bill * sex + day + I(total_bill^2), tips)
matchCoefs(mod1)
matchCoefs(mod2)
matchCoefs(mod3)
matchCoefs(mod4)
# now with the convoluted baseball example
baseball <- data.frame(Bat=sample(1:100, 20, replace=T), Batter=sample(c("David", "Batley", "Bob", "Ace"), 20, replace=T), Hits=sample(1:20, 20, replace=T))
bMod <- lm(Hits ~ Bat*Batter, baseball)
matchCoefs(bMod)
没有循环,所有内置函数,没有正则表达式。我需要对此进行更多测试并输入数据类型信息,但这应该相对简单。