Question

在使用正则表达式之前，我已经为此制定了一个非常冗长的解决方案，但我希望有更原生的方法来实现它。

给定一个模型，或许像

data(tips, package="reshape2")
mod <- lm(tip ~ total_bill*sex + sex*day, tips)
mod$coefficients

我想确定哪些系数与公式中的哪个变量一致。像这样：

|    Coefficient     |    Variable    |
|:-------------------|:---------------|
| total_bill         | total_bill     |
| sexMale            | sex            |
| daySat             | day            |
| daySun             | day            |
| dayThur            | day            |
| total_bill:sexMale | total_bill,sex |
| sexMale:daySat     | sex,day        |
| sexMale:daySun     | sex,day        |
| sexMale:dayThur    | sex,day        |

我已经检查了model.matrix和model.formula，但那些让我看到了这行代码

.Internal(model.matrix(t, data))

我潜入C代码，但我认为必须有一个更简单的方法。有吗？

为了回应DWin的好回答，我构建了一个复杂的例子，正则表达式可能会失败。这是正则表达式让我害怕的边缘案例之一。

data.frame是使用变量名称和值构建的，很容易混淆，这种情况很多。

baseball <- data.frame(Bat=sample(1:100, 20, replace=T), Batter=sample(c("David", "Batley", "Bob", "Ace"), 20, replace=T), Hits=sample(1:20, 20, replace=T))
baseball
bMod <- lm(Hits ~ Bat*Batter, baseball)
bMod$coefficients

col.matx <- sapply(colnames(model.matrix(bMod)), function(cols) sapply(labels(bMod), function(trm) grep(patt=trm, x=cols, value=TRUE)))

这会将连续变量Bat与factor Batter的所有系数相匹配。

是的，这是一个愚蠢的例子，但很容易发生。

Answer 1

这不是确切的答案，但应该能够找到一种方法来重新排列它以适合您的目的。第一步确定每个labels列名称中model.matrix中的哪一个。

col.matx <- sapply(colnames(model.matrix(mod)), function(cols) 
         sapply(labels(mod), function(trm) grep(patt=trm, x=cols, value=TRUE)))
#---------------------------------------
               (Intercept) total_bill   sexMale     daySat      daySun      dayThur    
total_bill     Character,0 "total_bill" Character,0 Character,0 Character,0 Character,0
sex            Character,0 Character,0  "sexMale"   Character,0 Character,0 Character,0
day            Character,0 Character,0  Character,0 "daySat"    "daySun"    "dayThur"  
total_bill:sex Character,0 Character,0  Character,0 Character,0 Character,0 Character,0
sex:day        Character,0 Character,0  Character,0 Character,0 Character,0 Character,0
               total_bill:sexMale   sexMale:daySat   sexMale:daySun   sexMale:dayThur  
total_bill     "total_bill:sexMale" Character,0      Character,0      Character,0      
sex            "total_bill:sexMale" "sexMale:daySat" "sexMale:daySun" "sexMale:dayThur"
day            Character,0          "sexMale:daySat" "sexMale:daySun" "sexMale:dayThur"
total_bill:sex "total_bill:sexMale" Character,0      Character,0      Character,0      
sex:day        Character,0          Character,0      Character,0      Character,0

当值不是零长度项时，结果的rownames需要与矩阵值关联：

> which(sapply(col.matx, length) != 0 , arr.ind=TRUE)
 [1]  6 12 18 23 28 31 32 34 37 38 42 43 47 48

因此，这将从上面的矩阵中挑选出项目名称，并将它们与term.labels相关联使用模运算：

data.frame(coef =  unlist(col.matx[
                      which(sapply(col.matx, length) != 0 , arr.ind=TRUE)] ), 
           term.label =rownames(col.matx)[
                       which(sapply(col.matx, length) != 0 , arr.ind=TRUE) %% 5 ])
                 coef     term.label
1          total_bill     total_bill
2             sexMale            sex
3              daySat            day
4              daySun            day
5             dayThur            day
6  total_bill:sexMale     total_bill
7  total_bill:sexMale            sex
8  total_bill:sexMale total_bill:sex
9      sexMale:daySat            sex
10     sexMale:daySat            day
11     sexMale:daySun            sex
12     sexMale:daySun            day
13    sexMale:dayThur            sex
14    sexMale:dayThur            day

折叠值方法是对SO的频繁请求。有一个在过去24小时内得到了解答。

Answer 2

好的，使用完全包含在模型中的信息为lm模型找到了解决方案。

require(plyr)       # for join function
require(reshape2)   # for melt function

matchCoefs <- function(model)
{
    # get the terms
    theTerms <- model$terms
    # get the assignment position
    thePos <- model$assign
    # get intercept indicator
    inter <- attr(theTerms, "intercept")
    # get coef names
    coefNames <- names(coef(model))
    # get pred names
    predNames <- attr(theTerms, "term.labels")
    # expand out pred names to match coefficient names
    predNames <- predNames[thePos]
    # if there's an intercept term add it to the pred names
    if(inter == 1)
    {
        predNames <- c("(Intercept)", predNames)
    }

    # build data.frame linking term to coefficient name
    matching <- data.frame(Term=predNames, Coefficient=coefNames)

    ## now match individual predictor to term
    # get matrix as data.frame
    factorMat <- as.data.frame(attr(theTerms, "factor"))
    # add column from rownames as identifier
    factorMat$.Pred <- rownames(factorMat)
    # melt it down for comparison
    factorMelt <- melt(factorMat, id.vars=".Pred", variable.name="Term", )
    # only keep rows where there's a match
    factorMelt <- factorMelt[factorMelt$value == 1, ]
    # again, bring in coefficient if needed
    if(inter == 1)
    {
        factorMelt <- rbind(data.frame(.Pred="(Intercept)", Term="(Intercept)", value=1), factorMelt)
    }
    # join into the matching data.frame
    matching <- join(matching, factorMelt, by="Term")

    return(matching)
}

# fit some models with different terms
mod1 <- lm(tip ~ total_bill * sex + day, tips)
mod2 <- lm(tip ~ total_bill * sex + day - 1, tips)
mod3 <- lm(tip ~ (total_bill + sex + day)^3, tips)
mod4 <- lm(tip ~ total_bill * sex + day + I(total_bill^2), tips)

matchCoefs(mod1)
matchCoefs(mod2)
matchCoefs(mod3)
matchCoefs(mod4)

# now with the convoluted baseball example
baseball <- data.frame(Bat=sample(1:100, 20, replace=T), Batter=sample(c("David", "Batley", "Bob", "Ace"), 20, replace=T), Hits=sample(1:20, 20, replace=T))
bMod <- lm(Hits ~ Bat*Batter, baseball)
matchCoefs(bMod)

没有循环，所有内置函数，没有正则表达式。我需要对此进行更多测试并输入数据类型信息，但这应该相对简单。

R中的系数变量成员资格

2 个答案: