Question

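Background: I'm running a small A/B test with a 2x2 factorial design (black foreground and white background, versus the normal colors), and Analytics reports, for each of the 4 conditions, the number of hits and the rate at which they "converted" (a binary variable, which I define as spending at least 40 seconds on the page). It's easy enough to do a little editing and get a nice R data frame: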

rates <- read.csv(stdin(),header=TRUE)
Black,White,N,Rate
TRUE,FALSE,512,0.2344
FALSE,TRUE,529,0.2098
TRUE,TRUE,495,0.1919
FALSE,FALSE,510,0.1882

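Of course, I'd like to look at a logistic regression such as Rate ~ Black * White, but R's glm wants a 2046-row data frame, with each row reporting one TRUE or FALSE conversion value and the values of Black and White. This is... a bit tricky. I googled and checked SO, but while I found some clunky code for turning contingency count tables into data frames, I found nothing about percentages or rates.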

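After a lot of trouble, I came up with a loop over the 4 conditions in which I repeat a data frame holding the relevant condition values and the result TRUE rate * n times, then do the same for (1 - rate) * n and the result FALSE, and finally stitch all 8 data frames together into one giant data frame: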

ground <- NULL
for (i in 1:nrow(rates)) {
        x <- rates[i,]
        y <- do.call("rbind", replicate((x$N * x$Rate),     data.frame(Black=c(x$Black),White=c(x$White),Conversion=c(TRUE)),  simplify = FALSE))
        z <- do.call("rbind", replicate((x$N * (1-x$Rate)), data.frame(Black=c(x$Black),White=c(x$White),Conversion=c(FALSE)), simplify = FALSE))
        ground <- rbind(ground,y,z)
}

The resulting data frame ground looks right:

sum(rates$N)
[1] 2046
nrow(ground)
[1] 2042
# the missing 4 are probably from the rounding-off of the reported conversion rate
summary(ground); head(ground, n=20)
   Black           White         Conversion     
 Mode :logical   Mode :logical   Mode :logical  
 FALSE:1037      FALSE:1020      FALSE:1623     
 TRUE :1005      TRUE :1022      TRUE :419      
 NA's :0         NA's :0         NA's :0        
   Black White Conversion
1   TRUE FALSE       TRUE
2   TRUE FALSE       TRUE
3   TRUE FALSE       TRUE
4   TRUE FALSE       TRUE
5   TRUE FALSE       TRUE
6   TRUE FALSE       TRUE
7   TRUE FALSE       TRUE
8   TRUE FALSE       TRUE
9   TRUE FALSE       TRUE
10  TRUE FALSE       TRUE
11  TRUE FALSE       TRUE
12  TRUE FALSE       TRUE
13  TRUE FALSE       TRUE
14  TRUE FALSE       TRUE
15  TRUE FALSE       TRUE
16  TRUE FALSE       TRUE
17  TRUE FALSE       TRUE
18  TRUE FALSE       TRUE
19  TRUE FALSE       TRUE
20  TRUE FALSE       TRUE

Likewise, the logistic regression spits out a plausible-looking answer:

g <- glm(Conversion ~ Black*White, family=binomial, data=ground); summary(g)
...
Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-0.732  -0.683  -0.650  -0.643   1.832  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)           -1.472      0.114  -12.94   <2e-16
BlackTRUE              0.291      0.154    1.88    0.060
WhiteTRUE              0.137      0.156    0.88    0.381
BlackTRUE:WhiteTRUE   -0.404      0.220   -1.84    0.066

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2072.7  on 2041  degrees of freedom
Residual deviance: 2068.2  on 2038  degrees of freedom
AIC: 2076

Number of Fisher Scoring iterations: 4

So my question is: is there a more elegant way to convert my Google Analytics rate data into glm input than that awful loop?

Answer 1

rates$counts <- rates$N*rates$Rate
rates$counts <- round(rates$counts,0)
 rates
#----------
  Black White   N   Rate counts
1  TRUE FALSE 512 0.2344    120
2 FALSE  TRUE 529 0.2098    111
3  TRUE  TRUE 495 0.1919     95
4 FALSE FALSE 510 0.1882     96

> rates$failures <- rates$N - rates$counts
> glm(cbind(counts,failures)~Black*White, data=rates, family="binomial")

Call:  glm(formula = cbind(counts, failures) ~ Black * White, family = "binomial", 
    data = rates)

Coefficients:
        (Intercept)            BlackTRUE            WhiteTRUE  
            -1.4615               0.2777               0.1356  
BlackTRUE:WhiteTRUE  
            -0.3894  

Degrees of Freedom: 3 Total (i.e. Null);  0 Residual
Null Deviance:      4.104 
Residual Deviance: -7.461e-14   AIC: 33.05
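
The steps above can be put together as a self-contained sketch. The inline read.csv(text = ...) call and the name fit are my own choices for illustration, not part of the original answer; the numbers are the ones from the question. Calling summary() on the fit also gives standard errors and z tests comparable to the output shown in the question.

```r
# Rebuild the summary table inline (same numbers as in the question)
rates <- read.csv(text = "Black,White,N,Rate
TRUE,FALSE,512,0.2344
FALSE,TRUE,529,0.2098
TRUE,TRUE,495,0.1919
FALSE,FALSE,510,0.1882")

# Successes are recovered from the rounded rate; failures are the remainder
rates$counts   <- round(rates$N * rates$Rate)
rates$failures <- rates$N - rates$counts

# Two-column (successes, failures) response for a binomial GLM
fit <- glm(cbind(counts, failures) ~ Black * White,
           data = rates, family = binomial)

# Coefficient table with standard errors and z tests
summary(fit)$coefficients
```

Because the model has four parameters for four rows, it is saturated, which is why the residual deviance is essentially zero.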

Answer 2

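One thing is how to convert your data. Another is why. From ?glm: "[f]or binomial [...] famil[y] the response can also be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures." The first way corresponds to your "R glm wants a 2046-row data frame, with each row reporting one TRUE or FALSE conversion". The second way basically corresponds to your original data set, where the "successes" are easily calculated from Rate and N. A third way is to use the proportion of successes for each treatment combination as the response variable, in which case the number of trials has to be supplied in the weights argument.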

set.seed(1)
# one row per observation
df1 <- data.frame(x = sample(c("yes", "no"), 40, replace = TRUE),
                  y = sample(c("yes", "no"), 40, replace = TRUE),
                  z = rbinom(n = 40, size = 1, prob = 0.5))
df1

library(plyr)
# aggregated data with one row per treatment combination
df2 <- ddply(.data = df1, .variables = .(x, y), summarize,
             n = length(z),
             rate = sum(z)/n,
             success = n*rate,
             failure = n - success)  
df2

# three different ways to specify the models,
# which all give the same parameter estimates for x, y and x*y
mod1 <- glm(z ~ x * y, data = df1, family = binomial) 
mod2 <- glm(cbind(success, failure) ~ x * y, data = df2, family = binomial)
mod3 <- glm(rate ~ x * y, data = df2, weights = n, family = binomial)

summary(mod1)
summary(mod2)
summary(mod3)
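
To check the claim that the three specifications agree without reading three summaries by eye, the coefficients can be compared directly. This is a minimal sketch under my own assumptions: it swaps plyr for base-R aggregate() so that it runs without extra packages, and the tolerance of 1e-4 is an arbitrary choice that allows for small differences in where the iterative fits stop.

```r
set.seed(1)

# One row per observation, simulated as in the answer above
df1 <- data.frame(x = sample(c("yes", "no"), 40, replace = TRUE),
                  y = sample(c("yes", "no"), 40, replace = TRUE),
                  z = rbinom(n = 40, size = 1, prob = 0.5))

# Base-R aggregation: successes and trials per x/y combination
df2 <- aggregate(z ~ x + y, data = df1, FUN = sum)
names(df2)[3] <- "success"
df2$n       <- aggregate(z ~ x + y, data = df1, FUN = length)$z
df2$failure <- df2$n - df2$success
df2$rate    <- df2$success / df2$n

# The three response specifications described in ?glm
mod1 <- glm(z ~ x * y, data = df1, family = binomial)
mod2 <- glm(cbind(success, failure) ~ x * y, data = df2, family = binomial)
mod3 <- glm(rate ~ x * y, data = df2, weights = n, family = binomial)

# Compare the estimated coefficients pairwise
all.equal(coef(mod1), coef(mod2), tolerance = 1e-4)
all.equal(coef(mod1), coef(mod3), tolerance = 1e-4)
```

The estimates and standard errors should match across the three fits, but the reported deviances and degrees of freedom differ between the row-per-observation fit and the aggregated fits, because the aggregated data have only one row per treatment combination.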

Answer 3

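It's not entirely clear what you want to convert, but if all you need is N copies of each row, with N taken from the N column, then (edit: I was sloppy) first convert all factors in the original data frame to numeric or character as needed. Then,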

# just put in placeholder values
newdf<-data.frame(Black="n",White="n",Rate=0,stringsAsFactors=FALSE) 
newdf[1:rates[1,3],]<-rates[1,c(1,2,4)]
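# rows for the second condition start right after the rows of the first
newdf[(rates[1,3] + 1):(rates[1,3] + rates[2,3]),] <- rates[2,c(1,2,4)]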

and so on for each row in the original rates data frame.
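
The row-by-row assignment above can also be done in one step with rep(). The following is only a sketch, assuming the rates data frame from the question; the names succ, fail, idx, long and fit_long are illustrative, and the Conversion column is added because glm needs a binary response. It uses round() for the counts: as far as I can tell, replicate() truncates a fractional count instead of rounding it, which would account for the 2042 rather than 2046 rows produced by the loop in the question.

```r
# Summary table from the question, rebuilt inline
rates <- read.csv(text = "Black,White,N,Rate
TRUE,FALSE,512,0.2344
FALSE,TRUE,529,0.2098
TRUE,TRUE,495,0.1919
FALSE,FALSE,510,0.1882")

# Whole-number successes and failures per condition
succ <- round(rates$N * rates$Rate)
fail <- rates$N - succ

# Repeat each condition's row index once per success, then once per failure
idx <- c(rep(seq_len(nrow(rates)), times = succ),
         rep(seq_len(nrow(rates)), times = fail))

# Long format: one row per visitor, successes first, then failures
long <- data.frame(rates[idx, c("Black", "White")],
                   Conversion = rep(c(TRUE, FALSE),
                                    times = c(sum(succ), sum(fail))),
                   row.names = NULL)

nrow(long)  # 2046 rows, matching sum(rates$N)

# Same logistic regression as in the question, on the expanded data
fit_long <- glm(Conversion ~ Black * White, family = binomial, data = long)
coef(fit_long)
```

This keeps the expansion vectorized, so no loop or repeated rbind is needed, though fitting the aggregated counts directly avoids building the long data frame at all.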

Elegantly convert rate summary rows to long binary response rows?
