My data frame looks like this:
> head(data1)
Age Gender Impressions Clicks Signed_In agecat scode
1 36 0 3 0 1 (34,44] Imps
2 73 1 3 0 1 (64, Inf] Imps
3 30 0 3 0 1 (24,34] Imps
4 49 1 3 0 1 (44,54] Imps
5 47 1 11 0 1 (44,54] Imps
6 47 0 11 1 1 (44,54] Clicks
Its structure:
> str(data1)
'data.frame': 458441 obs. of 7 variables:
$ Age : int 36 73 30 49 47 47 0 46 16 52 ...
$ Gender : int 0 1 0 1 1 0 0 0 0 0 ...
$ Impressions: int 3 3 3 3 11 11 7 5 3 4 ...
$ Clicks : int 0 0 0 0 0 1 1 0 0 0 ...
$ Signed_In : int 1 1 1 1 1 1 0 1 1 1 ...
$ agecat : Factor w/ 8 levels "(-Inf,0]","(0,18]",..: 5 8 4 6 6 6 1 6 2 6 ...
$ scode : Factor w/ 3 levels "Clicks","Imps",..: 2 2 2 2 2 1 1 2 2 2 ...
>
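For each row, I want to compute the click-through rate (CTR), defined as (Clicks / Impressions) * 100.
I then want the mean CTR for each gender within each age category. Something like: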
Gender 0, Category (0,18] CTR = ??.
Gender 1, Category (0,18] CTR = ??.
Gender 0, Category (18,24] CTR = ??.
Gender 1, Category (18,24] CTR = ??.
and so on...
How can I do this in R?
Here is what I tried first, grouping by gender only:
> calcCTR <- function(var1,var2){
+ (var1*100)/var2
+ }
Calling it with summaryBy:
> summaryBy(Clicks~Gender, data=data1, FUN=calcCTR, var2=data1$Impressions)
This took an inexplicably long time.
Another approach:
> summaryBy(((Clicks*100)/Impressions)~Gender, data=data1, FUN=sum)
Gender ((Clicks * 100)/Impressions).sum
1 0 NaN
2 1 NaN
>
I also added a ctr column to the data:
> data1$ctr = (data1$Clicks/data1$Impressions)*100
> head(data1)
Age Gender Impressions Clicks Signed_In agecat scode ctr
1 36 0 3 0 1 (34,44] Imps 0.000000
2 73 1 3 0 1 (64, Inf] Imps 0.000000
3 30 0 3 0 1 (24,34] Imps 0.000000
4 49 1 3 0 1 (44,54] Imps 0.000000
5 47 1 11 0 1 (44,54] Imps 0.000000
6 47 0 11 1 1 (44,54] Clicks 9.090909
>
However, when I summarize it by gender or by age category, I get NaN.
> summaryBy(ctr~agecat,
+ data=data1);
agecat ctr.mean
1 (-Inf,0] NaN
2 (0,18] NaN
3 (18,24] NaN
4 (24,34] NaN
5 (34,44] NaN
6 (44,54] NaN
7 (54,64] NaN
8 (64, Inf] NaN
> summaryBy(ctr~Gender,
+ data=data1);
Gender ctr.mean
1 0 NaN
2 1 NaN
>
Answer 0 (score: 1)
This should get you going:
library(data.table)
dt = as.data.table(data1)
dt[, mean((Clicks/Impressions)*100), by = list(Gender, agecat)]
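One thing to watch for: the NaN results in the question are consistent with some rows having Impressions equal to 0, since 0/0 is NaN in R and mean() of a vector containing NaN is NaN. That is an assumption about the data, not something the output shown confirms. A minimal sketch on a made-up toy table (the `toy` data and the `ctr` column name are hypothetical) that filters those rows out before averaging:

```r
library(data.table)

# Hypothetical toy data in the same shape as data1 (values are made up)
toy <- data.table(
  Gender      = c(0, 0, 1, 1, 0, 1),
  agecat      = c("(0,18]", "(0,18]", "(0,18]", "(18,24]", "(18,24]", "(18,24]"),
  Impressions = c(4, 0, 5, 10, 2, 0),
  Clicks      = c(1, 0, 0, 1, 1, 0)
)

# Without filtering, any group containing a 0/0 row averages to NaN
print(toy[, .(ctr = mean(Clicks / Impressions * 100)), by = .(Gender, agecat)])

# Keep only rows with at least one impression, then average per group
res <- toy[Impressions > 0,
           .(ctr = mean(Clicks / Impressions * 100)),
           by = .(Gender, agecat)]
print(res)
```

If rows without impressions should count as a CTR of 0 instead of being dropped, replace the ratio for those rows before averaging; which choice is right depends on what the analysis is meant to measure.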
Answer 1 (score: 0)
This simple example should help you get started:
#create our trivial data set
dat<-data.frame(c1=rep(c("a","b"),each=2),c2=rep(1:2,2),val=rnorm(4))
#it is worth learning tapply, lapply, apply and sapply; tapply groups by one or more factors
tapply(dat$val, list(dat$c1,dat$c2),mean)
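Applied to data shaped like the question's, the same tapply idea might look like the sketch below. The `toy` data frame and its values are made up for illustration, and dropping rows with zero impressions is an assumption about how undefined ratios should be handled.

```r
# Hypothetical toy data shaped like data1 (values are made up)
toy <- data.frame(
  Gender      = c(0, 0, 1, 1, 0, 1),
  agecat      = factor(c("(0,18]", "(0,18]", "(0,18]",
                         "(18,24]", "(18,24]", "(18,24]")),
  Impressions = c(4, 0, 5, 10, 2, 0),
  Clicks      = c(1, 0, 0, 1, 1, 0)
)

# Drop rows with no impressions so the ratio is always defined
ok <- toy[toy$Impressions > 0, ]
ok$ctr <- ok$Clicks / ok$Impressions * 100

# Matrix of mean CTR: one row per gender, one column per age category
ctr_table <- tapply(ok$ctr, list(Gender = ok$Gender, agecat = ok$agecat), mean)
print(ctr_table)

# The same means in long form, one row per (Gender, agecat) pair
ctr_long <- aggregate(ctr ~ Gender + agecat, data = ok, FUN = mean)
print(ctr_long)
```

tapply returns a Gender by agecat matrix, which is handy for reading off a single cell, while aggregate gives the long format that matches the "Gender 0, Category (0,18]" listing asked for in the question.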