I have a data frame like this:
product_id view_count purchase_count
1 11 1
2 20 3
3 5 2
...
I want to convert this into a table grouped by ranges of view_count, with purchase_count summed within each interval:
view_count_range total_purchase_count
0-10 45
10-20 65
These view_count ranges all have a fixed size. I would appreciate any suggestions on how to group by ranges like this.
Answer 0: (score: 5)
cut is a handy tool for this. Here is one way:
#First make some data to work with
#I suggest you do this in the future as it makes it
#easier to provide you with assistance.
set.seed(10)
dat <- data.frame(product_id=1:15, view_count=sample(1:20, 15, replace=TRUE),
purchase_count=sample(1:8, 15, replace=TRUE))
dat #look at the data
#now we can use cut and aggregate by this new variable we just created
dat$view_count_range <- with(dat, cut(view_count, c(0, 10, 20)))
aggregate(purchase_count~view_count_range, dat, sum)
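Which yields (the exact totals depend on the random draw; R 3.6.0 changed the default behaviour of sample(), so a current R may print different numbers for the same seed):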
view_count_range purchase_count
1 (0,10] 39
2 (10,20] 31
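The question asks for ranges of a fixed size, so the break points do not have to be typed out by hand. The following is a minimal sketch, not part of the original answer: the data values, `bin_width`, `upper`, and `breaks` are assumptions made for illustration. It also uses `right = FALSE`, because with the default right-closed intervals a view_count of exactly 0 would fall outside `(0,10]` and become NA.

```r
# Hypothetical example data, using the column names from the question
dat <- data.frame(product_id = 1:6,
                  view_count = c(11, 20, 5, 0, 37, 29),
                  purchase_count = c(1, 3, 2, 4, 5, 6))

bin_width <- 10  # assumed fixed range size

# Upper limit: the first multiple of bin_width strictly above the maximum
upper <- bin_width * (floor(max(dat$view_count) / bin_width) + 1)
breaks <- seq(0, upper, by = bin_width)

# right = FALSE gives left-closed bins [0,10), [10,20), ... so 0 is kept
dat$view_count_range <- cut(dat$view_count, breaks, right = FALSE)

res <- aggregate(purchase_count ~ view_count_range, dat, sum)
res
```

Note that aggregate only reports ranges that actually contain rows, so an empty interval would simply be missing from the result.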
Answer 1: (score: 2)
Extending Tyler's answer and starting from his example dat, you may find that a query like this is easier to write, and faster, in data.table:
> require(data.table)
> DT = as.data.table(dat)
> DT[, sum(purchase_count), by=cut(view_count,c(0,10,20))]
cut V1
[1,] (10,20] 31
[2,] (0,10] 39
That's it. Just one line. Easy to write, easy to read.
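Note that it puts the (10,20] group first. That is because by default by preserves the order in which each group first appears in the data (the first view_count in this data set is 11). To get the groups in sorted order instead, change by to keyby: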
> DT[, sum(purchase_count), keyby=cut(view_count,c(0,10,20))]
cut V1
[1,] (0,10] 39
[2,] (10,20] 31
And to name the result columns:
> DT[,list( purchase_count = sum(purchase_count) ),
keyby=list( view_count_range = cut(view_count,c(0,10,20) ))]
view_count_range purchase_count
[1,] (0,10] 39
[2,] (10,20] 31
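To get range labels in the "0-10" style shown in the question, cut accepts a labels argument. This is a rough sketch rather than part of the original answer: the data values, `breaks`, and `range_labels` are assumed for illustration, and the fixed width of 10 is taken from the question's example.

```r
library(data.table)

# Hypothetical example data, using the column names from the question
DT <- data.table(product_id = 1:6,
                 view_count = c(11, 20, 5, 0, 37, 29),
                 purchase_count = c(1, 3, 2, 4, 5, 6))

# Assumed fixed width of 10, wide enough to cover every view_count above
breaks <- seq(0, 40, by = 10)

# Build labels "0-10", "10-20", ... from consecutive break points
range_labels <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")

# right = FALSE makes each range include its lower bound, e.g. 10 goes to "10-20"
res <- DT[, list(total_purchase_count = sum(purchase_count)),
          keyby = list(view_count_range = cut(view_count, breaks,
                                              labels = range_labels,
                                              right = FALSE))]
res
```

With these labels the boundary convention is no longer visible in the output, so it is worth stating somewhere which end of each range is inclusive.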