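My data contains a time variable (shoptime) and a chosen-brand variable (chosenbrand), as shown below. shoptime is the time of the purchase and chosenbrand is the brand bought at that time.
With this data, I would like to create rank variables like the ones shown in the third column, the fourth column, and so on.
The ranking of the brands (e.g. brand1 to brand3) should be based on the past 36 hours. So, to compute the ranks for the second row, whose shoptime is "2013-09-01 08:54:00 UTC", the ranks should be based on all chosenbrand values within the 36 hours before that time. (The brand1 purchase of the second row itself should not be counted in the 36 hours.)
So rank_brand1, rank_brand2, rank_brand3, rank_brand4, ... are the variables I want.
If I also want to create rank_brand5, rank_brand6, and so on, is there a simple way to do it?
Also, if I want to do this per individual (when each customer has a history of several purchases), how can I do that?
The data looks like this: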
shoptime chosenbrand rank_brand1 rank_brand2 rank_brand3, ...
2013-09-01 08:35:00 UTC brand1 NA NA NA
2013-09-01 08:54:00 UTC brand1 1 NA NA
2013-09-01 09:07:00 UTC brand2 1 2 NA
2013-09-01 09:08:00 UTC brand3 1 2 3
2013-09-01 09:11:00 UTC brand5 1 2 3
2013-09-01 09:14:00 UTC brand2 1 2 3
2013-09-01 09:26:00 UTC brand6 1 1 3
2013-09-01 09:26:00 UTC brand2 1 1 3
2013-09-01 09:29:00 UTC brand2 2 1 3
2013-09-01 09:32:00 UTC brand4 2 1 3
Here is the code to create the data:
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
rank_brand1 = NA,
rank_brand2 = NA,
rank_brand3 = NA,
stringsAsFactors = FALSE)
Answer 0 (score: 4)
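This is a tricky problem. The solution below uses a non-equi join to aggregate over the preceding 36 hours, dcast() to reshape from long to wide format, and a second join with the original dat. It works for an arbitrary number of brands.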
library(data.table)
library(lubridate)
setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
    nomatch = 0L, by = .EACHI,
    .SD[, .N, by = brand][, rank := frank(-N, ties.method = "dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
        dat, on = "shoptime"]
               shoptime brand1 brand2 brand3 brand5 brand6  brand
 1: 2013-09-01 08:35:00     NA     NA     NA     NA     NA brand1
 2: 2013-09-01 08:54:00      1     NA     NA     NA     NA brand1
 3: 2013-09-01 09:07:00      1     NA     NA     NA     NA brand2
 4: 2013-09-01 09:08:00      1      2     NA     NA     NA brand3
 5: 2013-09-01 09:11:00      1      2      2     NA     NA brand5
 6: 2013-09-01 09:14:00      1      2      2      2     NA brand2
 7: 2013-09-01 09:26:00      1      1      2      2     NA brand6
 8: 2013-09-01 09:26:00      1      1      2      2     NA brand2
 9: 2013-09-01 09:29:00      2      1      3      3      3 brand2
10: 2013-09-01 09:32:00      2      1      3      3      3 brand4
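The tied ranks in the result above (for example 1 1 2 2 at 09:26:00) come from ties.method = "dense", which leaves no gaps after ties. As a small illustration, here is a sketch with made-up counts for a single window (the counts vector is hypothetical, not taken from the join) that compares dense ranking with base R's rank() and ties.method = "min", which does leave gaps:

```r
library(data.table)

# Hypothetical purchase counts per brand within one 36-hour window
counts <- c(brand1 = 2, brand2 = 2, brand3 = 1, brand5 = 1)

# Negate the counts so that the most frequently bought brand gets rank 1
dense_rank <- frank(-counts, ties.method = "dense")  # ranks 1 1 2 2
min_rank   <- rank(-counts, ties.method = "min")     # ranks 1 1 3 3

dense_rank
min_rank
```

Which of the two is preferable depends on whether a brand in third place by count should be reported as rank 2 or rank 3 when two brands share first place.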
Taken on its own, the non-equi join
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
    nomatch = 0L, by = .EACHI,
    .SD[, .N, by = brand][, rank := frank(-N, ties.method = "dense")]]
returns the aggregated result for each 36-hour window:
               shoptime            shoptime  brand N rank
 1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1    1
 2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2    1
 3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2    1
 4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1    2
 5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2    1
 6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1    2
 7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1    2
 8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2    1
 9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1    2
10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1    2
11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1    2
12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2    2
21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3    1
22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1    3
23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1    3
24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1    3
25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2    2
26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4    1
27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1    3
28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1    3
29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1    3
               shoptime            shoptime  brand N rank
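To check one of these windows by hand, the following base R sketch recomputes the counts for the purchase at 09:29:00 (rows 20 to 24 above). The vectors shoptime and brand are simply the sample data retyped, and t0, in_window and window_counts are names made up for this illustration:

```r
shoptime <- as.POSIXct(c("2013-09-01 08:35:00", "2013-09-01 08:54:00",
                         "2013-09-01 09:07:00", "2013-09-01 09:08:00",
                         "2013-09-01 09:11:00", "2013-09-01 09:14:00",
                         "2013-09-01 09:26:00", "2013-09-01 09:26:00",
                         "2013-09-01 09:29:00", "2013-09-01 09:32:00"),
                       tz = "UTC")
brand <- c("brand1", "brand1", "brand2", "brand3", "brand5",
           "brand2", "brand6", "brand2", "brand2", "brand4")

# The purchase whose window we want to inspect (09:29:00)
t0 <- shoptime[9]

# Same condition as the join: lb <= shoptime < ub, with lb = t0 - 36 hours
in_window <- shoptime >= t0 - 36 * 3600 & shoptime < t0

# Purchases per brand inside the window
window_counts <- table(brand[in_window])
window_counts
```

The strict `<` on the upper bound is what keeps the current purchase out of its own window.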
This intermediate result is then reshaped from long to wide format:
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
    nomatch = 0L, by = .EACHI,
    .SD[, .N, by = brand][, rank := frank(-N, ties.method = "dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]
              shoptime brand1 brand2 brand3 brand5 brand6
1: 2013-09-01 08:54:00      1     NA     NA     NA     NA
2: 2013-09-01 09:07:00      1     NA     NA     NA     NA
3: 2013-09-01 09:08:00      1      2     NA     NA     NA
4: 2013-09-01 09:11:00      1      2      2     NA     NA
5: 2013-09-01 09:14:00      1      2      2      2     NA
6: 2013-09-01 09:26:00      1      1      2      2     NA
7: 2013-09-01 09:29:00      2      1      3      3      3
8: 2013-09-01 09:32:00      2      1      3      3      3
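For readers unfamiliar with dcast(), here is a minimal sketch of the long-to-wide step on a small excerpt of the intermediate result. The table named long is typed in by hand for illustration (it corresponds to the 09:08:00 and 09:11:00 windows) and is not produced by the join:

```r
library(data.table)

# Hand-made excerpt of the long intermediate result: one row per window and brand
long <- data.table(
  shoptime = as.POSIXct(c("2013-09-01 09:08:00", "2013-09-01 09:08:00",
                          "2013-09-01 09:11:00", "2013-09-01 09:11:00",
                          "2013-09-01 09:11:00"), tz = "UTC"),
  brand    = c("brand1", "brand2", "brand1", "brand2", "brand3"),
  rank     = c(1L, 2L, 1L, 2L, 2L)
)

# One row per shoptime, one column per brand; missing combinations become NA
wide <- dcast(long, shoptime ~ brand, value.var = "rank")
wide
```

Brands that were not bought in a given window simply have no row in the long table, which is why they show up as NA after reshaping.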
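The final right join with the original dat data frame fills in the missing rows and columns (see the code and result above).
Note that this answer uses a version of dat in which the brand column is named brand rather than chosenbrand: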
dat <- data.frame(
shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)
Answer 1 (score: 0)
One possibility is to write a function (with a loop) to do the job. Consider the data provided in the OP:
library(dplyr)
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
rank_brand1 = NA,
rank_brand2 = NA,
rank_brand3 = NA,
stringsAsFactors = FALSE)
# Function that takes the data.frame and calculates the rank columns
Calculate.Rank <- function(x){
  # Loop through each row and count the purchases of each brand in the window
  for(i in seq_len(nrow(x))){
    # Date-time of the current row
    currentrow.time <- as.POSIXlt(x$shoptime[i])
    # Note: `<=` below also counts the current purchase itself;
    # use `<` instead to exclude it, as the question asks
    # Number of times brand1 appears in the past 36 hours
    x$rank_brand1[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand1" ))
    # Number of times brand2 appears in the past 36 hours
    x$rank_brand2[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand2" ))
    # Number of times brand3 appears in the past 36 hours
    x$rank_brand3[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand3" ))
    # Replace 0 counts with NA. I don't think this is the right approach,
    # as one can consider those counts to be 0 anyway
    if(x$rank_brand1[i] == 0 ){
      x$rank_brand1[i] <- NA
    }
    if(x$rank_brand2[i] == 0 ){
      x$rank_brand2[i] <- NA
    }
    if(x$rank_brand3[i] == 0 ){
      x$rank_brand3[i] <- NA
    }
  }
  # The counts of brand1, brand2 and brand3 are now available. Let's calculate the ranks.
  new.x <- data.frame(x[,1:2], t(apply(-x[,3:5], 1, rank, ties.method='min', na.last = "keep")))
  print(new.x)
}
Calculate.Rank(dat)
The resulting data.frame new.x will look like this:
shoptime chosenbrand rank_brand1 rank_brand2 rank_brand3
1 2013-09-01 08:35:00 UTC brand1 1 NA NA
2 2013-09-01 08:54:00 UTC brand1 1 NA NA
3 2013-09-01 09:07:00 UTC brand2 1 2 NA
4 2013-09-01 09:08:00 UTC brand3 1 2 2
5 2013-09-01 09:11:00 UTC brand5 1 2 2
6 2013-09-01 09:14:00 UTC brand2 1 1 3
7 2013-09-01 09:26:00 UTC brand6 2 1 3
8 2013-09-01 09:26:00 UTC brand2 2 1 3
9 2013-09-01 09:29:00 UTC brand2 2 1 3
10 2013-09-01 09:32:00 UTC brand4 2 1 3
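The function above hard-codes three brands and, because of the `<=` comparison, also counts the current purchase. Below is one possible generalization in base R, offered only as a sketch under stated assumptions: rank_in_window() and rank_by_customer() are hypothetical helper names, the customer vector is invented for the example (the OP's data has no customer id), and the window excludes the current purchase as well as any purchase with the same timestamp.

```r
# Rank brands by purchase count in the `hours` before each purchase.
# The current purchase (and any purchase at the same timestamp) is excluded.
rank_in_window <- function(shoptime, brand, hours = 36,
                           brands = sort(unique(brand))) {
  out <- matrix(NA_integer_, nrow = length(shoptime), ncol = length(brands),
                dimnames = list(NULL, paste0("rank_", brands)))
  for (i in seq_along(shoptime)) {
    in_window <- shoptime >= shoptime[i] - hours * 3600 & shoptime < shoptime[i]
    if (!any(in_window)) next  # no earlier purchases: leave the row as NA
    # Count purchases per brand; brands not bought in the window become NA
    counts <- tabulate(match(brand[in_window], brands), nbins = length(brands))
    counts[counts == 0L] <- NA
    # The most frequently bought brand gets rank 1; NA counts stay NA
    out[i, ] <- rank(-counts, ties.method = "min", na.last = "keep")
  }
  out
}

# Per-customer variant: apply the same function within each customer's rows
rank_by_customer <- function(customer, shoptime, brand, hours = 36) {
  brands <- sort(unique(brand))
  out <- matrix(NA_integer_, nrow = length(customer), ncol = length(brands),
                dimnames = list(NULL, paste0("rank_", brands)))
  for (rows in split(seq_along(customer), customer)) {
    out[rows, ] <- rank_in_window(shoptime[rows], brand[rows], hours, brands)
  }
  out
}

# Sample data from the question, parsed as UTC date-times
shoptime <- as.POSIXct(c("2013-09-01 08:35:00", "2013-09-01 08:54:00",
                         "2013-09-01 09:07:00", "2013-09-01 09:08:00",
                         "2013-09-01 09:11:00", "2013-09-01 09:14:00",
                         "2013-09-01 09:26:00", "2013-09-01 09:26:00",
                         "2013-09-01 09:29:00", "2013-09-01 09:32:00"),
                       tz = "UTC")
chosenbrand <- c("brand1", "brand1", "brand2", "brand3", "brand5",
                 "brand2", "brand6", "brand2", "brand2", "brand4")

# One rank column per brand that occurs in the data
ranks  <- rank_in_window(shoptime, chosenbrand)
result <- cbind(data.frame(shoptime, chosenbrand), ranks)

# Hypothetical customer ids, only to show the per-customer variant
customer <- rep(c("A", "B"), each = 5)
ranks_by_customer <- rank_by_customer(customer, shoptime, chosenbrand)
```

The loop is quadratic in the number of rows, so for large data the non-equi join from the other answer is likely the better fit; this version mainly trades speed for fewer dependencies.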