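My data contain a time variable and a chosen-brand variable, as shown below. shoptime is the time of the purchase and chosenbrand is the brand bought at that time.
With these data, I want to create the third and fourth columns shown in the table below. The rules are as follows. The third (fourth) column gives the rank of brand1 (brand2) by how often each brand was chosen during the preceding 5 days. If there is no history for the brand within those 5 days, the value should be NA.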
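For example, take row 5. Its shoptime is 2013-09-05 09:11:00, so the 5-day window runs from 2013-08-31 09:11:00 to 2013-09-05 09:11:00. The purchases in that window are brand3, brand3, brand2, and brand1 (row 5's own chosenbrand is excluded). Ranked by how often each brand was chosen, brand1 (third column) is 2nd, and brand2 is also 2nd. So the two columns in row 5 should be 2 and 2.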
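As another example, take the last row of the table below. Its shoptime is 2013-09-09 09:32:00, so the 5-day window runs from 2013-09-04 09:32:00 to 2013-09-09 09:32:00. The purchases in that window are brand1, brand2, brand6, brand2, and brand2 (the row's own chosenbrand is excluded). Ranked by how often each brand was chosen, brand1 (third column) is 2nd and brand2 is 1st. So the two columns in that row should be 2 and 1.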
Is there a simple way to do this?
Also, how would I do this per customer (if each customer has several purchase records)?
The data look like this:
shoptime chosenbrand nth_most_freq_brand1 nth_most_freq_brand2
2013-09-01 08:35:00 brand3 NA NA
2013-09-02 08:54:00 brand3 NA NA
2013-09-03 09:07:00 brand2 NA NA
2013-09-04 09:08:00 brand1 NA 2
2013-09-05 09:11:00 brand1 2 2
2013-09-06 09:14:00 brand2 1 2
2013-09-07 09:26:00 brand6 1 1
2013-09-08 09:26:00 brand2 1 2
2013-09-09 09:29:00 brand2 2 1
2013-09-09 09:32:00 brand4 2 1
Here is the code for the data:
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-02 08:54:00 UTC", "2013-09-03 09:07:00 UTC" ,"2013-09-04 09:08:00 UTC", "2013-09-05 09:11:00 UTC", "2013-09-06 09:14:00 UTC",
"2013-09-07 09:26:00 UTC", "2013-09-08 09:26:00 UTC" ,"2013-09-09 09:29:00 UTC", "2013-09-09 09:32:00 UTC"),
chosenbrand = c("brand3", "brand3", "brand2", "brand1", "brand1", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
nth_most_freq_brand1 = NA,
nth_most_freq_brand2 = NA,
stringsAsFactors = FALSE)
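Answer 0 (score: 3)
The OP has asked a very similar question, "How to create a rank variable under certain conditions?". If I understand correctly, the only difference is that the ranks are wanted only for brand1 and brand2 (rather than for all values of chosenbrand). So my answer there can be reused here with some adaptations and improvements: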
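library(data.table)
library(lubridate)
# dat is assumed to hold only shoptime and chosenbrand here; drop the two
# placeholder NA columns from the question first, or the final join keeps them
setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
selected_brands <- c("brand1", "brand2")
result <- dat[
  # non-equi self-join: for each row, match all purchases in the 5 days before it
  .(lb = shoptime - hours(5 * 24), ub = shoptime),
  on = .(shoptime >= lb, shoptime < ub),
  nomatch = 0L, by = .EACHI,
  # count each brand in the window, then dense-rank by descending count
  .SD[, .N, by = chosenbrand][, rank := frank(-N, ties.method = "dense")]][
    chosenbrand %in% selected_brands,
    # reshape to one rank column per selected brand
    dcast(unique(.SD[, -1]), shoptime ~ paste0("nth_most_freq_", chosenbrand),
          value.var = "rank")][
    dat, on = "shoptime"]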
# change column order to make it look more similar to the expected answer
setcolorder(result, c(1, 4, 2:3))
result
               shoptime chosenbrand nth_most_freq_brand1 nth_most_freq_brand2
 1: 2013-09-01 08:35:00      brand3                   NA                   NA
 2: 2013-09-02 08:54:00      brand3                   NA                   NA
 3: 2013-09-03 09:07:00      brand2                   NA                   NA
 4: 2013-09-04 09:08:00      brand1                   NA                    2
 5: 2013-09-05 09:11:00      brand1                    2                    2
 6: 2013-09-06 09:14:00      brand2                    1                    2
 7: 2013-09-07 09:26:00      brand6                    1                    1
 8: 2013-09-08 09:26:00      brand2                    1                    2
 9: 2013-09-09 09:29:00      brand2                    2                    1
10: 2013-09-09 09:32:00      brand4                    2                    1
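For checking results on small data, the rule from the question can also be applied row by row. The following is only a brute-force sketch in base R, not the approach used above; rank_in_window is a hypothetical helper name, and the data are retyped from the question.

```r
# Brute-force check of the windowed dense rank, base R only.
# rank_in_window is a hypothetical helper, not part of any package.
shoptime <- as.POSIXct(c(
  "2013-09-01 08:35:00", "2013-09-02 08:54:00", "2013-09-03 09:07:00",
  "2013-09-04 09:08:00", "2013-09-05 09:11:00", "2013-09-06 09:14:00",
  "2013-09-07 09:26:00", "2013-09-08 09:26:00", "2013-09-09 09:29:00",
  "2013-09-09 09:32:00"), tz = "UTC")
chosenbrand <- c("brand3", "brand3", "brand2", "brand1", "brand1",
                 "brand2", "brand6", "brand2", "brand2", "brand4")

rank_in_window <- function(i, brand, days = 5) {
  # purchases in [shoptime[i] - days, shoptime[i]), so row i itself is excluded
  lb <- shoptime[i] - days * 24 * 3600
  in_window <- shoptime >= lb & shoptime < shoptime[i]
  counts <- table(chosenbrand[in_window])
  # brand not seen in the window: no rank
  if (!brand %in% names(counts)) return(NA_integer_)
  # dense rank: 1 for the highest count, ties share a rank, no gaps
  distinct_counts <- sort(unique(as.vector(counts)), decreasing = TRUE)
  match(as.vector(counts[brand]), distinct_counts)
}

rows <- seq_along(shoptime)
check <- data.frame(
  shoptime, chosenbrand,
  nth_most_freq_brand1 = sapply(rows, rank_in_window, brand = "brand1"),
  nth_most_freq_brand2 = sapply(rows, rank_in_window, brand = "brand2")
)
check
```

The two rank columns of check should agree with the result printed above. The loop rescans all rows for every purchase, so it is only meant for verification on small inputs.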
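The OP has asked a further question:
"Also, how would I do this per customer (if each customer has several purchase records)?"
Unfortunately, the OP did not provide a sample dataset for this case. So we need to make up a dataset for two customers from the data provided: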
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-02 08:54:00 UTC", "2013-09-03 09:07:00 UTC" ,"2013-09-04 09:08:00 UTC", "2013-09-05 09:11:00 UTC", "2013-09-06 09:14:00 UTC",
"2013-09-07 09:26:00 UTC", "2013-09-08 09:26:00 UTC" ,"2013-09-09 09:29:00 UTC", "2013-09-09 09:32:00 UTC"),
chosenbrand = c("brand3", "brand3", "brand2", "brand1", "brand1", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)
dat <- rbindlist(list(dat, dat[c(FALSE, TRUE), ]), idcol = "customer")
dat
    customer                shoptime chosenbrand
 1:        1 2013-09-01 08:35:00 UTC      brand3
 2:        1 2013-09-02 08:54:00 UTC      brand3
 3:        1 2013-09-03 09:07:00 UTC      brand2
 4:        1 2013-09-04 09:08:00 UTC      brand1
 5:        1 2013-09-05 09:11:00 UTC      brand1
 6:        1 2013-09-06 09:14:00 UTC      brand2
 7:        1 2013-09-07 09:26:00 UTC      brand6
 8:        1 2013-09-08 09:26:00 UTC      brand2
 9:        1 2013-09-09 09:29:00 UTC      brand2
10:        1 2013-09-09 09:32:00 UTC      brand4
11:        2 2013-09-02 08:54:00 UTC      brand3
12:        2 2013-09-04 09:08:00 UTC      brand1
13:        2 2013-09-06 09:14:00 UTC      brand2
14:        2 2013-09-08 09:26:00 UTC      brand2
15:        2 2013-09-09 09:32:00 UTC      brand4
Now we can modify the existing solution to take the different customers into account:
setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, customer, shoptime) # not required, just for convenience of observers
selected_brands <- c("brand1", "brand2")
result <- dat[
  .(customer = customer, lb = shoptime - hours(5 * 24), ub = shoptime),
  on = .(customer, shoptime >= lb, shoptime < ub),
  nomatch = 0L, by = .EACHI,
  .SD[, .N, by = chosenbrand][, rank := frank(-N, ties.method = "dense")]][
    chosenbrand %in% selected_brands,
    dcast(unique(.SD[, -2]), customer + shoptime ~ paste0("nth_most_freq_", chosenbrand),
          value.var = "rank")][
    dat, on = .(customer, shoptime)]
# change column order to make it look more similar to the expected answer
setcolorder(result, c(1:2, 5, 3:4))
result
    customer            shoptime chosenbrand nth_most_freq_brand1 nth_most_freq_brand2
 1:        1 2013-09-01 08:35:00      brand3                   NA                   NA
 2:        1 2013-09-02 08:54:00      brand3                   NA                   NA
 3:        1 2013-09-03 09:07:00      brand2                   NA                   NA
 4:        1 2013-09-04 09:08:00      brand1                   NA                    2
 5:        1 2013-09-05 09:11:00      brand1                    2                    2
 6:        1 2013-09-06 09:14:00      brand2                    1                    2
 7:        1 2013-09-07 09:26:00      brand6                    1                    1
 8:        1 2013-09-08 09:26:00      brand2                    1                    2
 9:        1 2013-09-09 09:29:00      brand2                    2                    1
10:        1 2013-09-09 09:32:00      brand4                    2                    1
11:        2 2013-09-02 08:54:00      brand3                   NA                   NA
12:        2 2013-09-04 09:08:00      brand1                   NA                   NA
13:        2 2013-09-06 09:14:00      brand2                    1                   NA
14:        2 2013-09-08 09:26:00      brand2                    1                    1
15:        2 2013-09-09 09:32:00      brand4                   NA                    1
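To see what adding customer to the join condition changes, it may help to compare the window contents for a single row with and without the customer restriction. This is a base R sketch for illustration only; the data are retyped from the two-customer dataset above, and the variable names pooled and per_customer are my own.

```r
# Base R sketch: restricting the 5-day window to the same customer.
# Customer 2 is every second purchase of customer 1, as in dat[c(FALSE, TRUE), ].
times1 <- as.POSIXct(c(
  "2013-09-01 08:35:00", "2013-09-02 08:54:00", "2013-09-03 09:07:00",
  "2013-09-04 09:08:00", "2013-09-05 09:11:00", "2013-09-06 09:14:00",
  "2013-09-07 09:26:00", "2013-09-08 09:26:00", "2013-09-09 09:29:00",
  "2013-09-09 09:32:00"), tz = "UTC")
brands1 <- c("brand3", "brand3", "brand2", "brand1", "brand1",
             "brand2", "brand6", "brand2", "brand2", "brand4")
keep <- rep(c(FALSE, TRUE), 5)

customer    <- c(rep(1L, 10), rep(2L, 5))
shoptime    <- c(times1, times1[keep])
chosenbrand <- c(brands1, brands1[keep])

i  <- 15                              # last purchase of customer 2
lb <- shoptime[i] - 5 * 24 * 3600     # lower bound, 5 days earlier
in_time <- shoptime >= lb & shoptime < shoptime[i]

# brand counts in the window, ignoring who bought
pooled <- table(chosenbrand[in_time])
# brand counts in the window, customer 2's purchases only
per_customer <- table(chosenbrand[in_time & customer == customer[i]])
pooled
per_customer
```

For row 15, the customer-restricted window holds only brand2, which is consistent with brand1 being NA and brand2 being ranked 1 in the result above.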
Answer 1 (score: 1)
library(tidyverse)
library(lubridate)
Step 1: Convert the shoptime column to a date-time object.
dat <- dat %>% mutate(shoptime = ymd_hms(shoptime))
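Step 2: Create a lookup table for all shoptime values.
The complete function creates all combinations between columns, so we can make a copy of the shoptime column (shoptime1) and generate every combination of the two. Then filter(shoptime1 > shoptime - hours(5 * 24), shoptime1 < shoptime) keeps only the pairs in which shoptime1 falls within the 5 days before shoptime.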
dat2 <- dat %>%
  mutate(shoptime1 = shoptime) %>%
  select(contains("shoptime")) %>%
  complete(shoptime, shoptime1) %>%
  filter(shoptime1 > shoptime - hours(5 * 24), shoptime1 < shoptime)
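To make the contents of the lookup table more concrete, here is a rough base R equivalent on three of the timestamps. It is a sketch only: expand.grid over row indices stands in for complete, and the names idx, pairs and lookup are my own.

```r
# Base R sketch of the lookup table: build every (shoptime, shoptime1) pair,
# then keep the pairs where shoptime1 lies within the 5 days before shoptime.
shoptime <- as.POSIXct(c("2013-09-01 08:35:00", "2013-09-02 08:54:00",
                         "2013-09-07 09:26:00"), tz = "UTC")

# all combinations of row indices (3 x 3 = 9 pairs)
idx <- expand.grid(current = seq_along(shoptime), earlier = seq_along(shoptime))
pairs <- data.frame(shoptime  = shoptime[idx$current],
                    shoptime1 = shoptime[idx$earlier])

five_days <- as.difftime(5, units = "days")
lookup <- pairs[pairs$shoptime1 > pairs$shoptime - five_days &
                pairs$shoptime1 < pairs$shoptime, ]
lookup
```

Only one pair survives here: the purchase on 2013-09-02 sees the one on 2013-09-01, while the purchase on 2013-09-07 has nothing within its window. Note that the lower bound is strict (>) in this answer, whereas the data.table answer uses >=; the two differ only when a purchase is exactly 5 days old.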
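Step 3: Join dat to the lookup table, count the brands, and rank the counts.
We can join the lookup table dat2 to dat by matching shoptime1 to shoptime. The count function counts the occurrences per group. After that, we group by shoptime and use dense_rank to rank each brand within each group.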
dat3 <- dat2 %>%
  left_join(dat, by = c("shoptime1" = "shoptime")) %>%
  count(shoptime, chosenbrand) %>%
  group_by(shoptime) %>%
  mutate(rank = dense_rank(desc(n))) %>%
  select(-n) %>%
  spread(chosenbrand, rank) %>%
  select(shoptime, brand1, brand2)
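In case the behaviour of dense ranking is unclear, the following base R sketch reproduces it on made-up counts (the brand names a, b, c and their counts are hypothetical, not taken from the data). It uses match against the sorted distinct values as a stand-in for dense_rank.

```r
# Hypothetical counts within one window
n <- c(a = 3, b = 3, c = 1)

# Dense rank of descending counts: ties share a rank and no rank is skipped
dense <- match(-n, sort(unique(-n)))
names(dense) <- names(n)
dense

# For contrast, rank() with ties.method = "min" skips a rank after the tie
gapped <- rank(-n, ties.method = "min")
gapped
```

With dense ranking, c comes out as rank 2 directly after the two tied leaders, which is the behaviour the question's expected output relies on (brand1 and brand2 both being 2nd in row 5).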
Step 4: Join the original data frame to the dat3 data frame.
dat4 <- dat %>% left_join(dat3, by = "shoptime")
Here is the final result.
dat4
# shoptime chosenbrand brand1 brand2
# 1 2013-09-01 08:35:00 brand3 NA NA
# 2 2013-09-02 08:54:00 brand3 NA NA
# 3 2013-09-03 09:07:00 brand2 NA NA
# 4 2013-09-04 09:08:00 brand1 NA 2
# 5 2013-09-05 09:11:00 brand1 2 2
# 6 2013-09-06 09:14:00 brand2 1 2
# 7 2013-09-07 09:26:00 brand6 1 1
# 8 2013-09-08 09:26:00 brand2 1 2
# 9 2013-09-09 09:29:00 brand2 2 1
# 10 2013-09-09 09:32:00 brand4 2 1
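Since the OP did not provide a sample dataset for the per-customer case, I will use the sample dataset that Uwe created. My answer above needs only slight modification to handle it. The key is to treat the customer column as a grouping variable in some of the steps.
Below is the code to create the sample dataset. I only add as.tibble at the end to convert the data.table object to a tibble.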
library(data.table)
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-02 08:54:00 UTC", "2013-09-03 09:07:00 UTC" ,"2013-09-04 09:08:00 UTC", "2013-09-05 09:11:00 UTC", "2013-09-06 09:14:00 UTC",
"2013-09-07 09:26:00 UTC", "2013-09-08 09:26:00 UTC" ,"2013-09-09 09:29:00 UTC", "2013-09-09 09:32:00 UTC"),
chosenbrand = c("brand3", "brand3", "brand2", "brand1", "brand1", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)
dat <- rbindlist(list(dat, dat[c(FALSE, TRUE), ]), idcol = "customer")
dat <- as.tibble(dat)
Step 1: Convert the shoptime column to a date-time object.
dat <- dat %>% mutate(shoptime = ymd_hms(shoptime))
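Step 2: Create a lookup table for all shoptime values.
Note that the code is almost identical to the earlier version, except that we need to group by customer before applying the complete function.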
dat2 <- dat %>%
  mutate(shoptime1 = shoptime) %>%
  select(contains("shoptime"), customer) %>%
  group_by(customer) %>%
  complete(shoptime, shoptime1) %>%
  filter(shoptime1 > shoptime - hours(5 * 24), shoptime1 < shoptime)
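Step 3: Join dat to the lookup table, count the brands, and rank the counts.
Again, we need to take the customer column into account when joining and when counting the brands.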
dat3 <- dat2 %>%
  left_join(dat, by = c("customer", "shoptime1" = "shoptime")) %>%
  count(customer, shoptime, chosenbrand) %>%
  group_by(customer, shoptime) %>%
  mutate(rank = dense_rank(-n)) %>%
  select(-n) %>%
  spread(chosenbrand, rank) %>%
  select(customer, shoptime, brand1, brand2)
Step 4: Join the original data frame to the dat3 data frame.
dat4 <- dat %>% left_join(dat3, by = c("customer", "shoptime"))
Here is the final result. I add as.data.frame only to print the output in a simpler format.
dat4 %>% as.data.frame()
# customer shoptime chosenbrand brand1 brand2
# 1 1 2013-09-01 08:35:00 brand3 NA NA
# 2 1 2013-09-02 08:54:00 brand3 NA NA
# 3 1 2013-09-03 09:07:00 brand2 NA NA
# 4 1 2013-09-04 09:08:00 brand1 NA 2
# 5 1 2013-09-05 09:11:00 brand1 2 2
# 6 1 2013-09-06 09:14:00 brand2 1 2
# 7 1 2013-09-07 09:26:00 brand6 1 1
# 8 1 2013-09-08 09:26:00 brand2 1 2
# 9 1 2013-09-09 09:29:00 brand2 2 1
# 10 1 2013-09-09 09:32:00 brand4 2 1
# 11 2 2013-09-02 08:54:00 brand3 NA NA
# 12 2 2013-09-04 09:08:00 brand1 NA NA
# 13 2 2013-09-06 09:14:00 brand2 1 NA
# 14 2 2013-09-08 09:26:00 brand2 1 1
# 15 2 2013-09-09 09:32:00 brand4 NA 1