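I am looking for guidance on a scenario in R where I have a table with the following columns: customer_name, zip_code, date_applied (the date the customer applied), and date_mail_sent (the date an advertisement about the service was mailed).
I want, for each customer_name and application, the count of mails that led to that application. Mails sent within 90 days of one another should be grouped together, and the most recent mail in such a group should fall within 90 days of the application date.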
Sample dataset:
customer_name zip_code date_applied date_mail_sent
aaaa 12345 1/15/2017 4/3/2016
aaaa 12345 1/15/2017 9/15/2016
aaaa 12345 1/15/2017 12/5/2016
aaaa 12345 5/15/2017 4/1/2017
aaaa 12345 5/15/2017 4/15/2017
bbbb 45667 3/3/2017 1/2/2017
bbbb 45667 3/3/2017 2/2/2017
bbbb 45667 3/3/2017 3/2/2017
bbbb 45667 12/15/2017 6/30/2017
bbbb 45667 12/15/2017 7/30/2017
Desired result:
customer_name zip_code date_applied count_of_mail_sent
aaaa 12345 1/15/2017 2
aaaa 12345 5/15/2017 2
bbbb 45667 3/3/2017 3
bbbb 45667 12/15/2017 0
Answer 0: (score: 1)
You can do it like this:
library(dplyr)
df %>%
  mutate(interval = difftime(date_applied, date_mail_sent, units = "days") <= 90) %>%
  group_by(customer_name, zip_code, date_applied) %>%
  summarise(count_of_mail_sent = sum(interval))
customer_name zip_code date_applied count_of_mail_sent
<fct> <int> <date> <int>
1 aaaa 12345 2017-01-15 2
2 aaaa 12345 2017-05-15 2
3 bbbb 45667 2017-03-03 3
4 bbbb 45667 2017-12-15 0
Or even:
df %>%
  group_by(customer_name, zip_code, date_applied) %>%
  summarise(count_of_mail_sent = sum(difftime(date_applied, date_mail_sent, units = "days") <= 90))
Data used:
df = read.table(text = "customer_name zip_code date_applied date_mail_sent
aaaa 12345 1/15/2017 4/3/2016
aaaa 12345 1/15/2017 11/28/2016
aaaa 12345 1/15/2017 12/5/2016
aaaa 12345 5/15/2017 4/1/2017
aaaa 12345 5/15/2017 4/15/2017
bbbb 45667 3/3/2017 1/2/2017
bbbb 45667 3/3/2017 2/2/2017
bbbb 45667 3/3/2017 3/2/2017
bbbb 45667 12/15/2017 6/30/2017
bbbb 45667 12/15/2017 7/30/2017", header = TRUE)
df$date_applied <- as.Date(df$date_applied, format = "%m/%d/%Y")
df$date_mail_sent <- as.Date(df$date_mail_sent, format = "%m/%d/%Y")
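As a quick check of what the comparison does, the first group (aaaa, applied 1/15/2017) can be worked through by hand in base R. This is only an illustrative sketch; the variable names applied, sent, and days_prior are chosen here and are not part of the answer above.

```r
# Application date and the three mail dates for the first group in the data above
applied <- as.Date("1/15/2017", format = "%m/%d/%Y")
sent <- as.Date(c("4/3/2016", "11/28/2016", "12/5/2016"), format = "%m/%d/%Y")

# Days between each mailing and the application
days_prior <- as.numeric(applied - sent)
days_prior              # 287 48 41

# TRUE counts as 1 when summed, so this is the number of mails within 90 days
sum(days_prior <= 90)   # 2
```

Only the last two mailings fall inside the 90-day window, which is where the count of 2 comes from. Note that a mail sent after the application date would give a negative difference and would also pass the <= 90 test.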
Answer 1: (score: 1)
library(tidyverse)
library(lubridate)
df %>%
  mutate_at(vars(contains("date")), mdy) %>%
  mutate(days_prior = (date_applied - date_mail_sent) / ddays(1)) %>%
  count(customer_name, date_applied, wt = days_prior <= 90)
# A tibble: 4 x 3
customer_name date_applied n
<chr> <date> <int>
1 aaaa 2017-01-15 2
2 aaaa 2017-05-15 2
3 bbbb 2017-03-03 3
4 bbbb 2017-12-15 0
Answer 2: (score: 1)
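Logically this is the same as the other solutions posted: convert the columns to dates, take the difference between date_applied and date_mail_sent, and count the rows where that difference is at most 90 days, but using base R.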
df[3:4] <- lapply(df[3:4], as.Date, "%m/%d/%Y")
df$is_less_than_90 <- as.numeric(df$date_applied - df$date_mail_sent, "days") <= 90
aggregate(is_less_than_90~customer_name + zip_code + date_applied, df, sum)
# customer_name zip_code date_applied is_less_than_90
#1 aaaa 12345 2017-01-15 2
#2 bbbb 45667 2017-03-03 3
#3 aaaa 12345 2017-05-15 2
#4 bbbb 45667 2017-12-15 0
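All three answers count the mails sent within 90 days of date_applied. They reproduce the desired result because the data used above has 11/28/2016 where the question's sample has 9/15/2016. With the question's original row, 9/15/2016 is 122 days before 1/15/2017, so a plain 90-day test would count only one mail for that application, whereas the question expects 2 because 9/15/2016 lies within 90 days of 12/5/2016. Below is a minimal base R sketch of that chaining rule, under two assumptions: mails dated after the application are ignored, and the chain stops at the first gap longer than 90 days. The helper count_chained_mail and the variables mail_df, key, and result are hypothetical names introduced only for this illustration.

```r
# The question's original sample (note the 9/15/2016 row)
mail_df <- read.table(text = "customer_name zip_code date_applied date_mail_sent
aaaa 12345 1/15/2017 4/3/2016
aaaa 12345 1/15/2017 9/15/2016
aaaa 12345 1/15/2017 12/5/2016
aaaa 12345 5/15/2017 4/1/2017
aaaa 12345 5/15/2017 4/15/2017
bbbb 45667 3/3/2017 1/2/2017
bbbb 45667 3/3/2017 2/2/2017
bbbb 45667 3/3/2017 3/2/2017
bbbb 45667 12/15/2017 6/30/2017
bbbb 45667 12/15/2017 7/30/2017", header = TRUE, stringsAsFactors = FALSE)
mail_df$date_applied <- as.Date(mail_df$date_applied, format = "%m/%d/%Y")
mail_df$date_mail_sent <- as.Date(mail_df$date_mail_sent, format = "%m/%d/%Y")

# Hypothetical helper: walk back from the application date and count mails
# until the first gap longer than `window` days
count_chained_mail <- function(applied, sent, window = 90) {
  # Assumption: mails sent after the application are dropped
  sent <- sort(sent[sent <= applied], decreasing = TRUE)
  if (length(sent) == 0) return(0L)
  # Gap from the application to the latest mail, then from each mail to the one before it
  gaps <- as.numeric(c(applied, sent[-length(sent)]) - sent)
  # cumprod() drops to 0 at the first gap over the window and stays 0 afterwards
  as.integer(sum(cumprod(gaps <= window)))
}

# One output row per customer, zip code, and application date
key <- paste(mail_df$customer_name, mail_df$zip_code, mail_df$date_applied)
result <- do.call(rbind, lapply(split(mail_df, key), function(g) {
  data.frame(
    customer_name = g$customer_name[1],
    zip_code = g$zip_code[1],
    date_applied = g$date_applied[1],
    count_of_mail_sent = count_chained_mail(g$date_applied[1], g$date_mail_sent)
  )
}))
rownames(result) <- NULL
result   # count_of_mail_sent: 2, 2, 3, 0
```

For bbbb's 12/15/2017 application the latest mail (7/30/2017) is 138 days earlier, so the chain never starts and the count is 0, matching the desired result.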