How do I group by customer name and count the mailings sent?

Time: 2019-06-24 04:34:37

Tags: r

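I am looking for guidance on a scenario in R where I have a table with the following columns: customer_name, zip_code, date_applied (the date the customer applied), and date_mail_sent (the date an advertisement about the service was mailed).

For each customer_name and application, I want the number of mailings that led to that application. Mailings sent within 90 days of each other should be grouped together, and the latest mailing in the group should fall within 90 days of the application date.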

Sample dataset:

customer_name   zip_code    date_applied    date_mail_sent
aaaa            12345           1/15/2017   4/3/2016
aaaa            12345           1/15/2017   9/15/2016
aaaa            12345           1/15/2017   12/5/2016
aaaa            12345           5/15/2017   4/1/2017
aaaa            12345           5/15/2017   4/15/2017
bbbb            45667           3/3/2017    1/2/2017
bbbb            45667           3/3/2017    2/2/2017
bbbb            45667           3/3/2017    3/2/2017
bbbb            45667           12/15/2017  6/30/2017
bbbb            45667           12/15/2017  7/30/2017

Desired result:

customer_name   zip_code    date_applied    count_of_mail_sent
aaaa            12345           1/15/2017   2
aaaa            12345           5/15/2017   2
bbbb            45667           3/3/2017    3
bbbb            45667           12/15/2017  0
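
The count for the first application can be checked by hand. The sketch below, in base R, computes the gaps between consecutive mailings for customer aaaa's 1/15/2017 application and the gap from the latest mailing to the application date. The variable names are illustrative only.

```r
# Mailings and application date for customer aaaa, taken from the sample dataset
sent    <- as.Date(c("4/3/2016", "9/15/2016", "12/5/2016"), format = "%m/%d/%Y")
applied <- as.Date("1/15/2017", format = "%m/%d/%Y")

# Days between consecutive mailings (oldest to newest)
gaps_between_mail <- as.numeric(diff(sent), units = "days")

# Days from the latest mailing to the application
gap_to_application <- as.numeric(applied - max(sent), units = "days")

gaps_between_mail   # 165 81
gap_to_application  # 41
```

The 9/15/2016 and 12/5/2016 mailings are 81 days apart, so they are grouped; the 4/3/2016 mailing is 165 days earlier, so it is left out. The latest mailing is 41 days before the application, which gives the count of 2.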

3 Answers:

Answer 0 (score: 1)

You can do this:

library(dplyr)
df %>% 
  mutate(interval = difftime(date_applied, date_mail_sent, units = "days") <= 90) %>% 
  group_by(customer_name, zip_code, date_applied) %>%
  summarise(count_of_mail_sent = sum(interval))

  customer_name zip_code date_applied count_of_mail_sent
  <fct>            <int> <date>                    <int>
1 aaaa             12345 2017-01-15                    2
2 aaaa             12345 2017-05-15                    2
3 bbbb             45667 2017-03-03                    3
4 bbbb             45667 2017-12-15                    0

Or even:

df %>% 
  group_by(customer_name, zip_code, date_applied) %>%
  summarise(count_of_mail_sent = sum(difftime(date_applied, date_mail_sent, units = "days") <= 90))

Data used:

df = read.table(text = "customer_name   zip_code    date_applied    date_mail_sent
aaaa            12345           1/15/2017   4/3/2016
aaaa            12345           1/15/2017   11/28/2016
aaaa            12345           1/15/2017   12/5/2016
aaaa            12345           5/15/2017   4/1/2017
aaaa            12345           5/15/2017   4/15/2017
bbbb            45667           3/3/2017    1/2/2017
bbbb            45667           3/3/2017    2/2/2017
bbbb            45667           3/3/2017    3/2/2017
bbbb            45667           12/15/2017  6/30/2017
bbbb            45667           12/15/2017  7/30/2017", header = TRUE)

df$date_applied <- as.Date(df$date_applied, format = "%m/%d/%Y")
df$date_mail_sent <- as.Date(df$date_mail_sent, format = "%m/%d/%Y")
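
One caveat worth checking against your own data: the comparison `difftime(...) <= 90` is also TRUE when the difference is negative, that is, when a mailing went out after the application. The sample data has no such rows, so the small base R sketch below uses one hypothetical later mailing (2017-02-01) to show the effect and a possible guard.

```r
applied <- as.Date("2017-01-15")
# The second date is hypothetical: a mailing sent AFTER the application
sent <- as.Date(c("2016-12-05", "2017-02-01"))

days_prior <- as.numeric(difftime(applied, sent, units = "days"))

# Upper bound only: the later mailing (negative difference) is counted too
count_upper_only <- sum(days_prior <= 90)

# Both bounds: only mailings sent 0 to 90 days before the application
count_bounded <- sum(days_prior >= 0 & days_prior <= 90)
```

If mailings after the application should not count, the bounded condition can be used inside `summarise()` in the same way.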

Answer 1 (score: 1)

library(tidyverse)
library(lubridate)
df %>%
  mutate_at(vars(contains("date")), mdy) %>%
  mutate(days_prior = (date_applied - date_mail_sent) / ddays(1)) %>%
  count(customer_name, date_applied, wt = days_prior <= 90)

# A tibble: 4 x 3
  customer_name date_applied     n
  <chr>         <date>       <int>
1 aaaa          2017-01-15       2
2 aaaa          2017-05-15       2
3 bbbb          2017-03-03       3
4 bbbb          2017-12-15       0

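Answer 2 (score: 1)

Logically, this is the same as the other solutions posted: convert the columns to dates, take the difference between date_applied and date_mail_sent, and count the rows where the difference is no more than 90 days, but in base R.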

df[3:4] <- lapply(df[3:4], as.Date, "%m/%d/%Y")
df$is_less_than_90 <- as.numeric(df$date_applied - df$date_mail_sent, "days") <= 90

aggregate(is_less_than_90~customer_name + zip_code + date_applied, df, sum)

#  customer_name zip_code date_applied is_less_than_90
#1          aaaa    12345   2017-01-15               2
#2          bbbb    45667   2017-03-03               3
#3          aaaa    12345   2017-05-15               2
#4          bbbb    45667   2017-12-15               0
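
The answers above count every mailing within 90 days of date_applied. On the question's original sample, which has a 9/15/2016 mailing where the answer data has 11/28/2016, that threshold would count only one mailing for the first application, because 9/15/2016 is 122 days before 1/15/2017. Below is a minimal base R sketch of one possible reading of the grouping rule in the question: start from the latest mailing, require it to be within 90 days of the application, then walk backwards while consecutive mailings are no more than 90 days apart. The helper `count_chained_mail` and its `window` argument are hypothetical names, not from the original post.

```r
# Sample data as given in the question (with the 9/15/2016 mailing)
df <- read.table(text = "customer_name zip_code date_applied date_mail_sent
aaaa 12345 1/15/2017 4/3/2016
aaaa 12345 1/15/2017 9/15/2016
aaaa 12345 1/15/2017 12/5/2016
aaaa 12345 5/15/2017 4/1/2017
aaaa 12345 5/15/2017 4/15/2017
bbbb 45667 3/3/2017 1/2/2017
bbbb 45667 3/3/2017 2/2/2017
bbbb 45667 3/3/2017 3/2/2017
bbbb 45667 12/15/2017 6/30/2017
bbbb 45667 12/15/2017 7/30/2017", header = TRUE, stringsAsFactors = FALSE)
df$date_applied   <- as.Date(df$date_applied,   format = "%m/%d/%Y")
df$date_mail_sent <- as.Date(df$date_mail_sent, format = "%m/%d/%Y")

# Hypothetical helper: size of the chain of mailings ending at the latest one
count_chained_mail <- function(sent, applied, window = 90) {
  sent <- sort(sent[sent <= applied])   # assumption: ignore mail sent after applying
  n <- length(sent)
  if (n == 0) return(0L)
  # The latest mailing must be within `window` days of the application
  if (as.numeric(applied - sent[n], units = "days") > window) return(0L)
  gaps <- as.numeric(diff(sent), units = "days")
  count <- 1L
  i <- n - 1L
  # Walk backwards while consecutive mailings are at most `window` days apart
  while (i >= 1L && gaps[i] <= window) {
    count <- count + 1L
    i <- i - 1L
  }
  count
}

# Apply the helper to each customer/application group
groups <- split(df, list(df$customer_name, df$date_applied), drop = TRUE)
result <- do.call(rbind, lapply(groups, function(g) {
  data.frame(customer_name = g$customer_name[1],
             zip_code = g$zip_code[1],
             date_applied = g$date_applied[1],
             count_of_mail_sent = count_chained_mail(g$date_mail_sent,
                                                     g$date_applied[1]))
}))
result <- result[order(result$customer_name, result$date_applied), ]
rownames(result) <- NULL
print(result)
```

On this sample the counts come out as 2, 2, 3 and 0, matching the desired result in the question. Whether this is the intended rule is an assumption; if mailings should instead be chained in some other way, only the helper needs to change.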