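I'm struggling with how to phrase my question. I have a data frame of accounts, and I want to create a new column that flags whether another account with the same email address was created within 30 days of that account.
I have a table like this.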
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(AccountNumbers,EmailAddress,Dates)
print(df)
AccountNumbers EmailAddress Dates
3748 John@gmail.com 2018-05-01
8894 John@gmail.com 2018-05-05
9923 Alex@outlook.com 2018-05-10
4502 Alan@yahoo.com 2018-05-15
7283 Stan@aol.com 2018-05-20
8012 Mary@outlook.com 2018-05-25
2938 Adam@outlook.com 2018-05-30
7485 Tom@aol.com 2018-06-01
1010 Jane@yahoo.com 2018-06-05
9877 John@gmail.com 2018-06-10
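John@gmail.com appears 3 times. I want to flag the first two rows because they occur within 30 days of each other, but I don't want to flag the third row.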
AccountNumbers EmailAddress Dates DuplicateEmailFlag
3748 John@gmail.com 2018-05-01 1
8894 John@gmail.com 2018-05-05 1
9923 Alex@outlook.com 2018-05-10 0
4502 Alan@yahoo.com 2018-05-15 0
7283 Stan@aol.com 2018-05-20 0
8012 Mary@outlook.com 2018-05-25 0
2938 Adam@outlook.com 2018-05-30 0
7485 Tom@aol.com 2018-06-01 0
1010 Jane@yahoo.com 2018-06-05 0
9877 John@gmail.com 2018-06-10 0
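I've been trying to use ifelse() inside mutate, but I don't know whether it's possible to tell dplyr to consider only the rows within 30 days of the row being evaluated.
Edit: To clarify, I want to look at the 30 days around each account. So if the same email address happens to be added exactly every 30 days, every occurrence of that email should be flagged.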
Answer 0: (score: 1)
This seems to work. First, I define the data frame.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))
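Next, I group by email and check whether there is an entry within the preceding 30 days or the following 30 days. I also replace the NAs
(which arise when an email has only one entry) with 0. Finally, I ungroup.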
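library(dplyr)
df %>%
  group_by(email) %>%
  # Flag a row when the previous or next entry for the same email is within 30 days;
  # coalesce(..., FALSE) treats a missing neighbour (NA) as "no match".
  mutate(dupe = as.numeric(
    coalesce(date - lag(date) <= 30, FALSE) |
      coalesce(lead(date) - date <= 30, FALSE))) %>%
  ungroup()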
This gives
# # A tibble: 10 x 4
# number email date dupe
# <dbl> <fct> <date> <dbl>
# 1 3748 John@gmail.com 2018-05-01 1
# 2 8894 John@gmail.com 2018-05-05 1
# 3 9923 Alex@outlook.com 2018-05-10 0
# 4 4502 Alan@yahoo.com 2018-05-15 0
# 5 7283 Stan@aol.com 2018-05-20 0
# 6 8012 Mary@outlook.com 2018-05-25 0
# 7 2938 Adam@outlook.com 2018-05-30 0
# 8 7485 Tom@aol.com 2018-06-01 0
# 9 1010 Jane@yahoo.com 2018-06-05 0
# 10 9877 John@gmail.com 2018-06-10 0
as required.
Edit: This implicitly assumes that your data is sorted by date. If it isn't, you'll need to add an extra sorting step.
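The extra sorting step could look something like the sketch below. The input `df_unsorted` and the helper columns `gap_prev` and `gap_next` are made up for illustration, and the inclusive `<= 30` threshold is an assumption taken from the question's edit.

```r
library(dplyr)

# Hypothetical unsorted input, using the same column names as this answer.
df_unsorted <- data.frame(
  number = c(9877, 3748, 4502, 8894),
  email  = c("John@gmail.com", "John@gmail.com", "Alan@yahoo.com", "John@gmail.com"),
  date   = as.Date(c("2018-06-10", "2018-05-01", "2018-05-15", "2018-05-05")),
  stringsAsFactors = FALSE
)

result <- df_unsorted %>%
  arrange(date) %>%  # sort first so lag()/lead() see neighbours in date order
  group_by(email) %>%
  mutate(
    gap_prev = as.numeric(date - lag(date)),   # days since the previous entry
    gap_next = as.numeric(lead(date) - date),  # days until the next entry
    # NA gaps (no neighbour on that side) count as "no match"
    dupe = as.numeric(coalesce(gap_prev <= 30, FALSE) |
                        coalesce(gap_next <= 30, FALSE))
  ) %>%
  ungroup() %>%
  select(-gap_prev, -gap_next)

print(result)
```

Note that `arrange()` changes the row order of the result. If the original order matters, one option is to add a row index before sorting and arrange by it again at the end.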
Answer 1: (score: 0)
I think this does what you need:
library(dplyr)
df %>%
  group_by(EmailAddress) %>%
  # helper numbers the runs of entries whose gap to the previous entry is at most 30 days
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = 'days') <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(DuplicateEmailFlag = (n() >= 2)*1) %>%
  ungroup() %>%
  select(-helper)
# A tibble: 10 x 4
AccountNumbers EmailAddress Dates DuplicateEmailFlag
<dbl> <chr> <date> <dbl>
1 3748 John@gmail.com 2018-05-01 1
2 8894 John@gmail.com 2018-05-05 1
3 9923 Alex@outlook.com 2018-05-10 0
4 4502 Alan@yahoo.com 2018-05-15 0
5 7283 Stan@aol.com 2018-05-20 0
6 8012 Mary@outlook.com 2018-05-25 0
7 2938 Adam@outlook.com 2018-05-30 0
8 7485 Tom@aol.com 2018-06-01 0
9 1010 Jane@yahoo.com 2018-06-05 0
10 9877 John@gmail.com 2018-06-10 0
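To see what the helper column is doing, it may help to keep the intermediate values for a single email address. This is only an illustrative sketch: the data in `demo` and the column names `gap_days` and `new_run` are invented here and are not part of the answer above.

```r
library(dplyr)

# Hypothetical dates for one email address, already sorted by date.
demo <- data.frame(
  EmailAddress = "John@gmail.com",
  Dates = as.Date(c("2018-05-01", "2018-05-05", "2018-06-10", "2018-06-20")),
  stringsAsFactors = FALSE
)

demo_runs <- demo %>%
  group_by(EmailAddress) %>%
  mutate(
    # Days since the previous entry for this email (NA for the first entry).
    gap_days = as.numeric(difftime(Dates, lag(Dates), units = "days")),
    # 1 marks the start of a new run (gap over 30 days); the NA first row becomes 0.
    new_run  = coalesce(if_else(gap_days <= 30, 0, 1), 0),
    # The running total gives each run its own id.
    helper   = cumsum(new_run)
  ) %>%
  ungroup()

print(demo_runs)
```

Here the first two rows share one helper value and the last two share another, so counting rows per (EmailAddress, helper) pair tells you how many entries fall in each run.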
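I think @Lyngbakr's solution is better for the case you describe. Mine is more suitable if the size of the duplicate group might change (for example, if you want to check for 3 or 4 entries within 30 days of each other rather than 2).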
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- as.Date(c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10"))
df <- data.frame(AccountNumbers,EmailAddress,Dates, stringsAsFactors = FALSE)
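As a rough sketch of the variable group size mentioned above, the hard-coded 2 can be pulled out into a parameter. The name `min_group_size` and the data in `accounts` are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical parameter: how many entries a run needs before it is flagged.
min_group_size <- 3

# Hypothetical data: one email with three close entries, one with two.
accounts <- data.frame(
  AccountNumbers = c(1, 2, 3, 4, 5),
  EmailAddress = c("a@example.com", "a@example.com", "a@example.com",
                   "b@example.com", "b@example.com"),
  Dates = as.Date(c("2018-05-01", "2018-05-10", "2018-05-20",
                    "2018-05-01", "2018-05-03")),
  stringsAsFactors = FALSE
)

flagged <- accounts %>%
  group_by(EmailAddress) %>%
  # Start a new run whenever the gap to the previous entry exceeds 30 days.
  mutate(helper = cumsum(coalesce(
    if_else(difftime(Dates, lag(Dates), units = "days") <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  # Flag every entry in a run that has at least min_group_size members.
  mutate(DuplicateEmailFlag = (n() >= min_group_size) * 1) %>%
  ungroup() %>%
  select(-helper)

print(flagged)
```

One caveat with this approach: runs are built from consecutive gaps, so the first and last entries of a long run can be more than 30 days apart even though each neighbouring pair is within 30 days.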