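In an R data frame, I want to add a new column in which an email address is set to NA when that address has more than 2 similar start years.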
start_year email
2016 a@a.com
2016 a@a.com
2016 a@a.com
2015 a@a.com
2015 a@a.com
2014 a@a.com
2015 b@b.com
2014 b@b.com
2014 b@b.com
2015 c@c.com
Desired result (a@a.com has 3 similar start years in 2016, so it is set to NA in the new column):
start_year email email_new
2016 a@a.com NA
2016 a@a.com NA
2016 a@a.com NA
2015 a@a.com NA
2015 a@a.com NA
2014 a@a.com NA
2015 b@b.com b@b.com
2014 b@b.com b@b.com
2014 b@b.com b@b.com
2015 c@c.com c@c.com
What I have so far fails with the error "All select() inputs must resolve to integer column positions.":
result <- df %>%
group_by(email) %>%
select(length(unique(start_year)) > 2)
Any help is much appreciated.
Answer 0 (score: 0)
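As I understand it, there are two conditions for converting email to NA. Using dplyr:
1) at least 3 of the start_year values are the same
2) there are more than 2 observations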
library(dplyr)
df %>%
  group_by(email) %>%
  # NA_character_ gives a real missing value; the string 'NA' would not
  mutate(new = ifelse(length(which(table(start_year) > 2)) > 0 & n() > 2, NA_character_, as.character(email)))
#Source: local data frame [7 x 3]
#Groups: email [3]
# start_year email new
# <int> <chr> <chr>
#1 2016 a@a.com <NA>
#2 2016 a@a.com <NA>
#3 2016 a@a.com <NA>
#4 2015 b@b.com b@b.com
#5 2014 b@b.com b@b.com
#6 2014 b@b.com b@b.com
#7 2015 c@c.com c@c.com
Adding another b@b.com row for 2014, so that this email also has 3 identical start years, gives:
df1 %>%
group_by(email) %>%
  mutate(new = ifelse(length(which(table(start_year) > 2)) > 0 & n() > 2, NA_character_, as.character(email)))
#Source: local data frame [8 x 3]
#Groups: email [3]
# start_year email new
# <dbl> <chr> <chr>
#1 2016 a@a.com <NA>
#2 2016 a@a.com <NA>
#3 2016 a@a.com <NA>
#4 2015 b@b.com <NA>
#5 2014 b@b.com <NA>
#6 2014 b@b.com <NA>
#7 2014 b@b.com <NA>
#8 2015 c@c.com c@c.com
Data
dput(df)
structure(list(start_year = c(2016L, 2016L, 2016L, 2015L, 2014L,
2014L, 2015L), email = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L
), .Label = c("a@a.com", "b@b.com", "c@c.com"), class = "factor")), .Names = c("start_year",
"email"), class = "data.frame", row.names = c(NA, -7L))
dput(df1)
structure(list(start_year = c(2016, 2016, 2016, 2015, 2014, 2014,
2014, 2015), email = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L), .Label = c("a@a.com", "b@b.com", "c@c.com"), class = "factor")), row.names = c(NA,
-8L), .Names = c("start_year", "email"), class = "data.frame")
Answer 1 (score: 0)
An attempt in base R:
tbl <- table(df$email, df$start_year) > 2
df$email_new <- ifelse(df$email %in% rownames(tbl)[row(tbl)[which(tbl)]], NA,
as.character(df$email))
df
# start_year email email_new
#1 2016 a@a.com <NA>
#2 2016 a@a.com <NA>
#3 2016 a@a.com <NA>
#4 2015 a@a.com <NA>
#5 2015 a@a.com <NA>
#6 2014 a@a.com <NA>
#7 2015 b@b.com b@b.com
#8 2014 b@b.com b@b.com
#9 2014 b@b.com b@b.com
#10 2015 c@c.com c@c.com
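First find the frequency of each email per year (table(df$email, df$start_year)). Then find the names of the emails whose frequency exceeds 2 in any particular year, and convert the rows for those emails to NA.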
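The steps above can be sketched with the intermediate objects made explicit. This is only an illustration, assuming the 10-row data from the question; the names counts and flagged are introduced here and are not part of the answer, and rowSums() is swapped in for the row()/which() indexing used above.

```r
# Example data, assumed to match the 10 rows shown in the question
df <- data.frame(
  start_year = c(2016, 2016, 2016, 2015, 2015, 2014, 2015, 2014, 2014, 2015),
  email = c("a@a.com", "a@a.com", "a@a.com", "a@a.com", "a@a.com", "a@a.com",
            "b@b.com", "b@b.com", "b@b.com", "c@c.com"),
  stringsAsFactors = FALSE
)

# Step 1: frequency of each email per year (rows: email, columns: year)
counts <- table(df$email, df$start_year)

# Step 2: emails for which at least one year occurs more than 2 times
flagged <- rownames(counts)[rowSums(counts > 2) > 0]

# Step 3: blank out the flagged emails in the new column
df$email_new <- ifelse(df$email %in% flagged, NA_character_, df$email)

print(counts)
print(flagged)  # "a@a.com"
```

Only a@a.com is flagged, because 2016 is the only email/year combination that occurs 3 times; b@b.com has at most 2 rows for any single year.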
Answer 2 (score: 0)
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by 'email'. If the number of unique 'start_year' values (uniqueN) is greater than 2, assign NA, or else assign 'email', to create 'email_new'.
library(data.table)
setDT(df1)[, email_new := if(uniqueN(start_year)>2) NA_character_ else email, by = email]
df1
# start_year email email_new
#1: 2016 a@a.com NA
#2: 2016 a@a.com NA
#3: 2016 a@a.com NA
#4: 2015 a@a.com NA
#5: 2015 a@a.com NA
#6: 2014 a@a.com NA
#7: 2015 b@b.com b@b.com
#8: 2014 b@b.com b@b.com
#9: 2014 b@b.com b@b.com
#10: 2015 c@c.com c@c.com
df1 <- structure(list(start_year = c(2016L, 2016L, 2016L, 2015L, 2015L,
2014L, 2015L, 2014L, 2014L, 2015L), email = c("a@a.com", "a@a.com",
"a@a.com", "a@a.com", "a@a.com", "a@a.com", "b@b.com", "b@b.com",
"b@b.com", "c@c.com")), .Names = c("start_year", "email"),
class = "data.frame", row.names = c(NA, -10L))
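One point worth checking before picking an approach: uniqueN(start_year) counts the distinct years per email, which is not the same test as "some single year occurs more than 2 times". The two agree on the question's data but can differ elsewhere. The sketch below is only an illustration in base R, assuming the 8-row data set from the first answer (where b@b.com has 2014 three times); the names n_distinct_years, max_year_count, flag_distinct and flag_repeated are made up for this example.

```r
# Assumed data: the 8-row variant used in the first answer
df1 <- data.frame(
  start_year = c(2016, 2016, 2016, 2015, 2014, 2014, 2014, 2015),
  email = c("a@a.com", "a@a.com", "a@a.com", "b@b.com", "b@b.com",
            "b@b.com", "b@b.com", "c@c.com"),
  stringsAsFactors = FALSE
)

# Criterion A: number of distinct start years per email (what uniqueN measures)
n_distinct_years <- tapply(df1$start_year, df1$email,
                           function(y) length(unique(y)))

# Criterion B: largest number of rows sharing one start year, per email
max_year_count <- tapply(df1$start_year, df1$email,
                         function(y) max(table(y)))

flag_distinct <- names(n_distinct_years)[n_distinct_years > 2]
flag_repeated <- names(max_year_count)[max_year_count > 2]

print(flag_distinct)  # character(0)
print(flag_repeated)  # "a@a.com" "b@b.com"
```

On this data no email has more than 2 distinct years, while both a@a.com and b@b.com have a year that occurs 3 times, so which criterion is right depends on what "more than 2 similar start years" is meant to say.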