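R: select email addresses with more than 2 identical start dates in a year

Time: 2016-09-02 07:51:26

Tags: r

In an R data frame, I want to add a new column in which every email address that has more than 2 identical start years is set to NA.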

start_year email
2016       a@a.com
2016       a@a.com
2016       a@a.com
2015       a@a.com
2015       a@a.com
2014       a@a.com
2015       b@b.com
2014       b@b.com
2014       b@b.com
2015       c@c.com

Result (a@a.com has 3 identical start years in 2016, so it is set to NA in the new column):

start_year email    email_new
2016       a@a.com  NA
2016       a@a.com  NA
2016       a@a.com  NA
2015       a@a.com  NA
2015       a@a.com  NA
2014       a@a.com  NA
2015       b@b.com  b@b.com
2014       b@b.com  b@b.com
2014       b@b.com  b@b.com
2015       c@c.com  c@c.com

So far, my attempt below fails with the error "All select() inputs must resolve to integer column positions.":

result <- df %>%
  group_by(email) %>%
  select(length(unique(start_year)) > 2)

Any help is much appreciated.

3 answers:

Answer 0: (score: 0)

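As I understand it, using dplyr, there are two conditions for converting email to NA.

1) at least 3 of the start_year values are identical

2) there are more than 2 observations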

 df %>% 
    group_by(email) %>% 
    # NA_character_ gives a real missing value; the string 'NA' would not
    mutate(new = ifelse(length(which(table(start_year) > 2)) > 0 & n()>2, NA_character_, as.character(email)))

#Source: local data frame [7 x 3]
#Groups: email [3]

#  start_year   email     new
#       <int>   <chr>   <chr>
#1       2016 a@a.com    <NA>
#2       2016 a@a.com    <NA>
#3       2016 a@a.com    <NA>
#4       2015 b@b.com b@b.com
#5       2014 b@b.com b@b.com
#6       2014 b@b.com b@b.com
#7       2015 c@c.com c@c.com

If we add another b@b.com row for 2014, so that this email also has three identical years, then

df1 %>% 
     group_by(email) %>% 
     # NA_character_ gives a real missing value; the string 'NA' would not
     mutate(new = ifelse(length(which(table(start_year) > 2)) > 0 & n()>2, NA_character_, as.character(email)))

#Source: local data frame [8 x 3]
#Groups: email [3]

#  start_year   email     new
#       <dbl>   <chr>   <chr>
#1       2016 a@a.com    <NA>
#2       2016 a@a.com    <NA>
#3       2016 a@a.com    <NA>
#4       2015 b@b.com    <NA>
#5       2014 b@b.com    <NA>
#6       2014 b@b.com    <NA>
#7       2014 b@b.com    <NA>
#8       2015 c@c.com c@c.com

Data

dput(df)
structure(list(start_year = c(2016L, 2016L, 2016L, 2015L, 2014L, 
2014L, 2015L), email = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L
), .Label = c("a@a.com", "b@b.com", "c@c.com"), class = "factor")), .Names = c("start_year", 
"email"), class = "data.frame", row.names = c(NA, -7L))

dput(df1)
structure(list(start_year = c(2016, 2016, 2016, 2015, 2014, 2014, 
2014, 2015), email = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 
3L), .Label = c("a@a.com", "b@b.com", "c@c.com"), class = "factor")), row.names = c(NA, 
-8L), .Names = c("start_year", "email"), class = "data.frame")
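
The rule described in this answer can also be sketched without dplyr. The snippet below is only an illustration, not part of the original answer: it rebuilds the 10-row sample from the question and uses base R's ave() to count how often each (email, start_year) pair occurs. The names pair_count and max_per_email are made up for this sketch.

```r
# Sample data copied from the question (10 rows).
df <- data.frame(
  start_year = c(2016, 2016, 2016, 2015, 2015, 2014, 2015, 2014, 2014, 2015),
  email = c(rep("a@a.com", 6), rep("b@b.com", 3), "c@c.com"),
  stringsAsFactors = FALSE
)

# Number of rows sharing each (email, start_year) pair.
pair_count <- ave(seq_along(df$email), df$email, df$start_year, FUN = length)

# Largest pair count seen for each email, repeated on every row of that email.
max_per_email <- ave(pair_count, df$email, FUN = max)

# Set the email to a real NA wherever some year occurs more than twice.
df$email_new <- ifelse(max_per_email > 2, NA_character_, df$email)
print(df)
```

Using NA_character_ keeps email_new a character column with genuine missing values, so is.na(df$email_new) works as expected.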

Answer 1: (score: 0)

An attempt to do this in base R:

tbl <- table(df$email, df$start_year) > 2
df$email_new <- ifelse(df$email %in% rownames(tbl)[row(tbl)[which(tbl)]], NA,
                                                        as.character(df$email))
df
#  start_year  email    email_new
#1        2016 a@a.com      <NA>
#2        2016 a@a.com      <NA>
#3        2016 a@a.com      <NA>
#4        2015 a@a.com      <NA>
#5        2015 a@a.com      <NA>
#6        2014 a@a.com      <NA>
#7        2015 b@b.com   b@b.com
#8        2014 b@b.com   b@b.com
#9        2014 b@b.com   b@b.com
#10       2015 c@c.com   c@c.com

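Find the frequency of each email per year (table(df$email, df$start_year)). Then find the names of the emails whose frequency exceeds 2 in any single year, and convert those rows to NA.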
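
To make the intermediate steps easier to follow, here is a small sketch of the same idea on the question's sample data. It is a variation rather than the answer's exact code: it picks the flagged emails with rowSums() instead of row() and which(), and the names counts and flagged are assumptions introduced for illustration.

```r
# Sample data copied from the question (10 rows).
df <- data.frame(
  start_year = c(2016, 2016, 2016, 2015, 2015, 2014, 2015, 2014, 2014, 2015),
  email = c(rep("a@a.com", 6), rep("b@b.com", 3), "c@c.com"),
  stringsAsFactors = FALSE
)

# Contingency table: one row per email, one column per start_year.
counts <- table(df$email, df$start_year)
print(counts)

# Emails that have at least one year with a count above 2.
flagged <- rownames(counts)[rowSums(counts > 2) > 0]

# Replace flagged emails with NA, keep the others unchanged.
df$email_new <- ifelse(df$email %in% flagged, NA_character_, df$email)
print(df)
```

Printing counts first shows which cell triggers the rule, which helps when checking the threshold on real data.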

Answer 2: (score: 0)

We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by 'email'. If the number of unique start_year values (uniqueN) is greater than 2, we assign NA, else 'email', to create 'email_new'.

library(data.table)
setDT(df1)[, email_new := if(uniqueN(start_year)>2) NA_character_ else email,  by = email]

df1
#    start_year   email email_new
#1:       2016 a@a.com        NA
#2:       2016 a@a.com        NA
#3:       2016 a@a.com        NA
#4:       2015 a@a.com        NA
#5:       2015 a@a.com        NA
#6:       2014 a@a.com        NA
#7:       2015 b@b.com   b@b.com
#8:       2014 b@b.com   b@b.com
#9:       2014 b@b.com   b@b.com
#10:      2015 c@c.com   c@c.com

Data

df1 <- structure(list(start_year = c(2016L, 2016L, 2016L, 2015L, 2015L, 
2014L, 2015L, 2014L, 2014L, 2015L), email = c("a@a.com", "a@a.com", 
"a@a.com", "a@a.com", "a@a.com", "a@a.com", "b@b.com", "b@b.com", 
"b@b.com", "c@c.com")), .Names = c("start_year", "email"), 
 class = "data.frame", row.names = c(NA, -10L))
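
One point worth checking before reusing this approach: counting unique start_year values per email is not the same test as asking whether a single year repeats more than twice, even though both give the same result on the sample data. The base R sketch below uses hypothetical data (one email with 2014 three times and 2015 once) where the two tests disagree. The names d, distinct_years and max_repeats are invented for this illustration.

```r
# Hypothetical data: b@b.com has 2014 three times but only two distinct years.
d <- data.frame(
  start_year = c(2015, 2014, 2014, 2014),
  email = rep("b@b.com", 4),
  stringsAsFactors = FALSE
)

# Test 1: number of distinct years per email.
distinct_years <- tapply(d$start_year, d$email, function(y) length(unique(y)))

# Test 2: highest number of times any single year repeats per email.
max_repeats <- tapply(d$start_year, d$email, function(y) max(table(y)))

print(distinct_years > 2)  # FALSE for b@b.com: only 2 distinct years
print(max_repeats > 2)     # TRUE for b@b.com: 2014 occurs 3 times
```

Which test is appropriate depends on whether "more than 2 similar start years" means repeated years or distinct years, so it is worth confirming against the expected output.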