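Combine two rows and remove duplicates when merging data in R

Time: 2017-12-29 04:14:17

Tags: r database csv data-cleaning

I am cleaning a database of emails and their associated information. Some emails appear more than once, but the information in one row complements the information in another, so I want to combine rows using the email as the key. Where the information is simply duplicated, the duplicate row should be removed.

My database is a csv file, which I convert to a data frame with read.csv.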
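For context, here is a minimal sketch of the import step. It assumes the csv has the four columns shown in this question; the inline `csv_text` string stands in for the real file, whose path is not given. Passing `na.strings = ""` is one possible choice that turns blank fields into `NA` instead of empty strings, which affects how the blanks must be filtered later.

```r
# Inline stand-in for the real csv file (assumption: same four columns)
csv_text <- "EMAIL,Country,Gender,Language
y@y.com,US,,S
z@z.com,AR,female,S
z@z.com,,female,
s@f.com,US,female,E
s@f.com,US,female,E
y@y.com,US,male,"

# Default behaviour: blank character fields are read as empty strings ""
df_blank <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings = "", blank fields are read as NA instead
df_na <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

print(df_blank)
print(df_na)
```

With a real file, the `text = csv_text` argument would be replaced by the file path.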

Input

  EMAIL     Country     Gender        Language
1 y@y.com   US                        S
2 z@z.com   AR          female        S
3 z@z.com               female
4 s@f.com   US          female        E
5 s@f.com   US          female        E
6 y@y.com   US          male

Output

  EMAIL     Country     Gender        Language
1 y@y.com   US           male            S
2 z@z.com   AR           female          S
3 s@f.com   US           female          E

2 answers:

Answer 0 (score: 2)

We can use dplyr. After grouping by 'EMAIL', use summarise_all to get the unique elements of each column that are not blank.

library(dplyr)
df %>%
   group_by(EMAIL) %>%
   summarise_all(funs(unique(.[.!='']))) 
# A tibble: 3 x 4
# Groups: EMAIL [3]
#  EMAIL   Country Gender Language
#  <chr>   <chr>   <chr>  <chr>   
#1 y@y.com US      male   S       
#2 z@z.com AR      female S       
#3 s@f.com US      female E

Data

df <- structure(list(EMAIL = c("y@y.com", "z@z.com", "z@z.com", "s@f.com", 
"s@f.com", "y@y.com"), Country = c("US", "AR", "", "US", "US", 
"US"), Gender = c("", "female", "female", "female", "female", 
"male"), Language = c("S", "S", "", "E", "E", "")), .Names = c("EMAIL", 
 "Country", "Gender", "Language"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))
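
As a side note, `funs()` was deprecated in later dplyr releases, and `across()` (available from dplyr 1.0.0) covers the same use case. The following is a sketch of the same idea in that style, not the answer author's code. The helper `first_non_blank` is a hypothetical name introduced here; it keeps the first non-blank value per group and returns `NA` when a group has none, whereas the `unique()` call above would return several values (or none) in those cases.

```r
library(dplyr)

# Same sample data as in the answer, with blanks stored as ""
df <- data.frame(
  EMAIL    = c("y@y.com", "z@z.com", "z@z.com", "s@f.com", "s@f.com", "y@y.com"),
  Country  = c("US", "AR", "", "US", "US", "US"),
  Gender   = c("", "female", "female", "female", "female", "male"),
  Language = c("S", "S", "", "E", "E", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first value that is neither NA nor "", else NA
first_non_blank <- function(x) {
  vals <- x[!is.na(x) & x != ""]
  if (length(vals) > 0) vals[1] else NA_character_
}

# One row per EMAIL; every other column is collapsed with the helper
merged <- df %>%
  group_by(EMAIL) %>%
  summarise(across(everything(), first_non_blank), .groups = "drop")

print(merged)
```

Taking the first non-blank value is an assumption about how conflicts should be resolved; if two rows for the same email disagree, this silently keeps whichever comes first.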

Answer 1 (score: 2)

We can also use aggregate as a base R option:

df_out <- aggregate(x=df, by=list(df$EMAIL), function(x) { max(x, na.rm=TRUE) })
df_out[order(df_out$EMAIL), -1]

    EMAIL Country Gender Language
1 s@f.com      US female        E
2 y@y.com      US   male        S
3 z@z.com      AR female        S

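The basic idea here is that we arbitrarily take the maximum of each column for each email key, while ignoring NA values. This appears to work for your dataset, because each column holds at most one distinct non-missing value per email, so the maximum is simply that value.

Data: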

df <- data.frame(EMAIL=c('y@y.com', 'z@z.com', 'z@z.com', 's@f.com', 's@f.com', 'y@y.com'),
                 Country=c('US', 'AR', NA, 'US', 'US', 'US'),
                 Gender=c(NA, 'female', 'female', 'female', 'female', 'male'),
                 Language=c('S', 'S', NA, 'E', 'E', NA), stringsAsFactors=FALSE)
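
Because taking the maximum is an arbitrary choice, it can help to check that no email has conflicting values before relying on it. The following base R sketch does that for the sample data; the names `n_vals` and `conflicts` are illustrative and not part of the answer above.

```r
# Same sample data as in the answer, with blanks stored as NA
df <- data.frame(EMAIL=c('y@y.com', 'z@z.com', 'z@z.com', 's@f.com', 's@f.com', 'y@y.com'),
                 Country=c('US', 'AR', NA, 'US', 'US', 'US'),
                 Gender=c(NA, 'female', 'female', 'female', 'female', 'male'),
                 Language=c('S', 'S', NA, 'E', 'E', NA), stringsAsFactors=FALSE)

# For each email, count the distinct non-missing values in every other column
n_vals <- aggregate(df[, -1], by = list(EMAIL = df$EMAIL),
                    FUN = function(x) length(unique(x[!is.na(x)])))

# Emails where some column has more than one distinct value
conflicts <- n_vals[rowSums(n_vals[, -1] > 1) > 0, ]
print(conflicts)

# max() on character vectors compares strings, and na.rm = TRUE drops NA,
# so a single non-missing value is returned unchanged
single <- max(c(NA, "male"), na.rm = TRUE)

# With two different values, max() keeps the one that sorts last
picked <- max(c("female", "male"))
```

For the sample data `conflicts` has no rows, which is why the maximum gives the expected result. If it did contain rows, those emails would need a deliberate rule for choosing between values.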
