Merge 3 columns based on unique values?

Asked: 2019-03-26 18:07:04

Tags: r dataframe merge dplyr unique

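I am trying to merge 3 columns into one. The column values are separated by ";", and the new column needs to unpack the values from all 3 columns and keep only the unique ones. I know how to merge the columns, but I am struggling to split the values in each row, find the unique ones, and put them into a new column.

Here is some dummy data: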

n = c(2, 3, 5,10) 
s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn") 
b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA") 
t = c("aa;bb;cc", "bb;dd", "kk","NA") 
df = data.frame(n, s, b,t)

> df
   n        s        b        t
1  2 aa;bb;cc aa;bb;cc aa;bb;cc
2  3 bb;dd;aa bb;dd;cc    bb;dd
3  5       NA zz;bb;yy       kk
4 10    xx;nn       NA       NA

The expected output is:

> df
   n  finalcol
1  2 aa;bb;cc
2  3 bb;dd;aa;cc
3  5 zz;bb;yy;kk
4 10 xx;nn

This is the simple merge I have so far:

dff = df %>% unite(finalcol, c(s,b,t), sep = ";", remove = TRUE)

1 answer:

Answer 0 (score: 3)

Since you mentioned unite, I wanted to show a solution that uses separate (the complement of unite).

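This solution stays within the tidyverse, which makes it easy to follow what is happening step by step. @d.b's answer in the comments works well, is compact, and probably runs faster, but it has a steeper learning curve if you want to understand what is going on. With a piped tidyverse solution, you can run the pipeline one line at a time and inspect the result of each step.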
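For comparison, a compact base R approach might look like the sketch below. This is not @d.b's code from the comments (which is not reproduced here), only an illustration of the same idea. merge_unique is a made-up helper name, and the sketch assumes that the string "NA" stands for a missing value.

```r
# Same dummy data as in the question (missing values are the string "NA").
df <- data.frame(
  n = c(2, 3, 5, 10),
  s = c("aa;bb;cc", "bb;dd;aa", "NA", "xx;nn"),
  b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy", "NA"),
  t = c("aa;bb;cc", "bb;dd", "kk", "NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from any package): split every input string on
# `sep`, drop real NA values and "NA" strings, keep the first occurrence of
# each remaining piece, and paste the pieces back together.
merge_unique <- function(..., sep = ";") {
  pieces <- unlist(strsplit(c(...), sep, fixed = TRUE))
  pieces <- pieces[!is.na(pieces) & pieces != "NA"]
  paste(unique(pieces), collapse = sep)
}

# Apply the helper row by row across the three columns.
result <- data.frame(
  n = df$n,
  finalcol = mapply(merge_unique, df$s, df$b, df$t, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
result
```

It is short, but everything happens inside one helper, so there are no intermediate tables to inspect along the way.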

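This solution first uses separate to split the terms into their own columns, then uses gather to convert the data from wide to long format. In long format we can check for and handle both NA and "NA" (with na_if and drop_na), and then use distinct to keep only the unique values within each group that shares the same "id", i.e. the items that came from the same original row. Finally, it uses summarise and paste to get back to the original format; spread followed by unite would also work. (Note that na.rm = TRUE is a new feature of unite: https://github.com/tidyverse/tidyr/issues/203)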
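To illustrate the na.rm note, here is a small sketch on made-up toy data (the toy data frame and its column names are assumptions for illustration only). It assumes tidyr 1.0.0 or later, where unite has the na.rm argument, and dplyr 1.0.0 or later for across.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): one real NA and one "NA" string.
toy <- data.frame(
  id = 1:2,
  a = c("aa", "NA"),
  b = c(NA, "bb"),
  c = c("aa", "cc"),
  stringsAsFactors = FALSE
)

united <- toy %>%
  # unite(na.rm = TRUE) only drops real NA values, so convert "NA" strings first.
  mutate(across(c(a, b, c), ~ na_if(.x, "NA"))) %>%
  unite(finalcol, a, b, c, sep = ";", na.rm = TRUE)
united
#>   id finalcol
#> 1  1    aa;aa
#> 2  2    bb;cc
```

The missing values are gone, but the duplicate in the first row survives. That is why a deduplication step such as distinct on the long data is still needed before uniting.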

Sources: I used these handy dplyr and tidyr cheat sheets: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf. I also based the solution on the comments, question, and answers here: How do I remove NAs with the tidyr::unite function?

# Load packages and data
library(tidyverse)
df = data.frame(n = c(2, 3, 5,10), 
                s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn"),
                b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA"), 
                t = c("aa;bb;cc", "bb;dd", "kk", NA))

# Solution
dff <- df %>% 
  separate(col = "s", into = c("s1", "s2", "s3")) %>%
  separate(col = "b", into = c("b1", "b2", "b3")) %>%
  # The solution here could be enhanced to take in n columns and put them into
  # however many columns as needed, using map or apply.
  separate(col = "t", into = c("t1", "t2", "t3")) %>%
  rowid_to_column('id') %>% 
  gather(key, value, -(id:n)) %>% 
  mutate_at(vars(value), na_if, "NA") %>%
  drop_na(value) %>%
  group_by(id) %>%
  distinct(value, .keep_all = TRUE) %>%
  summarise(n = first(n), finalcol = paste(value, collapse = ';')) %>% 
  ungroup() %>% 
  select(-id)
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3,
#> 4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [2,
#> 3].
dff
#> # A tibble: 4 x 2
#>       n finalcol   
#>   <dbl> <chr>      
#> 1     2 aa;bb;cc   
#> 2     3 bb;dd;aa;cc
#> 3     5 zz;bb;yy;kk
#> 4    10 xx;nn

Created on 2019-03-26 by the reprex package (v0.2.1)
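As the comment in the code suggests, the pipeline could be generalised so that it does not depend on there being exactly three columns with at most three pieces each. One possible sketch, which is not part of the tested reprex above, uses pivot_longer and separate_rows. It assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later, and dff2 is just an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  n = c(2, 3, 5, 10),
  s = c("aa;bb;cc", "bb;dd;aa", "NA", "xx;nn"),
  b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy", "NA"),
  t = c("aa;bb;cc", "bb;dd", "kk", NA),
  stringsAsFactors = FALSE
)

dff2 <- df %>%
  mutate(id = row_number()) %>%
  # Go long first: every column except id and n becomes key/value rows.
  pivot_longer(cols = -c(id, n), names_to = "key", values_to = "value") %>%
  # One row per ";"-separated piece, however many pieces there are.
  separate_rows(value, sep = ";") %>%
  # Treat the string "NA" as missing, then drop all missing values.
  mutate(value = na_if(value, "NA")) %>%
  drop_na(value) %>%
  # Keep the first occurrence of each value within an original row.
  distinct(id, n, value) %>%
  group_by(id, n) %>%
  summarise(finalcol = paste(value, collapse = ";"), .groups = "drop") %>%
  select(-id)
dff2
```

Because the columns are chosen with -c(id, n), adding a fourth ";"-separated column to df needs no change to the pipeline. As with the original solution, a row whose values are all missing would drop out of the result entirely.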