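I am trying to combine 3 columns into one. The column values are separated by ";", and the new column needs to split the values of all 3 columns and keep only the unique ones. I know how to merge the columns, but I am struggling to split the row values of the 3 columns, find the unique values, and put them into another column.
Here is some dummy data: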
n = c(2, 3, 5,10)
s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn")
b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA")
t = c("aa;bb;cc", "bb;dd", "kk","NA")
df = data.frame(n, s, b,t)
> df
   n        s        b        t
1  2 aa;bb;cc aa;bb;cc aa;bb;cc
2  3 bb;dd;aa bb;dd;cc    bb;dd
3  5       NA zz;bb;yy       kk
4 10    xx;nn       NA       NA
The expected output is:
> df
   n    finalcol
1  2    aa;bb;cc
2  3 bb;dd;aa;cc
3  5 zz;bb;yy;kk
4 10       xx;nn
This is the simple merge I have so far:
dff = df %>% unite(finalcol, c(s,b,t), sep = ";", remove = TRUE)
Answer 0 (score: 3)
Since you mentioned unite(), I wanted to show a solution using separate() (the complement of unite()).
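This solution stays within the tidyverse, which makes it easy to follow what is happening step by step. @d.b's answer in the comments is efficient and compact, and probably runs faster, but it has a steeper learning curve if you want to understand what is going on. With a piped tidyverse solution you can run each line in turn and inspect the result.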
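For comparison, here is a minimal base R sketch of the compact style of approach. It is not necessarily the code from the comment mentioned above, which is not reproduced here, and combine_unique() is a hypothetical helper name introduced only for illustration. It assumes that both real NA values and the literal string "NA" should be dropped.

```r
# Dummy data from the question (t uses a real NA, as in the answer's data)
df <- data.frame(
  n = c(2, 3, 5, 10),
  s = c("aa;bb;cc", "bb;dd;aa", "NA", "xx;nn"),
  b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy", "NA"),
  t = c("aa;bb;cc", "bb;dd", "kk", NA),
  stringsAsFactors = FALSE
)

# combine_unique() is a hypothetical helper, not part of any package.
# It splits every input string on `sep`, drops NA and "NA",
# keeps the first occurrence of each value, and pastes the rest back together.
combine_unique <- function(..., sep = ";") {
  pieces <- unlist(strsplit(c(...), sep, fixed = TRUE))
  pieces <- pieces[!is.na(pieces) & pieces != "NA"]
  paste(unique(pieces), collapse = sep)
}

# Apply the helper row by row across the three columns
df$finalcol <- mapply(combine_unique, df$s, df$b, df$t, USE.NAMES = FALSE)
df[, c("n", "finalcol")]
```

Because unique() keeps the first occurrence of each value, the pieces should come out in the same order as in the expected output.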
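The solution first splits the terms with separate(), then uses gather() to convert the data from wide to long format. In long format we can check for and handle both real NA values and the string "NA" (na_if(), then drop_na()), and then use distinct() to keep only the unique values within each group that shares the same "id", i.e. the items that came from the same original row. Finally, it uses summarise() and paste() to get back to the original format, although you could also use spread() followed by unite(). (Note that na.rm = TRUE is a new feature of unite(): https://github.com/tidyverse/tidyr/issues/203)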
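As a small illustration of the na.rm argument, the sketch below unites three columns of a toy data frame. The data frame and its column names (wide, v1, v2, v3) are made up for this example, and it assumes a tidyr version that includes na.rm (1.0.0 or later, as far as I know).

```r
library(tidyr)

# Toy wide data, made up for illustration: the second row has missing pieces
wide <- data.frame(
  id = 1:2,
  v1 = c("aa", "xx"),
  v2 = c("bb", NA),
  v3 = c("cc", NA),
  stringsAsFactors = FALSE
)

# With na.rm = TRUE, missing values are dropped before the pieces are pasted,
# so the second row does not end up with "NA" text in it
united <- unite(wide, finalcol, v1:v3, sep = ";", na.rm = TRUE)
united
```

Note that na.rm only removes real missing values. The literal string "NA" would still need to be converted first, for example with na_if().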
Sources: I used these handy dplyr and tidyr cheat sheets:
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf
I also based the solution on the comments, question and answers here: How do I remove NAs with the tidyr::unite function?
# Load packages and data
library(tidyverse)
df = data.frame(n = c(2, 3, 5, 10),
                s = c("aa;bb;cc", "bb;dd;aa", "NA", "xx;nn"),
                b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy", "NA"),
                t = c("aa;bb;cc", "bb;dd", "kk", NA))
# Solution
# The separate() steps could be enhanced to take in n columns and put them
# into however many columns are needed, using map or apply.
dff <- df %>%
  separate(col = "s", into = c("s1", "s2", "s3")) %>%
  separate(col = "b", into = c("b1", "b2", "b3")) %>%
  separate(col = "t", into = c("t1", "t2", "t3")) %>%
  rowid_to_column('id') %>%
  gather(key, value, -(id:n)) %>%
  mutate_at(vars(value), na_if, "NA") %>%
  drop_na(value) %>%
  group_by(id) %>%
  distinct(value, .keep_all = TRUE) %>%
  summarise(n = first(n), finalcol = paste(value, collapse = ';')) %>%
  ungroup() %>%
  select(-id)
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3,
#> 4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [2,
#> 3].
dff
#> # A tibble: 4 x 2
#>       n finalcol
#>   <dbl> <chr>
#> 1     2 aa;bb;cc
#> 2     3 bb;dd;aa;cc
#> 3     5 zz;bb;yy;kk
#> 4    10 xx;nn
Created on 2019-03-26 by the reprex package (v0.2.1)
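The comment in the solution suggests it could be generalized so that the number of pieces per column is not fixed at three. One possible sketch, assuming tidyr 1.0.0 or later (for pivot_longer()) and dplyr 1.0.0 or later (for the .groups argument), is to go long first and then split with separate_rows(). This is an alternative arrangement of the same idea, not the code above.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  n = c(2, 3, 5, 10),
  s = c("aa;bb;cc", "bb;dd;aa", "NA", "xx;nn"),
  b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy", "NA"),
  t = c("aa;bb;cc", "bb;dd", "kk", NA),
  stringsAsFactors = FALSE
)

dff <- df %>%
  mutate(id = row_number()) %>%                    # remember the original row
  pivot_longer(cols = c(s, b, t),
               names_to = "key", values_to = "value") %>%
  separate_rows(value, sep = ";") %>%              # one row per piece, any number of pieces
  filter(!is.na(value), value != "NA") %>%         # drop real NA and the string "NA"
  distinct(id, n, value) %>%                       # unique pieces per original row
  group_by(id, n) %>%
  summarise(finalcol = paste(value, collapse = ";"), .groups = "drop") %>%
  select(-id)

dff
```

Since no fixed `into` vector is needed, this form should not emit the "Expected 3 pieces" warnings, and adding more source columns only requires extending the `cols` selection.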