Compare a string column with another column row by row

Time: 2019-05-31 05:55:12

Tags: r

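I have a data frame like the one below. How can I compare the values in the two columns? For example, row 1 has a common string ("SZY") in both col a and col b, while col a has an extra string ("ABC"). For row 5, the common string is "BNM", and both col a and col b contain extra strings.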

a=c("ABC,SZY","XYZ",NA,NA,"ABC,BNM,JKL","DEF","XCV")
b=c("SZY","XYZ,IOP","QWE",NA,"BNM,JKL,STU","DEF","HJK")
df = data.frame(a,b)

The output should be as follows:

output = c("COMMON+column_a","COMMON+column_b","DIFFERENT",NA,"COMMON+column_a+column_b","COMMON","DIFFERENT")
df = cbind(df,output)
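
For reference, the labelling requested here can be sketched in base R as follows. This is only an illustration under stated assumptions, not one of the posted answers: the helper name compare_cells is hypothetical, cells are assumed to be comma-separated with no surrounding spaces, and a row with exactly one missing cell is treated as "DIFFERENT", as in the expected output for row 3.

```r
# Hypothetical helper (not from the original post): label one pair of cells.
compare_cells <- function(a, b) {
  # If both cells are missing there is nothing to compare.
  if (is.na(a) && is.na(b)) return(NA_character_)
  # Split each cell on commas; a missing cell contributes no tokens.
  x <- if (is.na(a)) character(0) else strsplit(a, ",", fixed = TRUE)[[1]]
  y <- if (is.na(b)) character(0) else strsplit(b, ",", fixed = TRUE)[[1]]
  # No shared token at all.
  if (length(intersect(x, y)) == 0) return("DIFFERENT")
  # At least one shared token; note which side has extra tokens.
  parts <- c("COMMON",
             if (length(setdiff(x, y)) > 0) "column_a",
             if (length(setdiff(y, x)) > 0) "column_b")
  paste(parts, collapse = "+")
}

a <- c("ABC,SZY", "XYZ", NA, NA, "ABC,BNM,JKL", "DEF", "XCV")
b <- c("SZY", "XYZ,IOP", "QWE", NA, "BNM,JKL,STU", "DEF", "HJK")
df <- data.frame(a, b, stringsAsFactors = FALSE)

# Apply the helper to each row; unname() drops the names mapply() derives from a.
df$output <- unname(mapply(compare_cells, df$a, df$b))
```

The set functions intersect() and setdiff() do the actual comparison, so the order of tokens within a cell does not matter.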

3 answers:

Answer 0 (score: 1)

Here is another option in base R:

vapply(strsplit(do.call(paste, df), " |,"), function(x) 
                               toString(unique(x[x != 'NA'])), character(1L))

#[1] "ABC, SZY"    "XYZ, IOP"    "QWE"    ""    "ABC, BNM, JKL, STU" "DEF"   "XCV, HJK"
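
To make the one-liner above easier to follow, here is a minimal step-by-step sketch of the same idea. The intermediate variable names (pasted, tokens, merged) are illustrative only, and the data frame is rebuilt with stringsAsFactors = FALSE as an assumption so the example is self-contained.

```r
a <- c("ABC,SZY", "XYZ", NA, NA, "ABC,BNM,JKL", "DEF", "XCV")
b <- c("SZY", "XYZ,IOP", "QWE", NA, "BNM,JKL,STU", "DEF", "HJK")
df <- data.frame(a, b, stringsAsFactors = FALSE)

# Step 1: paste the columns of each row together, separated by a space.
# paste() turns a missing value into the literal text "NA".
pasted <- do.call(paste, df)

# Step 2: split each pasted row on either a space or a comma.
tokens <- strsplit(pasted, " |,")

# Step 3: drop the literal "NA" tokens, keep unique values, and
# collapse them into one comma-separated string per row.
merged <- vapply(tokens,
                 function(x) toString(unique(x[x != "NA"])),
                 character(1L))
```

One consequence of this approach is that a genuine token spelled "NA" would be dropped along with the missing values.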

Answer 1 (score: 0)

Using base R apply, we can split the strings on commas, remove the NA entries, keep only the unique values, and convert them back into a comma-separated string.

df$output <- apply(df, 1, function(x) 
                  toString(unique(na.omit(unlist(strsplit(x, ","))))))

df
#            a           b             output
#1     ABC,SZY         SZY           ABC, SZY
#2         XYZ     XYZ,IOP           XYZ, IOP
#3        <NA>         QWE                QWE
#4        <NA>        <NA>                   
#5 ABC,BNM,JKL BNM,JKL,STU ABC, BNM, JKL, STU
#6         DEF         DEF                DEF
#7         XCV         HJK           XCV, HJK
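
In the output above, row 4 (both cells missing) becomes an empty string. If NA is preferred there, as in the question's expected output, a small variant could look like the sketch below. The helper name merge_row is hypothetical, and the columns are assumed to be character.

```r
a <- c("ABC,SZY", "XYZ", NA, NA, "ABC,BNM,JKL", "DEF", "XCV")
b <- c("SZY", "XYZ,IOP", "QWE", NA, "BNM,JKL,STU", "DEF", "HJK")
df <- data.frame(a, b, stringsAsFactors = FALSE)

# Hypothetical helper: merge the tokens of one row, returning NA
# when the row has no non-missing tokens at all.
merge_row <- function(x) {
  tokens <- unique(na.omit(unlist(strsplit(x, ","))))
  if (length(tokens) == 0) NA_character_ else toString(tokens)
}

# Restrict apply() to the two input columns and drop any names it attaches.
df$output <- unname(apply(df[c("a", "b")], 1, merge_row))
```

Selecting df[c("a", "b")] explicitly keeps the result stable if the data frame later gains other columns.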

Answer 2 (score: 0)

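Here is an option with cSplit. After creating a row-name column, we split the dataset columns at the delimiter , into 'long' format. Then, grouped by 'rn', we Reduce the column elements with union and assign the result as the 'output' column of the original dataset.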

library(data.table)
library(splitstackshape)
df$output <- cSplit(setDT(df, keep.rownames = TRUE), c("a", "b"), ",", 
      "long")[, toString(Reduce(union, lapply(.SD, na.omit))), rn]$V1
df
#   rn           a           b             output
#1:  1     ABC,SZY         SZY           ABC, SZY
#2:  2         XYZ     XYZ,IOP           XYZ, IOP
#3:  3        <NA>         QWE                QWE
#4:  4        <NA>        <NA>                   
#5:  5 ABC,BNM,JKL BNM,JKL,STU ABC, BNM, JKL, STU
#6:  6         DEF         DEF                DEF
#7:  7         XCV         HJK           XCV, HJK

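Or with tidyverse: after creating a row-name column, gather the data into 'long' format, split the 'val' rows at the delimiter with separate_rows, replace the NA elements with "", take the distinct rows based on the 'rn' and 'val' columns, group by 'rn', paste the strings together (str_c), and bind the result as a column to the original dataset.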

library(tidyverse)
rownames_to_column(df, 'rn') %>% 
   gather(key, val, -rn) %>% 
   separate_rows(val) %>%
   mutate(val = replace_na(val, "")) %>%
   distinct(rn, val) %>%
   group_by(rn) %>% 
   summarise(val = str_c(val, collapse=",")) %>% 
   select(-rn) %>% 
   bind_cols(df, .)

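Or using base R: split the columns at the delimiter , with strsplit, take the union of the corresponding list elements with Map, paste each result into a single string with toString, unlist the list into a vector, and assign it to create the 'output' column.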

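# Assumes a and b are character columns (strsplit() errors on factors).
# Note: union() keeps NA, so a missing cell shows up as the text "NA" in the result.
df$output <- unlist(do.call(Map, c(f = function(...) 
        toString(union(...)), unname(lapply(df, strsplit, ",")))))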
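
As a hedged variant of the Map approach, the sketch below spells out the two columns by name and drops missing values before pasting, so the result matches the outputs shown for the other answers. The variable names split_cols and merged are illustrative, and the columns are assumed to be character.

```r
a <- c("ABC,SZY", "XYZ", NA, NA, "ABC,BNM,JKL", "DEF", "XCV")
b <- c("SZY", "XYZ,IOP", "QWE", NA, "BNM,JKL,STU", "DEF", "HJK")
df <- data.frame(a, b, stringsAsFactors = FALSE)

# Split every cell of each column on commas: a list with one element per
# column, each holding one character vector per row.
split_cols <- lapply(df, strsplit, ",")

# Walk over the two columns in parallel, take the union of each pair of
# token vectors, drop NA, and collapse to a single string.
merged <- Map(function(x, y) toString(na.omit(union(x, y))),
              split_cols$a, split_cols$b)

df$output <- unlist(merged)
```

Rows where both cells are missing end up as an empty string, the same as in the apply-based answer.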