Merge nearly identical rows, filtering out NA and shorter strings

Time: 2018-03-21 22:24:45

Tags: r dataframe row

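I have some nearly identical rows in a data frame (see the example below). What makes them related is a set of key variables, here sel1 and sel2. The other variables, var1 and var2, must be merged under the following conditions: 1. discard NA, or 2. discard the shorter string (var2 in the example). So far I have managed to drop the NAs, but I have not found a way to drop the shorter strings at the same time. The strings are complex and may contain commas, spaces, and several kinds of characters.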

df <- read.table(text = 
            "  sel1 sel2 var1    var2
1   pseudorepeated1   x    NA    \"longer string\"   # keep longer string instead of shortstring
2   pseudorepeated1   x    2     \"short string\"    # keep 2 instead of NA
3   pseudorepeated2   y    NA    \"longer string 2\" # keep longer string 2
4   pseudorepeated2   y    4     \"short string2\"   # keep 4
5                 3   x    gs    as
6                 4   y    fg    df
7                 5   x    eg    af
8                 6   y    df    fd", header = TRUE, stringsAsFactors=F)
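df
# Replace NA with empty strings so that aggregate() does not drop those rows
df[is.na(df)] <- ""
# Collect the unique values of every other column per (sel1, sel2) group
df2 <- aggregate(. ~ sel1 + sel2, data = df, FUN = function(X) paste(unique(X)))
# Collapse a vector into one string, skipping NA, "" and the literal "NA"
paste_noNA <- function(x, sep = ", ")
  gsub(", ", sep, toString(x[!is.na(x) & x != "" & x != "NA"]))
df3 <- as.data.frame(lapply(df2, function(X) unlist(lapply(X, function(x) paste_noNA(x)))),
                     stringsAsFactors = FALSE)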

The expected output is the table below without the ", short string" text.

df3
               sel1 sel2 var1                        var2
1.1               3    x   gs                          as
1.3               5    x   eg                          af
1.5 pseudorepeated1    x    2 longer string, short string# only longer string desired
2.2               4    y   fg                          df
2.4               6    y   df                          fd
2.6 pseudorepeated2    y    4 longer string 2, short string2# only longer string 2 desired

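1 answer:

Answer 0 (score: 2)

Group by sel1 and sel2, drop the NA values in var1, and replace the shorter strings in var2 with the longer one. Finally, remove the duplicate rows that result.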

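library('data.table')
setDT(df)  # convert df to a data.table by reference
# Within each (sel1, sel2) group, keep the longest var2 and the non-NA var1
df[, `:=` ( var2 = { temp <- nchar(var2); var2[ temp == max(temp) ] },
            var1 = na.omit(var1)),
   by = .(sel1, sel2)]
df[ !duplicated( df ), ]  # drop the rows that are now identical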

#               sel1 sel2 var1         var2
# 1: pseudorepeated1    x    2 longerstring
# 2: pseudorepeated2    y    4 longerstring
# 3:               3    x   gs           as
# 4:               4    y   fg           df
# 5:               5    x   eg           af
# 6:               6    y   df           fd
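The per-group rule used above (drop NA, then keep the longest string) can also be written as a small standalone function, which makes the edge cases easier to see. This is only a sketch: `pick_longest` is a hypothetical helper name, not part of the original answer, and the choices to keep the first value on a tie and to return NA when nothing is left are assumptions.

```r
# Hypothetical helper (not from the original answer): return the longest
# non-missing, non-empty string in x, or NA if nothing is left.
pick_longest <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x) & nzchar(x)]              # drop NA and empty strings
  if (length(x) == 0L) return(NA_character_)
  x[which.max(nchar(x))]                     # which.max() picks the first maximum on ties
}

pick_longest(c(NA, "2"))                          # the non-NA value
pick_longest(c("longer string", "short string"))  # the longer of the two
pick_longest(c("ab", "cd"))                       # tie: the first value
pick_longest(c(NA, NA))                           # nothing left: NA
```

Because the function always returns exactly one value, it avoids the length mismatch that can occur when a group contains several strings of the same maximal length.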

Edit: for many columns

Data:

df <- read.table(text = 
                   "  sel1 sel2 var1    var2
                 1   pseudorepeated1   x    NA    longerstring   # keep longerstring instead of shortstring
                 2   pseudorepeated1   x    2     shortstring    # keep 2 instead of NA
                 3   pseudorepeated2   y    NA    longerstring   # same as above
                 4   pseudorepeated2   y    4     shortstring    # same as above
                 5                 3   x    gs    as
                 6                 4   y    fg    df
                 7                 5   x    eg    af
                 8                 6   y    df    fd", header = TRUE, stringsAsFactors=F)

library('data.table')
setDT(df)
df$var3 <- df$var2
df$var4 <- df$var2

Code:

for( nm in c( "var1", "var2", "var3", "var4") ){
  # drop NA first, then keep the longest remaining value in each group
  df[,  (nm) := { temp <- na.omit(get(nm)); temp[ nchar(temp) == max(nchar(temp)) ] },
     by = .(sel1, sel2)]
}
df[ !duplicated( df ), ]

Output:

#               sel1 sel2 var1         var2         var3         var4
# 1: pseudorepeated1    x    2 longerstring longerstring longerstring
# 2: pseudorepeated2    y    4 longerstring longerstring longerstring
# 3:               3    x   gs           as           as           as
# 4:               4    y   fg           df           df           df
# 5:               5    x   eg           af           af           af
# 6:               6    y   df           fd           fd           fd

Edit 2: avoid the for loop by using .SDcols and a variable that holds the column names

col_nm <- c( "var1", "var2", "var3", "var4")

# apply the same rule to every column listed in .SDcols, one group at a time
df[,  (col_nm) := lapply( .SD, function(x) { 
  temp <- na.omit(x)
  temp[ nchar(temp) == max(nchar(temp)) ] } ),
  by = .(sel1, sel2), 
  .SDcols = col_nm ]  

df[ !duplicated( df ), ]
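As a possible variation on the approach above, the groups can be collapsed in a single step by returning one value per column from `lapply(.SD, ...)` without `:=`, so no separate de-duplication pass is needed. The sketch below rebuilds the question's data inline; `longest_non_na` is a hypothetical helper name, and keeping the first value on a tie is an assumption rather than something the original answer specifies.

```r
library(data.table)

# The question's example data, rebuilt inline so the sketch is self-contained
dt <- data.table(
  sel1 = c("pseudorepeated1", "pseudorepeated1",
           "pseudorepeated2", "pseudorepeated2", "3", "4", "5", "6"),
  sel2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
  var1 = c(NA, "2", NA, "4", "gs", "fg", "eg", "df"),
  var2 = c("longer string", "short string",
           "longer string 2", "short string2", "as", "df", "af", "fd")
)
col_nm <- c("var1", "var2")

# Hypothetical helper: longest non-NA string, first one on ties, NA if none
longest_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) return(NA_character_)
  x[which.max(nchar(x))]
}

# One row per (sel1, sel2) group, one value per column in .SDcols
res <- dt[, lapply(.SD, longest_non_na), by = .(sel1, sel2), .SDcols = col_nm]
res
```

Unlike the `:=` version, this returns a new table and leaves `dt` unchanged, which may or may not be what is wanted.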