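Remove duplicate strings within a row

Date: 2014-11-27 15:19:34

Tags: r duplicates collapse

The problem:

My data frame data1 contains a variable whose values each hold several comma-separated entries: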

data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"))

Now I want to remove the duplicate entries within each row. The result should look like this:

data1.final <- data.frame(v1 = c("test, bird", "bird", "car"))

I tried:

data1$ID <- 1:nrow(data1)
data1$v1 <- as.character(data1$v1)

data1 <- split(data1, data1$ID)
reduce.words <- function(x) {
  d <- unlist(strsplit(x$v1, split=" "))
  d <- paste(d[-which(duplicated(d))], collapse = ' ')
  x$v1 <- d 
  return(x)
}
data1 <- lapply(data1, reduce.words)
data1 <- as.data.frame(do.call(rbind, data1))

However, this produces empty rows for every row except the first. Does anyone have an idea how to fix this?

2 answers:

Answer 0 (score: 5)

You seem to have a rather complicated workflow. How about creating a simple function that works on a single row:

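reduce_row = function(i) {
  # Split the row's string into its individual entries
  parts = strsplit(i, split=", ")[[1]]
  # Keep the first occurrence of each entry and rejoin them
  paste(unique(parts), collapse = ", ")
}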

and then using apply:

data1$v2 = apply(data1, 1, reduce_row)

to get:

R> data1
                v1         v2
1 test, test, bird test, bird
2       bird, bird       bird
3              car        car
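
As for why the attempt in the question returns empty strings: a likely cause is the negative index `d[-which(duplicated(d))]`. When a row contains no duplicates, `which()` returns `integer(0)`, and indexing with `-integer(0)` selects nothing rather than everything. Below is a minimal sketch of that pitfall next to a logical index; the variable names are made up for illustration.

```r
# Splitting on " " leaves the commas attached, so only "test," repeats here.
d1 <- unlist(strsplit("test, test, bird", split = " "))
# "bird," and "bird" are different strings, so this row has no duplicates.
d2 <- unlist(strsplit("bird, bird", split = " "))

# Negative indexing only behaves as intended when a duplicate exists.
with_dup <- paste(d1[-which(duplicated(d1))], collapse = " ")
no_dup   <- paste(d2[-which(duplicated(d2))], collapse = " ")

# A logical index keeps every non-duplicated element in both cases.
safe_dup    <- paste(d1[!duplicated(d1)], collapse = " ")
safe_no_dup <- paste(d2[!duplicated(d2)], collapse = " ")
```

Here `no_dup` comes out as an empty string, while `safe_no_dup` keeps both entries. Splitting on ", " instead of " ", as `reduce_row` does, is still needed so that "bird," and "bird" compare as equal.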

Answer 1 (score: 3)

Another option, using cSplit from splitstackshape:

library(splitstackshape)
cSplit(cbind(data1, indx=1:nrow(data1)), 'v1', ', ', 'long')[,
        toString(v1[!duplicated(v1)]), 
                                  by=indx][,indx:=NULL][]
  #          V1
  #1: test, bird
  #2:       bird
  #3:        car

Or, as @Ananda Mahto mentioned in the comments:

 library(data.table)  # provides as.data.table() if it is not already attached
 unique(cSplit(as.data.table(data1, keep.rownames = TRUE),
                     "v1", ",", "long"))[, toString(v1), by = rn]

 #   rn         V1
 #1:  1 test, bird
 #2:  2       bird
 #3:  3        car
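
For comparison, the same result can probably be reached in base R without extra packages by working on the column directly. This is only a sketch: the helper name `dedupe_entries` and the separator pattern are assumptions for illustration and do not come from either answer. The pattern is meant to tolerate inconsistent spacing around the commas.

```r
# Hypothetical helper: drop repeated comma-separated entries in each string.
dedupe_entries <- function(x) {
  # Split every element on a comma plus any surrounding whitespace.
  parts <- strsplit(trimws(as.character(x)), split = "[[:space:]]*,[[:space:]]*")
  # Keep the first occurrence of each entry and glue the rest back together.
  vapply(parts, function(p) paste(unique(p), collapse = ", "), character(1))
}

data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"))
data1$v2 <- dedupe_entries(data1$v1)
```

Because the helper takes the column itself rather than whole rows, it does not depend on `v1` being the first column of the data frame, and `as.character()` covers the case where `v1` is stored as a factor.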