Merge strings across rows by id

Asked: 2015-03-26 20:30:31

Tags: regex r string merge

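I wish to merge strings across rows by an id variable. I know how to do that with the R code below. However, my code seems far more complex than it should be.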

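In the present case, each string has two elements that are not dots. Each pair of consecutive rows within an id has one element in common. So, after two rows are merged, only one copy of that shared element remains.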

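The data, the desired result, and R code that returns the desired result are shown below. Thank you for any suggestions. Sorry that my R code is so long and convoluted, but it does work. My goal is to obtain more efficient code in base R.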

my.data <- read.table(text = '
     id         my.string
      2    11..................
      2    .1...2..............
      2    .....2...3..........
      5    ....................
      6    ......2.....2.......
      6    ............2...4...
      7    .1...2..............
      7    .....2....3.........
      7    ..........3..3......
      7    .............34.....
      8    ....1.....1.........
      8    ..........12........
      8    ...........2....3...
      9    ..................44
     10    .2.......2..........
     11    ...2...2............
     11    .......2.....2......
     11    .............2...2..
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
my.data

desired.result <- read.table(text = '
     id         my.string
      2    11...2...3..........
      5    ....................
      6    ......2.....2...4...
      7    .1...2....3..34.....
      8    ....1.....12....3...
      9    ..................44
     10    .2.......2..........
     11    ...2...2.....2...2..
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)

# obtain position of first and last non-dot
# from: http://stackoverflow.com/questions/29229333/position-of-first-and-last-non-dot-in-a-string-with-regex

first.last.dot <- data.frame(my.data, do.call(rbind, gregexpr("^\\.*\\K[^.]|[^.](?=\\.*$)", my.data[,2], perl=TRUE)))

# obtain non-dot elements
first.last.dot$first.element <- as.numeric(substr(first.last.dot$my.string, first.last.dot$X1, first.last.dot$X1))
first.last.dot$last.element  <- as.numeric(substr(first.last.dot$my.string, first.last.dot$X2, first.last.dot$X2))

# obtain some book-keeping variables
first.last.dot$number.within.group <- sequence(rle(first.last.dot$id)$lengths)
most.records.per.id                <- max(first.last.dot$number.within.group)
n.ids                              <- length(unique(first.last.dot$id))

# create matrices for recording data
positions.per.id <- matrix(NA, nrow = (n.ids), ncol=(most.records.per.id+1))
values.per.id    <- matrix(NA, nrow = (n.ids), ncol=(most.records.per.id+1))

# use a for-loop to fill matrices with data
positions.per.id[1,1] = first.last.dot$X1[1]
   values.per.id[1,1] = first.last.dot$first.element[1]

positions.per.id[1,2] = first.last.dot$X2[1]
   values.per.id[1,2] = first.last.dot$last.element[1]

j = 1

for(i in 2:nrow(first.last.dot)) {

     if(first.last.dot$id[i] != first.last.dot$id[i-1]) j = j + 1

      positions.per.id[j, (first.last.dot$number.within.group[i]+0)] = first.last.dot$X1[i]
      positions.per.id[j, (first.last.dot$number.within.group[i]+1)] = first.last.dot$X2[i]

      values.per.id[j, (first.last.dot$number.within.group[i]+0)] = first.last.dot$first.element[i]
      values.per.id[j, (first.last.dot$number.within.group[i]+1)] = first.last.dot$last.element[i]
}

# convert matrix data into new strings using nested for-loops
new.strings <- matrix(0, nrow = nrow(positions.per.id), ncol = nchar(my.data$my.string[1]))

for(i in 1:nrow(positions.per.id)) {
     for(j in 1:ncol(positions.per.id)) {

          new.strings[i,positions.per.id[i,j]] <- values.per.id[i,j]
     }
}

# format new strings
new.strings[is.na(new.strings)] <- 0
new.strings[new.strings==0]     <- '.'

new.strings2 <- data.frame(id = unique(first.last.dot$id), my.string = (do.call(paste0, as.data.frame(new.strings))), stringsAsFactors = FALSE)
new.strings2

all.equal(desired.result, new.strings2)
# [1] TRUE

3 Answers:

Answer 0 (score: 2)

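Doing this in base R is a bit masochistic, so I won't, but with some perseverance you can do it yourself. Here is a data.table version (you'll need to install the latest 1.9.5 version from github to get tstrsplit):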

library(data.table)
dt = as.data.table(my.data) # or setDT to convert in place

dt[, paste0(lapply(tstrsplit(my.string, ""),
                   function(i) {
                     res = i[i != "."];
                     if (length(res) > 0)
                       res[1]
                     else
                       '.'
                   }), collapse = "")
   , by = id]
#   id                   V1
#1:  2 11...2...3..........
#2:  5 ....................
#3:  6 ......2.....2...4...
#4:  7 .1...2....3..34.....
#5:  8 ....1.....12....3...
#6:  9 ..................44
#7: 10 .2.......2..........
#8: 11 ...2...2.....2...2..
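
Since the question asked for base R, here is a minimal sketch of the same per-position idea without data.table. It is not the answerer's code: the helper name merge_by_position and the shortened 10-character toy strings are assumptions made for illustration. The strings of one id are stacked into a character matrix, and for each column the first non-dot character is kept.

```r
# Toy data (hypothetical, shortened versions of ids 2 and 5 from the question).
toy <- data.frame(
  id = c(2, 2, 2, 5),
  my.string = c("11........", ".1...2....", ".....2...3", ".........."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge all strings belonging to one id.
merge_by_position <- function(strings) {
  # One row per input string, one column per character position.
  chars <- do.call(rbind, strsplit(strings, ""))
  # For each position keep the first non-dot character, or "." if there is none.
  merged <- apply(chars, 2, function(col) {
    hits <- col[col != "."]
    if (length(hits) > 0) hits[1] else "."
  })
  paste0(merged, collapse = "")
}

# Split the strings by id and merge each group into a single string.
result <- vapply(split(toy$my.string, toy$id), merge_by_position, character(1))

merged.toy <- data.frame(id = as.numeric(names(result)),
                         my.string = unname(result),
                         stringsAsFactors = FALSE)
merged.toy
```

Like the data.table version, this assumes all strings have the same length and that overlapping rows agree wherever both are non-dot.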

Answer 1 (score: 2)

Dude, this was tough. Please don't make me explain what I did.

data.frame(id=unique(my.data$id), my.string=sapply(lapply(unique(my.data$id), function(id) gsub('^$','.',substr(gsub('\\.','',do.call(paste0,strsplit(my.data[my.data$id==id,'my.string'],''))),1,1)) ), function(x) paste0(x,collapse='') ), stringsAsFactors=FALSE )

OK, I'll explain it:

It starts with this lapply() call:

lapply(unique(my.data$id), function(id) ... )

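As you can see, the above basically iterates over the unique ids in the data.frame, processing each one in turn. Here are the contents of the function: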

gsub('^$','.',substr(gsub('\\.','',do.call(paste0,strsplit(my.data[my.data$id==id,'my.string'],''))),1,1))

Let's take that in pieces, starting with the innermost subexpression:

strsplit(my.data[my.data$id==id,'my.string'],'')

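The above indexes all my.string cells for the current id value and splits each string using strsplit(). This produces a list of character vectors, one list component per input string. Using the empty string as the delimiter makes every individual character of an input string its own element in the corresponding output vector.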

Here is an example of what the above expression generates (for id == 2):

[[1]]
 [1] "1" "1" "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "."

[[2]]
 [1] "." "1" "." "." "." "2" "." "." "." "." "." "." "." "." "." "." "." "." "." "."

[[3]]
 [1] "." "." "." "." "." "2" "." "." "." "3" "." "." "." "." "." "." "." "." "." "."

The above strsplit() call is wrapped in the following (with ... representing the previous expression):

do.call(paste0,...)

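This calls paste0() once, passing the output vectors generated by strsplit() as arguments. That pastes all the vectors together element-wise, so for each unique id you end up with a single vector like this: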

 [1] "1.." "11." "..." "..." "..." ".22" "..." "..." "..." "..3" "..." "..." "..." "..." "..." "..." "..." "..." "..." "..."
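
To make the do.call() step concrete, here is a small sketch on a made-up list (the name pieces and its values are assumptions for illustration only). It shows that do.call(paste0, ...) is the same as calling paste0() with each list component as a separate argument, so the characters at each position are glued together.

```r
# Hypothetical toy list: three "split strings" of three characters each.
pieces <- list(c("1", "1", "."),
               c(".", "1", "."),
               c(".", ".", "2"))

# do.call() spreads the list components out as separate arguments to paste0().
stacked <- do.call(paste0, pieces)

# Equivalent explicit call, written out by hand.
explicit <- paste0(pieces[[1]], pieces[[2]], pieces[[3]])

stacked
# [1] "1.." "11." "..2"
identical(stacked, explicit)
# [1] TRUE
```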

The above paste0() call is wrapped in the following:

gsub('\\.','',...)

That strips all literal dots from every element, which gives the following for each unique id:

 [1] "1"  "11" ""   ""   ""   "22" ""   ""   ""   "3"  ""   ""   ""   ""   ""   ""   ""   ""   ""   ""

The above gsub() call is wrapped in the following:

substr(...,1,1)

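That extracts just the first character of each element, which, if it exists, is the desired character for that position. Empty elements are acceptable, because they simply mean that the id had no non-dot character at that position in any of its input strings.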

The above substr() call is wrapped in the following:

gsub('^$','.',...)

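This simply replaces the empty elements with a literal dot, which is obviously necessary before we put the string back together. So we have, for id == 2: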

 [1] "1" "1" "." "." "." "2" "." "." "." "3" "." "." "." "." "." "." "." "." "." "."
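
The nested one-liner may be easier to follow when each step is assigned to its own variable. Below is a sketch of the same pipeline for a single id, using shortened 10-character toy strings instead of the 20-character ones from the question. The variable names are assumptions for illustration.

```r
# Hypothetical shortened rows for one id.
rows <- c("11........", ".1...2....", ".....2...3")

split_chars <- strsplit(rows, "")             # list of single-character vectors
stacked     <- do.call(paste0, split_chars)   # one element per position, e.g. ".22"
no_dots     <- gsub("\\.", "", stacked)       # drop the dots, e.g. "22" or ""
first_char  <- substr(no_dots, 1, 1)          # keep one character per position
filled      <- gsub("^$", ".", first_char)    # empty positions become "."
merged      <- paste0(filled, collapse = "")  # glue the positions back together

merged
# [1] "11...2...3"
```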

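That completes the function passed to the lapply() call. What comes out of that call is therefore a list of character vectors representing the desired output strings. All that remains is to collapse the elements of each vector back into a single string, which is why we need this: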

sapply(..., function(x) paste0(x,collapse='') )

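Using sapply() (simplify-apply) is appropriate because it automatically combines all the desired output strings into a single character vector, rather than leaving them as a list: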

[1] "11...2...3.........." "...................." "......2.....2...4..." ".1...2....3..34....." "....1.....12....3..." "..................44" ".2.......2.........." "...2...2.....2...2.."
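
As a small illustration of that difference, the sketch below collapses two made-up character vectors with lapply() and with sapply(). The names and values are assumptions for illustration. vapply() is shown as well because it states the expected result type explicitly, which some people prefer in non-interactive code.

```r
# Hypothetical toy input: two already-merged character vectors.
char_vectors <- list(c("1", ".", "2"), c(".", ".", "."))

collapse <- function(x) paste0(x, collapse = "")

as_list   <- lapply(char_vectors, collapse)                # stays a list
as_vector <- sapply(char_vectors, collapse)                # simplified to a character vector
as_strict <- vapply(char_vectors, collapse, character(1))  # same, with a type check

is.list(as_list)
# [1] TRUE
as_vector
# [1] "1.2" "..."
```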

Thus, all that remains is to build the full output data.frame, similar to the input data.frame:

data.frame(id=unique(my.data$id), my.string=..., stringsAsFactors=FALSE )

Resulting in:

  id            my.string
1  2 11...2...3..........
2  5 ....................
3  6 ......2.....2...4...
4  7 .1...2....3..34.....
5  8 ....1.....12....3...
6  9 ..................44
7 10 .2.......2..........
8 11 ...2...2.....2...2..

And we're done!

Answer 2 (score: 2)

You could use functions from the stringi and dplyr packages:

library(stringi)
library(dplyr)

# split my.string
m <- stri_split_boundaries(my.data$my.string, type = "character", simplify = TRUE)

df <- data.frame(id = my.data$id, m)

# function to apply to each column - select "." or unique "number"
myfun <- function(x) if(all(x == ".")) "." else unique(x[x != "."])


df %>%
  # for each id...
  group_by(id) %>%

  # ...and each column, apply function
  summarise_each(funs(myfun)) %>%

  # for each row...
  rowwise() %>%

  # ...concatenate strings
  do(data.frame(id = .[1], mystring = paste(.[-1], collapse = "")))

#   id             mystring
# 1  2 11...2...3..........
# 2  5 ....................
# 3  6 ......2.....2...4...
# 4  7 .1...2....3..34.....
# 5  8 ....1.....12....3...
# 6  9 ..................44
# 7 10 .2.......2..........
# 8 11 ...2...2.....2...2..
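
To see what the per-column function does on its own, here is a short base R sketch that calls myfun (copied from above) on a few made-up columns. The input vectors are assumptions for illustration. Note the last case: if two rows of the same id held different non-dot characters at one position, unique() would return more than one value. That does not happen in the question's data, where overlapping rows always agree.

```r
# Same function as in the answer above.
myfun <- function(x) if (all(x == ".")) "." else unique(x[x != "."])

all_dots  <- myfun(c(".", ".", "."))  # no non-dot character at this position
agreeing  <- myfun(c(".", "2", "2"))  # rows agree, a single value is returned
conflict  <- myfun(c("1", "2", "."))  # hypothetical conflict, two values come back

all_dots
# [1] "."
agreeing
# [1] "2"
length(conflict)
# [1] 2
```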