Splitting a column in a data.table into multiple rows

Time: 2016-01-11 01:39:59

Tags: r data.table

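I often have a table in which a single cell may contain several values (separated by some delimiter character), and I need to split such records into separate rows. For example, the result of the split should look like this: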

#    V1 V2 V3
# 1:  x  b  1
# 2:  x  c  1
# 3:  x  d  1
# 4:  y  d  2
# 5:  y ef  2
# 6:  z  d  3
# 7:  z ef  3
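A minimal sketch of an input table that would give this result. The values are taken from the data.frame used in the answer, and the name dt1 matches the call to splitcol2rows in the question; treat it as an assumed reconstruction rather than the asker's exact data.

```r
library(data.table)

# Assumed input: one row per V1 value, with V2 holding ';'-separated values.
dt1 <- data.table(
  V1 = c("x", "y", "z"),
  V2 = c("b;c;d", "d;ef", "d;ef"),
  V3 = 1:3
)
print(dt1)
```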

So far I have written the following function:

# I omit all error-checking code here and assume that
# dtInput   is a valid data.table and
# col2split is a name of existing column
splitcol2rows <- function(dtInput, col2split, sep){
  ori.names <- names(dtInput); # save original order of columns
  ori.keys  <-   key(dtInput); # save original keys

  # create new table with 2 columns:
  # one is original "un-splitted" column (will be later used as a key)
  # and second one is result of strsplit:
  dt.split <- dtInput[, 
                    .(tmp.add.col=rep(unlist(strsplit(get(col2split),sep,T)), .N)),
                    by=col2split]
  dt.split <- unique(dt.split, by=NULL);

  # now use that column as a key:
  setkeyv(dt.split, col2split)
  setkeyv(dtInput, col2split)
  dtInput <- dt.split[dtInput, allow.cartesian=TRUE];

  # leave only 'splitted' column
  dtInput[, c(col2split):=NULL]; 
  setnames(dtInput, 'tmp.add.col', col2split); 

  # restore original columns order and keys
  setcolorder(dtInput, ori.names);
  setkeyv(dtInput, ori.keys);

  return(dtInput);
}

It can be called on the input table dt1 like this:

splitcol2rows(dt1, 'V2', ';')[]

It works fine (check the example output with splitcol2rows(dt1, 'V2', ';')[]), but I'm sure this solution is far from optimal, and I would be grateful for any advice. For example, I looked at the solution proposed by Matt in his answer to the question "Applying a function to each row of a data.table", and I like that it manages without creating an intermediate table (my dt.split), but in my case I need to keep all the other columns, and I don't see how to do that otherwise.

UPD. First, starting from the solution proposed by @RichardScriven, I rewrote my function so that it became shorter and easier to read:

splitcol2rows_mget <- function(dtInput, col2split, sep){
  dtInput <- dtInput[, .(tmp.add.col = unlist(strsplit(get(col2split),sep,T))), by=names(dtInput)]

  dtInput[, c(col2split):=NULL];
  setnames(dtInput, 'tmp.add.col', col2split); 
  return(dtInput);
}

It still has some ugly parts, such as the intermediate 'tmp.add.col' column, which may cause a conflict if a column with that name already exists in the original table. Moreover, this shorter solution turned out to run slower than my first code. And both of them are slower than cSplit() from the splitstackshape package:

require('microbenchmark')
require('splitstackshape')

splitMy1 <- function(input){return(splitcol2rows(input, col2split = 'V2', sep = ';'))}
splitMy2 <- function(input){return(splitcol2rows_mget(input, col2split = 'V2', sep = ';'))}
splitSH  <- function(input){return(cSplit(input, splitCols = 'V2', sep = ';', direction = 'long'))}

# Smaller table, 100 repeats:
set.seed(1)
num.rows <- 1e4;
dt1 <- data.table(V1=seq_len(num.rows),
                  V2=replicate(num.rows,paste0(sample(letters, runif(1,1,6), T), collapse = ";")),
                  V3=rnorm(num.rows))
print(microbenchmark(splitMy1(dt1), splitMy2(dt1), splitSH(dt1), times=100L))
#Unit: milliseconds
#          expr      min       lq     mean   median       uq       max neval
# splitMy1(dt1) 56.34475 58.53842 68.11128 62.51419 79.79727  98.96797   100
# splitMy2(dt1) 61.84215 64.59619 76.41503 69.02970 88.49229 132.43679   100
#  splitSH(dt1) 31.29671 33.14389 38.28108 34.91696 39.31291  83.58625   100

# Bigger table, 1 repeat:
set.seed(1)
num.rows <- 5e5;
dt1 <- data.table(V1=seq_len(num.rows),
                  V2=replicate(num.rows,paste0(sample(letters, runif(1,1,6), T), collapse = ";")),
                  V3=rnorm(num.rows))
print(microbenchmark(splitMy1(dt1), splitMy2(dt1), splitSH(dt1), times=1L))
#Unit: seconds
#          expr      min       lq     mean   median       uq      max neval
# splitMy1(dt1) 2.955825 2.955825 2.955825 2.955825 2.955825 2.955825     1
# splitMy2(dt1) 3.693612 3.693612 3.693612 3.693612 3.693612 3.693612     1
#  splitSH(dt1) 1.990201 1.990201 1.990201 1.990201 1.990201 1.990201     1
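One way to avoid a temporary column such as 'tmp.add.col' altogether is to repeat each row once per split piece and then overwrite the target column in place. The following is only a sketch under assumptions: the function name split_rows and the sample table dt are hypothetical, the column to split is assumed to be of type character, and no claim is made about its speed relative to the functions in the question.

```r
library(data.table)

# Hypothetical helper (not from the original post): split `col` on `sep`
# and expand the table to one row per piece, without a temporary column.
split_rows <- function(dt, col, sep) {
  # List with one character vector of pieces per input row.
  parts <- strsplit(dt[[col]], sep, fixed = TRUE)
  # Row index repeated once per piece; subsetting returns a new table,
  # so the caller's table is left untouched.
  idx <- rep(seq_len(nrow(dt)), lengths(parts))
  out <- dt[idx]
  # Overwrite the split column with the flattened pieces.
  set(out, j = col, value = unlist(parts))
  out[]
}

# Assumed sample data, same values as the data.frame in the answer.
dt <- data.table(
  V1 = c("x", "y", "z"),
  V2 = c("b;c;d", "d;ef", "d;ef"),
  V3 = 1:3
)
res <- split_rows(dt, "V2", ";")
print(res)
```

Because the row order and column order of the subset are those of the input, nothing needs to be restored afterwards. Note that a row whose cell is an empty string produces no pieces and is therefore dropped by this sketch.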

1 Answer:

Answer 0: (score: 4)

The cSplit function from the splitstackshape package is a great fit for this task. Just pass ";" as the separator and "long" as the direction to get what we need.

> library(splitstackshape)
> dat <- data.frame(V1 = c("x", "y", "z"), V2 = c("b;c;d", "d;ef", "d;ef"), V3 = 1:3, stringsAsFactors = FALSE)
> cSplit(dat, "V2", sep = ";", direction = "long")
#   V1 V2 V3
# 1:  x  b  1
# 2:  x  c  1
# 3:  x  d  1
# 4:  y  d  2
# 5:  y ef  2
# 6:  z  d  3
# 7:  z ef  3