使用代码范围转换data.table

时间:2015-12-24 00:32:30

标签: r data.table

我有一个数据集,其中包含代码和标题范围:

library(data.table)
dataset <- data.table(
    start = c("A00", "A20", "C10", "F00"),
    end = c("A09", "A35", "C19", "F15"),
    title = c("title1", "title2", "title3", "title4"))

#>    start end  title
#> 1:   A00 A09 title1
#> 2:   A20 A35 title2
#> 3:   C10 C19 title3
#> 4:   F00 F20 title4

期望的结果:

#>     code  title
#>  1:  A00 title1
#>  2:  A01 title1
#>  3:  A02 title1
#>  4:  A03 title1
#>  5:  A04 title1
#>  6:  A05 title1
#>  7:  A06 title1
#>  8:  A07 title1
#>  9:  A08 title1
#> 10:  A09 title1
#> 11:  A20 title2
#> 12:  A21 title2
#> 13:  A22 title2
#> 14:  A23 title2
#> 15:  A24 title2
#> 16:  A25 title2
#> 17:  A26 title2
#> 18:  A27 title2
#> 19:  A28 title2
#> 20:  A29 title2
#> 21:  A30 title2
#> 22:  A31 title2
#> 23:  A32 title2
#> 24:  A33 title2
#> 25:  A34 title2
#> 26:  A35 title2
#> 27:  C10 title3
#> 28:  C11 title3
#> 29:  C12 title3
#> 30:  C13 title3
#> 31:  C14 title3
#> 32:  C15 title3
#> 33:  C16 title3
#> 34:  C17 title3
#> 35:  C18 title3
#> 36:  C19 title3
#> 37:  F00 title4
#> 38:  F01 title4
#> 39:  F02 title4
#> 40:  F03 title4
#> 41:  F04 title4
#> 42:  F05 title4
#> 43:  F06 title4
#> 44:  F07 title4
#> 45:  F08 title4
#> 46:  F09 title4
#> 47:  F10 title4
#> 48:  F11 title4
#> 49:  F12 title4
#> 50:  F13 title4
#> 51:  F14 title4
#> 52:  F15 title4
#>     code  title

我目前的解决方案是:

seq_code <- function(start, end) {
    letter <- substr(start, 1, 1)
    start <- substr(start, 2, 3)
    end <- substr(end, 2, 3)
    paste0(letter, sprintf("%.2d", start:end))
}

rbindlist(lapply(1:nrow(dataset), function(i) {
    dataset[i, list(code = seq_code(start, end), title = title)]
}))

是否有更优雅,更快速的解决方案?

UPD :结果我找到了基于@MichaelChirico建议的解决方案。

dataset[, list(code = seq_code(start, end)), by = title]

Bencmhark:

microbenchmark::microbenchmark(
    lapply = rbindlist(lapply(1:nrow(dataset), function(i) {
        dataset[i, list(code = seq_code(start, end), title = title)]
    })),
    by = dataset[, list(code = seq_code(start, end)), by = title]
)

#> Unit: microseconds
#> expr      min       lq      mean    median        uq      max neval cld
#> lapply 2024.874 2065.387 2166.9491 2085.2535 2149.1420 4979.722   100   b
#>     by  486.404  510.853  531.5532  519.6025  536.6735  821.413   100  a 

1 个答案:

答案 0 :(得分:5)

够好吗? (不完全是因为列被颠倒了,但setcolorder将完成这项工作,如果这是至关重要的)

dataset[ ,
        {x <- substr(start, 1, 1)
        s <- as.integer(substr(start, 2, 3))
        e <- as.integer(substr(end, 2, 3))
        .(code=paste0(x, sprintf("%02d", s:e)))}, by = title]

这里有一个我认为看起来更好的替代方案,但结果却非常混乱:

dataset[,
        {x<-unlist(lapply(
          .SD,tstrsplit,split="(?<=[[:alpha:]])",perl=T))
        .(code=paste0(
          x[1],sprintf("%02d",do.call(
            "seq",as.list(as.integer(x[c(2,4)]))))))},
        by=title]