Splitting a column of a data.table

Time: 2014-09-20 18:56:24

Tags: r data.table reshape reshape2

I would like to know how to split a column of a data.table.

A column of a data.frame can be split easily with the following code:

df <- data.frame(Test=c("A - B", "C - D"))
df
#    Test
# 1 A - B
# 2 C - D
library(reshape2)
reshape2::colsplit(string = df[,1], pattern = " ", names = c("Var1", "Space", "Var2"))
#   Var1 Space Var2
# 1    A     -    B
# 2    C     -    D

But my attempt to split a column of a data.table fails:

library(data.table)
dt <- data.table(Test=c("A - B", "C - D"))
dt
#     Test
# 1: A - B
# 2: C - D
reshape2::colsplit(string = dt[,1, with=FALSE], pattern = " ", names = c("Var1", "Space", "Var2"))
# Error: String must be an atomic vector

1 answer:

Answer 0 (score: 1)

I see that you specifically asked for something using colsplit, but I would suggest looking at some alternatives, such as my cSplit function.

The cSplit approach looks like this:

library(splitstackshape)  # provides cSplit
setnames(cSplit(dt, "Test", " "), c("Var1", "Space", "Var2"))[]
#    Var1 Space Var2
# 1:    A     -    B
# 2:    C     -    D

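The trailing [] is there to print the result (setnames returns its value invisibly), but you can also store the result as a new data.table.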
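Storing the result instead of printing it might look like the following. This is only a minimal sketch: it assumes cSplit comes from the splitstackshape package, and the name res is hypothetical, chosen for illustration.

```r
library(data.table)
library(splitstackshape)  # assumed source of cSplit

dt <- data.table(Test = c("A - B", "C - D"))

# Keep the split result in a new object (the name `res` is illustrative)
res <- cSplit(dt, "Test", " ")

# setnames() renames the columns of `res` by reference, so no reassignment is needed
setnames(res, c("Var1", "Space", "Var2"))

# Print explicitly when the result is wanted on screen
print(res)
```

Because setnames works by reference, the renamed columns are visible in res without assigning its return value.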


How do the two compare in terms of efficiency?

fun1 <- function() {
  reshape2::colsplit(string = dt[[1]], pattern = " ",
                     names = c("Var1", "Space", "Var2"))
}
fun2 <- function() {
  setnames(cSplit(dt, "Test", " "),
           c("Var1", "Space", "Var2"))[]
}

dt <- rbindlist(replicate(5000, dt, FALSE))
dim(dt)
# [1] 10000     1

library(microbenchmark)
microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
#    expr        min         lq     median         uq        max neval
#  fun1() 2025.84703 2093.39687 2195.75822 2390.30666 2492.65946    10
#  fun2()   34.08966   36.01145   43.28036   47.45962   57.57615    10

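Why didn't your colsplit approach work as expected?

dt[,1, with=FALSE] is more like df[,1, drop = FALSE] (try it: you will get the same error as with your "data.table" attempt). Both return a one-column table rather than an atomic vector.

You need either of the following: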

> dt[[1]]
[1] "A - B" "C - D"
> dt$Test
[1] "A - B" "C - D"

These are the equivalent of your:

> df[, 1]
[1] A - B C - D
Levels: A - B C - D
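
The difference between the two kinds of subsetting can be checked directly. The sketch below is an illustration added alongside the answer, not part of it; it uses only is.atomic from base R plus the calls already shown above, and the name parts is hypothetical.

```r
library(data.table)
library(reshape2)

dt <- data.table(Test = c("A - B", "C - D"))
df <- data.frame(Test = c("A - B", "C - D"))

# One-column tables are lists, not atomic vectors, so colsplit rejects them
is.atomic(dt[, 1, with = FALSE])   # FALSE
is.atomic(df[, 1, drop = FALSE])   # FALSE

# Extracting the column itself yields an atomic vector
is.atomic(dt[[1]])                 # TRUE
is.atomic(dt$Test)                 # TRUE

# Passing the extracted vector lets colsplit run on the data.table column
# (the name `parts` is illustrative)
parts <- reshape2::colsplit(string = dt[[1]], pattern = " ",
                            names = c("Var1", "Space", "Var2"))
```

Note that colsplit returns a data.frame here, so the result would need to be converted if a data.table is wanted.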