Question

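I have a large data.table (9 M rows) with two columns: fcombined and value. fcombined is a factor, but it is actually the result of the interaction of two factors. The question is: what is the most efficient way to split that one factor column back into two? I have already come up with a solution that works fine, but maybe there is a more direct way that I am missing. A working example: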

library(data.table)
library(stringr)
f1=1:20
f2=1:20
g=expand.grid(f1,f2)
combinedfactor=as.factor(paste(g$Var1,g$Var2,sep="_"))
largedata=1:10^6
DT=data.table(fcombined=combinedfactor,value=largedata)


splitfactorcol=function(res,colname,splitby="_",namesofnewcols){#the nr. of cols retained is length(namesofnewcols)
  helptable=data.table(.factid=seq_along(levels(res[[colname]])) ,str_split_fixed(levels(res[[colname]]),splitby,length(namesofnewcols)))
  setnames(helptable,colnames(helptable),c(".factid",namesofnewcols))
  setkey(helptable,.factid)
  res$.factid=unclass(res[[colname]])
  setkey(res,.factid)
  m=merge(res,helptable)
  m$.factid=NULL
  m
}
splitfactorcol(DT,"fcombined",splitby="_",c("f1","f2"))

Answer 1

I think this does the trick, and it is about 5x faster.

setkey(DT, fcombined)
DT[DT[, data.table(fcombined = levels(fcombined),
                   do.call(rbind, strsplit(levels(fcombined), "_")))]]

I split the levels, then simply merge that result back into the original data.table.
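A variant of the same idea, offered as a sketch rather than something benchmarked here: split the levels once with data.table's tstrsplit, then index the pieces by the factor's integer codes. That adds the new columns by reference and skips the keyed join. The small DT and the names parts, codes, f1 and f2 are illustrative assumptions.

```r
library(data.table)

# Small illustrative table; the real one would have millions of rows.
DT <- data.table(fcombined = factor(c("1_2", "3_4", "1_2", "10_7")),
                 value = 1:4)

# Split each level once: a list with one character vector per part.
parts <- tstrsplit(levels(DT$fcombined), "_", fixed = TRUE)

# Integer codes map every row to its level.
codes <- as.integer(DT$fcombined)

# Expand the split levels to all rows and add them as factor columns by reference.
DT[, c("f1", "f2") := lapply(parts, function(p) factor(p[codes]))]
```

Because only the levels are split, the string work scales with the number of distinct levels rather than the number of rows.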

By the way, in my tests strsplit was about 2x faster than the stringr function (for this task).
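To check that comparison on your own machine, something like the following should work. It is only a rough sketch: the input vector lv is made up for the test, and timings will vary with the data and with package versions.

```r
library(stringr)

# Hypothetical test input: many "a_b" style labels.
lv <- paste(sample(1000, 1e5, replace = TRUE),
            sample(1000, 1e5, replace = TRUE), sep = "_")

# Base R: split, then bind the pieces into a two-column matrix.
system.time(m_base <- do.call(rbind, strsplit(lv, "_")))

# stringr: split directly into a fixed number of columns.
system.time(m_str <- str_split_fixed(lv, "_", 2))
```

Both calls should give a two-column character matrix with the same contents, so only the elapsed times differ.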
