Efficiently splitting a data.frame

Time: 2016-03-29 18:54:04

Tags: r split dataframe apply

I have a data.frame:

set.seed(1)
n=20
df <- data.frame(s1 = paste(sample(0:3, n, replace = TRUE),sample(0:3, n, replace = TRUE),sep="/"),
                  s2 = paste(sample(0:3, n, replace = TRUE),sample(0:3, n, replace = TRUE),sep="/"),
                  s3 = paste(sample(0:3, n, replace = TRUE),sample(0:3, n, replace = TRUE),sep="/"),
                  stringsAsFactors = FALSE)

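In practice, there are about 1,000 columns and about 1,000,000 rows.

What is an efficient way to split every field on the "/" character, producing two data.frames?

Here is one way, using mclapply: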

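library(parallel)
# For each row, split every cell on "/" and return the first halves followed by the second halves
split.mat = do.call(rbind,mclapply(1:nrow(df), function(x) {
  mat = sapply(df[x,1:ncol(df)], function(y) strsplit(y, split = "\\/")[[1]])
  return(c(mat[1,],mat[2,]))
}, mc.cores = 10))  # mc.cores sets the number of worker processes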

But I wonder whether there is a more efficient way.

2 answers:

Answer 0: (score: 3)

Here is a slightly odd approach:

library(data.table)
fwrite(df, sep = "/", quote = FALSE,
       col.names = FALSE, file = "df.txt")

NN <- 2L*ncol(df)

DT1 <- fread("df.txt", sep = "/", select = seq(from = 1L, to = NN, by = 2L))
DT2 <- fread("df.txt", sep = "/", select = seq(from = 2L, to = NN, by = 2L))
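
Why the round trip works: writing the table with "/" as the field separator turns each row into one long "/"-delimited record, so a cell such as 0/1 becomes two adjacent fields. Reading that back with sep = "/" gives 2 * ncol(df) columns, where the odd-numbered ones hold the first halves and the even-numbered ones the second halves. Below is a minimal in-memory sketch of the same idea in base R. The tiny data frame and the names `wide`, `first` and `second` are made up for illustration; this is not the answer's code and has not been timed on data of the question's size.

```r
# Tiny illustrative data frame (hypothetical values, same "a/b" cell format).
df <- data.frame(s1 = c("0/1", "2/3"),
                 s2 = c("3/0", "1/1"),
                 s3 = c("2/2", "0/3"),
                 stringsAsFactors = FALSE)

# Join the cells of each row with "/" so every row becomes one flat record,
# e.g. "0/1/3/0/2/2". unname() keeps column names out of paste()'s arguments.
records <- do.call(paste, c(unname(df), sep = "/"))

# Parse the records with "/" as the separator: 2 * ncol(df) integer columns.
wide <- read.table(text = records, sep = "/", header = FALSE)

# Odd positions are the first halves, even positions the second halves.
odd  <- seq(from = 1L, to = ncol(wide), by = 2L)
even <- seq(from = 2L, to = ncol(wide), by = 2L)

first  <- setNames(wide[, odd,  drop = FALSE], names(df))
second <- setNames(wide[, even, drop = FALSE], names(df))
```

Here `first` and `second` are data frames with the same shape and column names as `df`. The fwrite/fread version above does the same thing through a file on disk, which lets data.table's multi-threaded reader do the parsing.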

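Answer 1: (score: 0)

Suggestion: use stri_split_fixed. Some benchmarks are shown below. (The code assumes you read the data in as a matrix, convert it to a character vector, split on '/', and then reshape with matrix(prevOutput, nrow = origNrow, ncol = 2 * origNcol).)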

options(stringsAsFactors = FALSE)
library(rbenchmark)
library(stringi)
library(tidyr)

set.seed(1)
ncols <- 1
nrows <- 10*1000
strdat <- paste(sample(0:3, nrows*ncols, replace = TRUE),
    sample(0:3, nrows*ncols, replace = TRUE), sep = "/")

benchmark(strsplitMtd = lapply(strdat, function(x) strsplit(x, "/")[[1]]),
    striMtd = stri_list2matrix(stri_split_fixed(strdat, "/"), byrow = TRUE),
    tidyrMtd = separate(data.frame(S = strdat), S, c("S1", "S2"), "/"))
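
The benchmark only times the split itself. The matrix-based pipeline described in the parenthetical (matrix, then character vector, then split, then reshape) might look roughly like the sketch below. It assumes every cell contains exactly one "/" and that both halves are integers; the helper `make_col` and the names `parts`, `first`, `second`, `df1` and `df2` are illustrative, and the result is stored as two matrices rather than the single 2 * origNcol-wide matrix the answer mentions.

```r
library(stringi)

# Example data in the same shape as the question's data.frame.
set.seed(1)
n <- 20
make_col <- function() paste(sample(0:3, n, replace = TRUE),
                             sample(0:3, n, replace = TRUE), sep = "/")
df <- data.frame(s1 = make_col(), s2 = make_col(), s3 = make_col(),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)  # character matrix, nrow(df) x ncol(df)

# Split all cells in one vectorised call. as.vector() flattens the matrix in
# column-major order; simplify = TRUE returns a character matrix with one row
# per cell and one column per piece.
parts <- stri_split_fixed(as.vector(m), "/", simplify = TRUE)

# Reshape each piece back to the original dimensions (assumes integer halves).
first  <- matrix(as.integer(parts[, 1]), nrow = nrow(m), dimnames = dimnames(m))
second <- matrix(as.integer(parts[, 2]), nrow = nrow(m), dimnames = dimnames(m))

df1 <- as.data.frame(first)
df2 <- as.data.frame(second)
```

Because the split is done once over the whole flattened vector, there is no per-row or per-column R-level loop. Whether this beats the other approaches at 1,000 columns by 1,000,000 rows would need to be measured.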

Let me know if you need more, or if I have breached any etiquette.