I have a data.frame:
set.seed(1)
n=20
df <- data.frame(s1 = paste(sample(0:3, n, replace = TRUE),sample(0:3, n, replace = TRUE),sep="/"),
s2 = paste(sample(0:3, n, replace = TRUE),sample(0:3, n, replace = TRUE),sep="/"),
s3 = paste(sample(0:3, n, replace = TRUE),sample(0:3, n, replace = TRUE),sep="/"),
stringsAsFactors = FALSE)
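In practice, it has about 1,000 columns and about 1,000,000 rows.
What is an efficient way to split every field on the "/" character, producing two data.frames (one holding the parts before the "/" and one holding the parts after it)?
Here is one way, using mclapply: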
library(parallel)
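# Process each row in parallel; every worker returns the left-hand parts
# of all columns followed by the right-hand parts.
split.mat = do.call(rbind,mclapply(1:nrow(df), function(x) {
mat = sapply(df[x,1:ncol(df)], function(y) strsplit(y, split = "\\/")[[1]])
return(c(mat[1,],mat[2,]))
}, mc.cores = 10))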
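But I would like to know whether there is a more efficient approach.
Answer 0 (score: 3)
Here is a slightly unorthodox approach: write the data out with "/" as the field separator, then read it back so that every half of every field becomes its own column.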
library(data.table)
fwrite(df, sep = "/", quote = FALSE,
col.names = FALSE, file = "df.txt")
NN <- 2L*ncol(df)
DT1 <- fread("df.txt", sep = "/", select = seq(from = 1L, to = NN, by = 2L))
DT2 <- fread("df.txt", sep = "/", select = seq(from = 2L, to = NN, by = 2L))
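The round trip above works because each field already contains exactly one "/", so writing with sep = "/" produces 2 * ncol(df) separated values per line; the odd positions are the left-hand parts and the even positions are the right-hand parts. A minimal sketch of how the result could be checked and relabelled follows. The temporary file, the explicit header = FALSE, and the setnames calls are my additions for illustration and are not part of the answer.

```r
library(data.table)

set.seed(1)
n <- 20
df <- data.frame(
  s1 = paste(sample(0:3, n, replace = TRUE), sample(0:3, n, replace = TRUE), sep = "/"),
  s2 = paste(sample(0:3, n, replace = TRUE), sample(0:3, n, replace = TRUE), sep = "/"),
  s3 = paste(sample(0:3, n, replace = TRUE), sample(0:3, n, replace = TRUE), sep = "/"),
  stringsAsFactors = FALSE
)

# Assumption: a temporary file is used here instead of "df.txt".
tmp <- tempfile(fileext = ".txt")
fwrite(df, file = tmp, sep = "/", quote = FALSE, col.names = FALSE)

NN <- 2L * ncol(df)
# Odd positions hold the left-hand parts, even positions the right-hand parts.
DT1 <- fread(tmp, sep = "/", header = FALSE, select = seq(from = 1L, to = NN, by = 2L))
DT2 <- fread(tmp, sep = "/", header = FALSE, select = seq(from = 2L, to = NN, by = 2L))

# Restore the original column names on both halves.
setnames(DT1, names(df))
setnames(DT2, names(df))

# Sanity check: pasting the halves back together reproduces every column.
ok <- mapply(function(a, b, orig) all(paste(a, b, sep = "/") == orig), DT1, DT2, df)
unlink(tmp)
```

Note that fread will typically parse the halves as integers here, so convert them with as.character if the original string type matters.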
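Answer 1 (score: 0)
Suggestion: use stri_split_fixed. Some benchmarks are shown below. (The code assumes that you read the data in as a matrix, convert it to a character vector, split on '/', and then reshape the result with matrix(prevOutput, nrow = origNrow, ncol = 2 * origNcol).)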
options(stringsAsFactors=FALSE)
library(rbenchmark)
library(stringi)
library(tidyr)
set.seed(1)
ncols <- 1
nrows <- 10*1000
strdat <- paste(sample(0:3, nrows*ncols, replace=TRUE),
                sample(0:3, nrows*ncols, replace=TRUE), sep="/")
benchmark(strsplitMtd=lapply(strdat, function(x) strsplit(x,"/")[[1]]),
          striMtd=stri_list2matrix(stri_split_fixed(strdat, "/"), byrow=TRUE),
          tidyrMtd=separate(data.frame(S=strdat), S, c("S1","S2"), "/"))
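The benchmark only times the split on a single character vector, so here is a rough sketch of the full flatten, split, and reshape steps described in the parenthetical above, applied to the data.frame from the question. The variable names (chr, parts, split_mat, left, right) are hypothetical, and using simplify = TRUE in place of stri_list2matrix is my choice rather than the answer's.

```r
library(stringi)

set.seed(1)
n <- 20
df <- data.frame(
  s1 = paste(sample(0:3, n, replace = TRUE), sample(0:3, n, replace = TRUE), sep = "/"),
  s2 = paste(sample(0:3, n, replace = TRUE), sample(0:3, n, replace = TRUE), sep = "/"),
  s3 = paste(sample(0:3, n, replace = TRUE), sample(0:3, n, replace = TRUE), sep = "/"),
  stringsAsFactors = FALSE
)

# Flatten the data.frame column by column into one character vector.
chr <- unlist(df, use.names = FALSE)

# One vectorised split; simplify = TRUE returns a character matrix with
# one row per input string and one column per piece (two here).
parts <- stri_split_fixed(chr, "/", simplify = TRUE)

# Reshape: the first ncol(df) columns hold the left-hand parts and the
# last ncol(df) columns hold the right-hand parts.
split_mat <- matrix(parts, nrow = nrow(df), ncol = 2L * ncol(df))

left  <- split_mat[, seq_len(ncol(df)), drop = FALSE]
right <- split_mat[, ncol(df) + seq_len(ncol(df)), drop = FALSE]
colnames(left) <- colnames(right) <- names(df)
```

The column layout of split_mat matches the output of the mclapply version in the question (all left-hand parts, then all right-hand parts). At 1,000 by 1,000,000 the flattened vector may not fit in memory at once, so processing blocks of columns is probably necessary.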
Please let me know if you need more detail, or if I have breached any etiquette.