I have a large vector (100M elements) of words:
words <- paste(letters,letters,letters,letters,sep="_")
(The words in the actual data are not all identical, but they are all the same fixed length.)
I want to convert them to a data frame with one column per letter position and one row per word. To do this I tried str_split_fixed
on the vector and rbind
on the result, but on a vector this large R freezes / takes forever.
The desired tabular output:
l1 l2 l3 l4
1 a a a a
2 b b b b
3 c c c c
Is there a faster way to do this?
Answer 0 (score: 7)
paste()
collapses the vector elements together, and fread()
parses the collapsed vector into a data.table/data.frame. As a function:
collapse2fread <- function(x, sep) {
  require(data.table)
  fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}
You could also try implementing it in C++ via the Rcpp
package for a bit more speed. Something like:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {
  int n = subject.size();
  std::string collapsed;
  for (int i = 0; i < n; i++) {
    collapsed += std::string(subject[i]) + collapseBy;
  }
  return collapsed;
}
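The core of collapse_cpp is plain C++; as a minimal standalone sketch (collapse_plain is a hypothetical name, with std::vector<std::string> standing in for Rcpp's CharacterVector), the same concatenation loop looks like:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Same concatenation loop as collapse_cpp, without Rcpp types.
// collapse_plain is a hypothetical standalone name for illustration.
std::string collapse_plain(const std::vector<std::string>& subject,
                           const std::string& collapseBy) {
    std::string collapsed;
    for (const std::string& s : subject) {
        collapsed += s + collapseBy;
    }
    return collapsed;
}
```

Note that, like collapse_cpp, this appends a trailing separator after the last element; fread() has no trouble with the trailing newline.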
Then we get:
collapse_cpp2fread <- function(x, sep) {
  require(data.table)
  fread(collapse_cpp(x, "\n"), sep = sep, header = FALSE)
}
microbenchmark(
paste0(words,collapse="\n"),
collapse_cpp(words,"\n"),
times=100)
Not a huge difference, but interesting:
> Unit: microseconds
> expr min lq median uq max neval
> paste0(words, collapse = "\\n") 7.297 7.7695 8.162 8.4255 33.824 100
> collapse_cpp(words, "\\n") 4.477 5.0095 5.117 5.3525 17.052 100
With a more realistic input:
words <- rep(paste0(letters[1:8], collapse = '_'), 1e5) # 100K elements
Benchmark:
microbenchmark(
do.call(rbind, strsplit(words, '_')),
fread(paste0(words,collapse="\n"),sep="_",header=FALSE),
fread(collapse_cpp(words,"\n"),sep="_",header=FALSE),
times=10)
this gives:
> Unit: milliseconds
>                                                               expr       min        lq    median        uq      max neval
>                               do.call(rbind, strsplit(words, "_")) 782.71782 796.19154 822.73694 854.22211 863.0790    10
>  fread(paste0(words, collapse = "\\n"), sep = "_", header = FALSE)  62.56164  64.13504  68.22512  71.96075 151.5969    10
>       fread(collapse_cpp(words, "\\n"), sep = "_", header = FALSE)  47.16362  47.78030  50.12867  52.23102 109.9770    10
So roughly a 12-16x improvement over do.call(rbind, ...)? Hope it helps!
Answer 1 (score: 2)
Expanding on the Rcpp-based solutions: if you can assume the structure of the input, it is easy to do all of this in Rcpp with minimal data copying.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List bazinga(CharacterVector txt, int nc) {
  int n = txt.size();
  std::vector<CharacterVector> columns(nc);
  for (int i = 0; i < nc; i++) {
    columns[i] = CharacterVector(n);
  }
  std::string tmp;
  for (int i = 0; i < n; i++) {
    const char* p = txt[i];
    for (int j = 0; j < nc; j++) {
      tmp = *p;          // current letter
      columns[j][i] = tmp;
      p += 2;            // skip the letter and the following separator
    }
  }
  List out = wrap(columns);
  return out;
}
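The pointer walk in bazinga can be sketched without Rcpp types as well (a minimal sketch, assuming one-byte letters and one-byte separators as in the question; split_columns is a hypothetical standalone counterpart, for illustration only):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fixed-stride split: character j of word i sits at byte offset 2*j
// (letter, separator, letter, separator, ...). split_columns is a
// hypothetical standalone counterpart to bazinga.
std::vector<std::vector<std::string>> split_columns(
        const std::vector<std::string>& txt, int nc) {
    std::vector<std::vector<std::string>> columns(
        nc, std::vector<std::string>(txt.size()));
    for (std::size_t i = 0; i < txt.size(); ++i) {
        const char* p = txt[i].c_str();
        for (int j = 0; j < nc; ++j) {
            columns[j][i] = std::string(1, *p);  // current letter
            p += 2;  // skip the letter and the following separator
        }
    }
    return columns;
}
```

Because each output string is built directly from the input bytes, no intermediate row matrix is allocated, which is where the copying savings come from.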
I get:
> microbenchmark(f(), bazinga(words, 8), collapse2fread(words,
+ "_"), collapse_cpp2fread(words, "_"), times = 10)
Unit: milliseconds
expr min lq median uq max neval
f() 830.21571 871.38955 899.07207 1001.18561 1299.15783 10
bazinga(words, 8) 26.26454 30.61620 33.37360 46.24160 64.09243 10
collapse2fread(words, "_") 59.96217 61.58535 67.20007 93.61615 97.85007 10
collapse_cpp2fread(words, "_") 46.79471 48.58391 49.99636 82.69684 119.88587 10
Answer 2 (score: 1)
If you are on Unix, you should take advantage of the command line; processing big data there and then bringing it into R is often faster. Here I write the words
vector to a file, then rewrite it with a Unix command called via R's system
function.
> words <- rep(paste0(letters[1:8], collapse = '_'), 1e5)
> cat(words, file = 'out.txt', sep = '\n')
> write.table(system(' cat out.txt | tr "_" " " ', intern = TRUE),
row.names = FALSE, col.names = FALSE,
quote = FALSE, file = 'out.txt')
> head(read.table('out.txt'))
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 a b c d e f g h
# 2 a b c d e f g h
# 3 a b c d e f g h
# 4 a b c d e f g h
# 5 a b c d e f g h
# 6 a b c d e f g h
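The heavy lifting above is the tr '_' ' ' step, which turns every underscore into the whitespace separator that read.table expects. That character replacement, sketched in C++ for clarity (tr_underscores is a hypothetical name):

```cpp
#include <algorithm>
#include <string>

// What tr '_' ' ' does to each line: replace every underscore
// with a space. tr_underscores is a hypothetical helper name.
std::string tr_underscores(std::string s) {
    std::replace(s.begin(), s.end(), '_', ' ');
    return s;
}
```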
The typical R do.call(rbind, ...)
approach:
f <- function()
{
  x <- do.call(rbind, strsplit(words, '_'))
  y <- data.frame(x)
  names(y) <- paste0('l', 1:ncol(y))
  return(y)
}
> microbenchmark(f())
# Unit: milliseconds
# expr min lq median uq max neval
# f() 818.2391 959.088 964.1105 989.081 997.8625 100