Question

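In R, what is the fastest way to convert a list of strings, each holding a sequence of numbers (stored as a character vector), into numeric values?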

With the following dummy data:

set.seed(2)
N = 1e7
ncol = 10
myT = formatC(matrix(runif(N), ncol = ncol)) # A matrix converted to characters
# Each row is collapsed into a single suite of characters:
myT = apply(myT, 1, function(x) paste(x, collapse=' ') ) 
head(myT)

which yields:

[1] "0.1849 0.855 0.8272 0.5403 0.3891 0.5184 0.7776 0.5533 0.1566 0.01591"  
[2] "0.7024 0.1008 0.9442 0.8582 0.3184 0.9289 0.9957 0.1311 0.2131 0.07355" 
[3] "0.5733 0.5493 0.3915 0.4423 0.8522 0.6042 0.9265 0.006878 0.7052 0.71"   
[... etc ...]

I can do:

library(stringi) 
# In the actual dataset, the number of spaces between numbers may vary, hence "\\s+"
system.time(newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty = TRUE), as.numeric))
newT <- unlist(newT) # Final goal is to have a single vector of numbers

On my 64-bit Intel Core i7 2.10GHz system with 16GB of RAM (running Ubuntu), this takes:

   user  system elapsed 
  3.748   0.008   3.757

With the real dataset (ncol=150 and N~1e9), this takes far too long. Is there a better option?

Answer 1

This is twice as fast on my system:

x <- paste(myT, collapse = "\n")
library(data.table)
DT <- fread(x)
newT2 <- c(t(DT))
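
The transpose in the last step is what keeps the numbers in their original order: R stores matrices column by column, so flattening without transposing would interleave values from different rows. A minimal base-R sketch of that idea, using a small hand-made matrix `m` as a hypothetical stand-in for the parsed table (it is not part of the original answer):

```r
# Hypothetical 2 x 3 matrix standing in for the parsed table:
# each row corresponds to one input string.
m <- matrix(c(0.1, 0.2, 0.3,
              0.4, 0.5, 0.6),
            nrow = 2, byrow = TRUE)

by_col <- c(m)     # column-major flattening: rows get interleaved
by_row <- c(t(m))  # transpose first: values come out row by row

print(by_col)
print(by_row)
```

Here `by_row` follows the order in which the numbers appear in the input strings, which is the order the question asks for.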

Answer 2

I would suggest the "iotools" package, specifically the mstrsplit function. You can do it like this:

library(iotools)
newT <- as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))

The "iotools" package is available on GitHub.

Timing comparison:

OPFun <- function(myT) {
  newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty = TRUE), as.numeric)
  unlist(newT)
}

RolandFun <- function(myT) {
  x <- paste(myT, collapse = "\n")
  DT <- fread(x)
  newT2 <- c(t(DT))
  newT2
}

AMFun <- function(myT) {
  as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
}

system.time(OP <- OPFun(myT))
#    user  system elapsed 
#   3.920   0.004   3.917 
system.time(Roland <- RolandFun(myT))
#    user  system elapsed 
#   3.156   0.020   3.175 
system.time(AM <- AMFun(myT))
#    user  system elapsed 
#   0.664   0.016   0.676 

all.equal(OP, Roland)
# [1] TRUE
all.equal(Roland, AM)
# [1] TRUE

Answer 3

mstrsplit(myT, sep = " ", type = "numeric")[, 1] is slightly faster. Note that the order of operations affects performance: unlist(lapply(x, as.numeric)) is slower than as.numeric(unlist(x)).
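
The order-of-operations point can be checked without any extra packages. The sketch below uses a small made-up vector `x` (an assumption for illustration, not the benchmark data) and base `strsplit()`. Both forms return the same numbers, but the second calls `as.numeric()` once on one long vector rather than once per string:

```r
# Made-up input: three strings of whitespace-separated numbers,
# with a varying number of spaces between them.
x <- c("0.5 0.25", "1  2 3", "4")

parts <- strsplit(x, "\\s+")  # list of character vectors, one per string

per_element <- unlist(lapply(parts, as.numeric))  # one as.numeric() call per string
all_at_once <- as.numeric(unlist(parts))          # a single vectorised call

# Both orderings produce the same numeric vector.
stopifnot(identical(per_element, all_at_once))
print(all_at_once)
```

On realistic sizes the difference would need to be measured, for instance with the benchmark that follows.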

set.seed(2)
N = 1e4
ncol = 10
myT = formatC(matrix(runif(N), ncol = ncol)) # A matrix converted to characters
myT = apply(myT, 1, function(x) paste(x, collapse=' ') ) 
head(myT)

library(microbenchmark)
library(stringi) 
library(data.table)
library(iotools)
microbenchmark(
  original = {
    newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty = TRUE), as.numeric)
    unlist(newT)
  },
  data.table = {
    x <- paste(myT, collapse = "\n")
    DT <- fread(x)
    c(t(DT))
  },
  iotools = {
    as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
  },
  strsplit = {
    as.numeric(unlist(strsplit(myT, " ")))
  },
  original2 = {
     as.numeric(unlist(stri_split_regex(myT, "\\s+", omit_empty = TRUE)))
  },
  iotools2 = {
    mstrsplit(myT, sep = " ", type = "numeric")[, 1]
  }
)
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval   cld
   original 52.03538 53.56949 56.02025 54.27165 55.40487  94.45513   100   c  
 data.table 93.10810 94.63730 98.04845 95.41537 96.51202 212.66666   100     e
    iotools 18.73776 19.44485 21.00974 19.75573 20.05614  42.47620   100 a    
   strsplit 67.04637 69.24053 70.58916 69.86529 70.95980  84.86132   100    d 
  original2 48.25558 49.47346 51.49833 50.14377 50.96139  84.22928   100  b   
   iotools2 18.53165 19.19126 19.72922 19.52567 19.71340  32.48726   100 a

Fastest way to convert a list of character vectors to numeric in R
