Question

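I have a good deal of experience programming in C and I am used to thinking in terms of pointers, so I can get good performance when dealing with large amounts of data. The same is not true of R, which I am still learning.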

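I have a file with approximately 1 million lines, separated by '\n', and each line contains 1, 2 or more integers separated by a ' '. I have been able to put together code that reads the file and puts everything into a list of lists. Some lines can be empty. I would then like to put the first number of each line, if it exists, into a separate list (simply skipping the line if it is empty), and the remaining numbers into a second list.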

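The code I post here is terribly slow (it had been running since I started writing this question, so I have now killed R). How can I get a decent speed? In C this would finish instantly.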

graph <- function() {
    x <- scan("result", what="", sep="\n")
    y <- strsplit(x, "[[:space:]]+") #use spaces for split number in each line
    y <- lapply(y, FUN = as.integer) #convert from a list of lists of characters to a list of lists of integers
    print("here we go")
    first <- c()
    others <- c()
    for(i in 1:length(y)) {
        if(length(y[i]) >= 1) { 
            first[i] <- y[i][1]
        }
        k <- 2;
        for(j in 2:length(y[i])) {
            others[k] <- y[i][k]
            k <- k + 1
        }
    }
}

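In a previous version of the code, where every line had at least one number and I was only interested in the first number of each line, I used this code (I read everywhere that I should avoid for loops in languages like R):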

yy <- rapply(y, function(x) head(x,1))

This takes about 5 seconds, which is far better than the code above but still annoying compared to C.

EDIT: here is an example of the first 10 lines of my file:

42 7 31 3 
23 1 34 5 


1 
-23 -34 2 2 

42 7 31 3 31 4 

1

Answer 1

Base R vs. purrr

library(purrr)  # provides map() and re-exports %>%

your_list <- rep(list(list(1,2,3,4), list(5,6,7), list(8,9)), 100)

microbenchmark::microbenchmark(
  your_list %>% map(1),
  lapply(your_list, function(x) x[[1]])
)
Unit: microseconds
                                  expr       min        lq       mean    median         uq       max neval
                  your_list %>% map(1) 22671.198 23971.213 24801.5961 24775.258 25460.4430 28622.492   100
 lapply(your_list, function(x) x[[1]])   143.692   156.273   178.4826   162.233   172.1655  1089.939   100

microbenchmark::microbenchmark(
  your_list %>% map(. %>% .[-1]),
  lapply(your_list, function(x) x[-1])
)
Unit: microseconds
                                 expr     min       lq      mean   median       uq      max neval
       your_list %>% map(. %>% .[-1]) 916.118 942.4405 1019.0138 967.4370 997.2350 2840.066   100
 lapply(your_list, function(x) x[-1]) 202.956 219.3455  264.3368 227.9535 243.8455 1831.244   100

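purrr is not a performance package, just a convenience one. That is great, but not when you care a lot about performance. This has been discussed elsewhere.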

By the way, if you are good at C, you should take a look at the Rcpp package.

Answer 2

Try this:

your_list <- list(list(1,2,3,4),
     list(5,6,7),
     list(8,9))

library(purrr)

first <- your_list %>% map(1)
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 5
# 
# [[3]]
# [1] 8

other <- your_list %>% map(. %>% .[-1])    
# [[1]]
# [[1]][[1]]
# [1] 2
# 
# [[1]][[2]]
# [1] 3
# 
# [[1]][[3]]
# [1] 4
# 
# 
# [[2]]
# [[2]][[1]]
# [1] 6
# 
# [[2]][[2]]
# [1] 7
# 
# 
# [[3]]
# [[3]][[1]]
# [1] 9

You might want the following instead, though, since it seems to me that these numbers would be better stored in vectors than in lists:

your_list %>% map(1) %>% unlist # as it seems map_dbl was slow
# [1] 1 5 8
your_list %>% map(~unlist(.x[-1]))
# [[1]]
# [1] 2 3 4
# 
# [[2]]
# [1] 6 7
# 
# [[3]]
# [1] 9
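
The question mentions that some lines can be empty, which the example list above does not cover. As a rough base-R sketch (not part of the original answer; the extra empty element and the name `non_empty` are assumptions made for illustration), empty entries can be dropped first and the same two results built with `vapply` and `lapply`:

```r
# Same toy data as above, plus one empty entry standing in for a blank line.
your_list <- list(list(1, 2, 3, 4), list(), list(5, 6, 7), list(8, 9))

# Keep only the entries that contain at least one number.
non_empty <- your_list[lengths(your_list) > 0]

# First number of each remaining entry, as a plain numeric vector.
first <- vapply(non_empty, function(x) x[[1]], numeric(1))

# Remaining numbers of each entry, each stored as a vector.
other <- lapply(non_empty, function(x) unlist(x[-1]))
```

Here `vapply` is used instead of `sapply` so that the result type is checked: it fails loudly if an entry does not yield exactly one number.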

Answer 3

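Indeed, coming from C to R can be confusing (it was for me). What helps with performance is understanding that the primitive types in R are all vectors implemented in highly optimized, natively compiled C and Fortran, and that you should avoid loops whenever a vectorized solution is available.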
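
To make the point about vectorization concrete, here is a minimal sketch of how the split from the question could be done without an explicit loop. It assumes the data is already a list of integer vectors like the `y` in the question; the sample values and the names `n`, `flat` and `starts` are made up for illustration.

```r
# Toy stand-in for the parsed file: one integer vector per line,
# with integer(0) representing an empty line.
y <- list(c(42L, 7L, 31L, 3L), integer(0), 1L, c(-23L, -34L, 2L, 2L))

n    <- lengths(y)   # how many numbers each line holds
flat <- unlist(y)    # all numbers in one vector, in line order

# Position in `flat` of the first number of every non-empty line.
starts <- (cumsum(n) - n + 1L)[n > 0]

first  <- flat[starts]    # first number of each non-empty line
others <- flat[-starts]   # every remaining number
```

All the work happens in a handful of vector operations, so nothing is grown element by element as in the loop from the question.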

That said, I think you should load it as a CSV. This will give you a data frame that you can use to perform vector-based operations.
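
As one possible illustration of loading the file into a data frame, the sketch below uses base R's `read.table` with `fill = TRUE`, since the file in the question is whitespace-separated with a varying number of fields per line. This is an assumption about how the loading could be done, not necessarily the function the answer had in mind; the temporary file and the sample lines are invented for the example.

```r
# Write a few sample lines (including a blank one) to a temporary file.
lines <- c("42 7 31 3", "23 1 34 5", "", "1", "-23 -34 2 2", "42 7 31 3 31 4")
tmp <- tempfile()
writeLines(lines, tmp)

# Widest line decides how many columns the data frame needs.
n_col <- max(count.fields(tmp), na.rm = TRUE)

# fill = TRUE pads short lines with NA; blank lines are skipped.
df <- read.table(tmp, header = FALSE, fill = TRUE, blank.lines.skip = TRUE,
                 col.names = paste0("V", seq_len(n_col)))

first <- df$V1                # first number of every non-empty line

m    <- t(as.matrix(df[-1]))  # transpose so values are read line by line
rest <- m[!is.na(m)]          # remaining numbers, padding removed
```

Passing `col.names` explicitly matters here because `read.table` otherwise guesses the column count from only the first few lines.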

For a better understanding, a concise (and humorous) read is http://www.burns-stat.com/pages/Tutor/R_inferno.pdf.

Answer 4

I would try the stringr package. Something like this:

set.seed(3)
d <- replicate(3, sample(1:1000, 3))
d <- apply(d, 2, function(x) paste(c(x, "\n"), collapse = " "))
d
# [1] "169 807 385 \n" "328 602 604 \n" "125 295 577 \n"


library(stringr)
str_split(d, " ", simplify = TRUE)
# [,1]  [,2]  [,3]  [,4]
# [1,] "169" "807" "385" "\n"
# [2,] "328" "602" "604" "\n"
# [3,] "125" "295" "577" "\n"

It is fast even with big data:

d <- replicate(1e6, sample(1:1000, 3))
d <- apply(d, 2, function(x) paste(c(x, "\n"), collapse = " "))
d
system.time(s <- str_split(d, " ", simplify = TRUE)) # 0.77 sec
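
If only the first number of each line is needed separately from the rest, a regular expression can also peel it off the raw strings before any full split. The following is a base-R sketch rather than part of the stringr approach above; the sample vector `x` and the name `rest_txt` are assumptions for illustration.

```r
# Sample lines as they might come out of readLines(), including a blank one.
x <- c("42 7 31 3 ", "", "1 ", "-23 -34 2 2 ")

# Drop lines that are empty or contain only whitespace.
x <- x[nzchar(trimws(x))]

# Capture the leading (possibly negative) integer of each line.
first <- as.integer(sub("^\\s*(-?\\d+).*$", "\\1", x, perl = TRUE))

# Remove that leading integer, then split what is left on whitespace.
rest_txt <- sub("^\\s*-?\\d+\\s*", "", x, perl = TRUE)
rest <- as.integer(unlist(strsplit(trimws(rest_txt), "\\s+")))
```

Both `sub` calls are vectorized over the whole character vector, so no per-line loop is written in R code.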

Answer 5

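Assuming that the file is in CSV format and that all the "numbers" are strictly of the form 1 2 or -1 2 (i.e., 1 2 3 or 1 23 are not allowed in the file), one could start by coding: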

# Install package `data.table` if needed
# install.packages('data.table')

# Load `data.table` package
library(data.table)

# Load the CSV, which has just one column named `my_number`.
# Then, coerce `my_number` into character format and remove negative signs.
DT <- fread('file.csv')[, my_number := as.character(abs(my_number))]

# Extract first character, which would be the first desired digit 
# if my assumption about number formats is correct.
DT[, first_column := substr(my_number, 1, 1)]

# The rest of the substring can go into another column.
DT[, second_column := substr(my_number, 2, nchar(my_number))]

Then, if you still need to create the two lists, you can do the following.

# Create the first list.
first_list <- DT[, as.list(first_column)]

# Create the second list.
second_list <- DT[, as.list(second_column)]

Fast way to separate a list of lists into two lists
