I have the following data frame and a dplyr pipeline to filter and mutate it:
library(tidyverse)
infile <- "https://nopaste.me/view/raw/767f65cf" # this link will exist forever
gene_list <- c("ITGAM","ARG1")
dat <- read_delim(infile, delim = ",", col_types = cols()) %>%
  mutate(log_TPM = log(TPM)) %>%
  filter(gene_symbol %in% gene_list)
dat
#> # A tibble: 236 × 5
#>    gene_symbol sample_id   TPM category       log_TPM
#>    <chr>       <chr>     <dbl> <chr>            <dbl>
#>  1 ARG1        SPL_128    2.32 Medication-  0.8415672
#>  2 ITGAM       SPL_128   14.92 Medication-  2.7027026
#>  3 ARG1        SPL_129    1.14 Medication-  0.1310283
#>  4 ITGAM       SPL_129   17.49 Medication-  2.8616293
#>  5 ARG1        SPL_130    8.02 Medication-  2.0819384
#>  6 ITGAM       SPL_130    3.65 Medication-  1.2947272
#>  7 ARG1        SPL_131    0.81 Medication- -0.2107210
#>  8 ITGAM       SPL_131    1.81 Medication-  0.5933268
#>  9 ARG1        SPL_132    0.00 Medication-       -Inf
#> 10 ITGAM       SPL_132    1.41 Medication-  0.3435897
#> # ... with 226 more rows
In reality the data contains around 5 million rows, not 236, and the dplyr version is very slow. What is the data.table way of doing this?
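(Side note, independent of which package is used: the pipeline above applies log() to every row and only then filters. Reordering the two steps, as both answers below do, means log() runs only on the rows that are kept. A small sketch of the same dplyr pipeline reordered, not benchmarked here:)

# filter first so log() is computed only on the retained rows
dat <- read_delim(infile, delim = ",", col_types = cols()) %>%
  filter(gene_symbol %in% gene_list) %>%
  mutate(log_TPM = log(TPM))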
Answer 0 (score: 2)
The data.table version of your commands would be:
require(data.table)
infile <- "https://nopaste.me/view/raw/767f65cf"
gene_list <- c("ITGAM","ARG1")
dat <- fread(infile)
dat <- dat[gene_symbol %in% gene_list]
dat[,log_TPM := log(TPM)]
Let me know if this helps.
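If the %in% filter ends up dominating, keying the table is also worth a try: setkey() sorts the table by the key column, so subsequent subsets on that column use a binary search instead of a full vector scan. A minimal sketch, assuming the same infile and gene_list as above (whether it actually helps will depend on the data):

require(data.table)
dat <- fread(infile)
setkey(dat, gene_symbol)                  # sort once; enables binary-search subsets
dat <- dat[.(gene_list), nomatch = NULL]  # keyed join; drops genes not in the table
dat[, log_TPM := log(TPM)]

Note that setkey() itself sorts the whole table, so this mainly pays off when the table is subset repeatedly after keying.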
Answer 1 (score: 2)
A data.table approach is:
library(data.table)
library(readr)  # read_delim() comes from readr, not data.table
d <- read_delim(infile, delim = ",", col_types = cols())
setDT(d)
d <- d[gene_symbol %in% gene_list, ][, log_TPM := log(TPM)]
That said, at least on my machine it doesn't really improve performance. Both take around half a second, which is not surprising: most of the time goes into the gene_symbol %in% gene_list step and, to a lesser extent, log(TPM).
# create a ~4.7-million-row version of the sample data (20,000 copies of the 236-row d)
large_data <- purrr::map_df(1:20000, ~ d)
large_data_dt <- as.data.table(large_data)
library(microbenchmark)
microbenchmark(
  dplyr = large_data %>%
    filter(gene_symbol %in% gene_list) %>%
    mutate(log_TPM = log(TPM)),
  dt = large_data_dt[gene_symbol %in% gene_list, ][, log_TPM := log(TPM)],
  times = 20
)
Results on my machine:
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval cld
 dplyr 364.2026 446.1865 494.3292 476.0633 533.4779 835.1898    20   a
    dt 385.1917 448.6515 550.0030 492.5638 592.3481 946.6732    20   a
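Since gene_symbol %in% gene_list is the dominant cost, one more variant worth timing is data.table's %chin% operator, an optimized %in% for character vectors. A sketch to append to the benchmark above (an untimed suggestion, not a measured result):

microbenchmark(
  dt_chin = large_data_dt[gene_symbol %chin% gene_list, ][, log_TPM := log(TPM)],
  times = 20
)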
What do you mean by "very slow"? Are you sure it isn't some other step that is slow?