Question

给定的脚本根据给定的条件将R中的虹膜数据设置为子集。然而，当相同的数据应用于1000万以上的非常大的数据时，解决方案就是折腾。请帮助我以更快的方式在更短的时间内获得相同的结果。感谢。

iris1 <- subset(iris,iris$Sepal.Length<=5 & iris$Sepal.Length >= 3)

Answer 1

您是否尝试过data.table？

执行此操作的标准方法是：

library(data.table)
iris <- iris
setDT(iris)
# you could try here "setkey(iris, Sepal.Length)" but shouldn't matter
iris1 <- iris[Sepal.Length %between% c(3, 5)]

Answer 2

我已经进行了以下基准测试，使用三种差异方法比较子集：基于R的subset函数，基于示例数据框的dplyr和data.table 10000050行。结果显示data.table是最快的。

library(dplyr)
library(data.table)
library(microbenchmark)

# Create example data frame with 10000050 rows
dat <- iris[rep(1:150, times = 66667), ]

# Create example tibble
dat_tbl <- as_tibble(dat)

# Create example data.table
dat_dt <- as.data.table(dat)

# Conduct benchmarking
per <- microbenchmark(# Method 1: The base R subset function
                      m1 = {subset(dat, Sepal.Length <= 5 & Sepal.Length >= 3)},
                      # Method 2: The dplyr method
                      m2 = {dat_tbl %>% filter(Sepal.Length <= 5, Sepal.Length >= 3)},
                      # Method 3: The data.table method
                      m3 = {dat_dt[Sepal.Length %between% c(3, 5), ]}, times = 1000L)

per
# Unit: milliseconds
# expr       min        lq     mean    median        uq      max neval
#   m1 381.17382 776.68467 961.2284 849.74441 1144.4295 2384.084  1000
#   m2 115.79736 137.92646 295.3928 146.91114  511.5023 1452.713  1000
#   m3  50.05212  63.57618 201.6751  70.98142  413.5645 1227.321  1000

# Plot the result
library(ggplot2)
autoplot(per)

Answer 3

对数据帧进行子集化的最快方法当然是使用-data.table-包将其转换为数据表。

# Load -data.table- package
library(data.table)

# Convert iris to a data table
dt <- as.data.table(iris)

# Subset data table using conditions on Sepal.Length
dt1 <- dt[Sepal.Length<=5 & Sepal.Length >= 3]

在R中自动化子集命令

3 个答案: