Question

I want to subset five data frames based on a series of 31 variables. The data frames are stored in a list:

long_data_sets <- list(scale_g1, scale_g2, scale_g3, scale_g4, scale_g5)

Each of the five data frames contains the same set of columns, including 31 factors named "speeder_225" through "speeder_375":

> str(scale_g1[53:83])
'data.frame':   5522 obs. of  31 variables:
$ speeder_225: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_230: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_235: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_240: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_245: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ... 
...

I want to subset each data frame on one of the 31 factor variables at a time, so that I end up with 5 * 31 new data frames.

I wrote a function for the subsetting that keeps only the two columns I need going forward ("direction" and "response"):

create_speeder_data <- function(x, y){
  # Return the subset directly so the result is visible when called at the top level
  subset(x, x[[y]] == "Speeder",
         select = c("direction", "response"))
}

This lets me create one new data frame at a time:

create_speeder_data(scale_g1, "speeder_225")

I tried applying the function with map2(), passing the list of 5 data frames and the vector of 31 factor names, but that does not work:

> speeder_var <- names(scale_g1[53:83])
> map2(long_data_sets, speeder_var, create_speeder_data)
Error: `.x` (5) and `.y` (31) are different lengths

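The closest I have come is to drop the y argument from the function and apply it to the list of five data frames for a single one of the 31 factors: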

# Create subsetting function for "speeder_225"
create_speeder_225_data <- function(x){
  df <- subset(x, x$speeder_225 == "Speeder",
               select = c("direction", "response"))
}

#Map function to list of data frames
z_speeder_225 <- map(long_data_sets, create_speeder_225_data)

#Change names of new data frames in list
names(long_data_sets) <- c("g1", "g2", "g3", "g4", "g5")
names(z_speeder_225) <- paste0(names(long_data_sets), "speeder_225")

#Get data frames from list
list2env(z_speeder_225, envir=.GlobalEnv)

I would have to repeat this 30 more times to get all 5 * 31 data frames. There must be a simpler way to do this.

Thank you very much for your help!

Answer 1

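I agree with @Mako212: you may want to reconsider what you are trying to do. That said, this should work.

The code below subsets every data set in the list. The test data contains 5 categorical variables, each with two levels. Because the output is based on only one level (speeding), the result is 5 x 5 = 25 data sets, organized as a list of lists (5 x 5):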

library(data.table)

# Creating some dummy data
k  <- 100
directions <- as.vector(sapply(c('North', 'West', 'South', 'East'), function (z) return(rep(z, k))))
speeding <- as.vector(sapply(c('speeding', 'not-speeding'), function (z) return(rep(z, k))))

# Test data - number_of_observations <= 4*k
createDataTable <- function(number_of_observations = 50){
  # Return the data.table directly; TRUE is spelled out instead of T
  data.table(direction = sample(x = directions, size = number_of_observations, replace = TRUE),
             speeder1 = sample(x = speeding, size = number_of_observations, replace = TRUE),
             speeder2 = sample(x = speeding, size = number_of_observations, replace = TRUE),
             speeder3 = sample(x = speeding, size = number_of_observations, replace = TRUE),
             speeder4 = sample(x = speeding, size = number_of_observations, replace = TRUE),
             speeder5 = sample(x = speeding, size = number_of_observations, replace = TRUE))
}

data_list <- lapply(X = floor(runif(n = 5, min = 50, max = 4*k)), 
                    FUN = function(z){createDataTable(z)})

# Subset dummy data based on one column at a time and return 
# the number of observations, direction, speeder2 and speeder3 from the subset 
cols <- sapply(1:5, function(z) paste('speeder',z,sep = ""))

ret <- lapply(cols, function(z){
  lapply(data_list, function(x){
    return(x[get(z) == 'speeding', .(nrows = .N, direction, speeder2, speeder3)])
  })
})
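
The same nested-loop pattern carries over to the plain data frames in the question. The following is only a minimal sketch in base R: the toy data produced by `make_toy()` is a hypothetical stand-in for `scale_g1` to `scale_g5` (two data frames and three speeder columns instead of five and 31), and `create_speeder_data()` is rewritten with bracket indexing rather than `subset()`.

```r
# Hypothetical toy data standing in for the question's data frames
set.seed(1)
make_toy <- function(n) {
  lv <- c("Non-Speeder", "Speeder")
  data.frame(
    direction   = sample(c("North", "South"), n, replace = TRUE),
    response    = sample(1:5, n, replace = TRUE),
    speeder_225 = factor(sample(lv, n, replace = TRUE), levels = lv),
    speeder_230 = factor(sample(lv, n, replace = TRUE), levels = lv),
    speeder_235 = factor(sample(lv, n, replace = TRUE), levels = lv)
  )
}
long_data_sets <- list(g1 = make_toy(20), g2 = make_toy(30))

# Pick up the speeder columns by name instead of by position
speeder_var <- grep("^speeder_", names(long_data_sets[[1]]), value = TRUE)

# Keep only the rows flagged "Speeder" in column y, and only two columns
create_speeder_data <- function(x, y) {
  x[x[[y]] == "Speeder", c("direction", "response"), drop = FALSE]
}

# Outer loop over data frames, inner loop over speeder columns
nested <- lapply(long_data_sets, function(df) {
  out <- lapply(speeder_var, function(v) create_speeder_data(df, v))
  names(out) <- speeder_var
  out
})

# Flatten one level: names become "g1.speeder_225", "g1.speeder_230", ...
flat <- unlist(nested, recursive = FALSE)
names(flat)
```

Keeping the results in the named list `flat` (rather than pushing them into the global environment with `list2env()`) makes it easy to loop over all subsets later, for example with `lapply(flat, nrow)`.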

The structure of ret is what we expect: each item is a list of 5 data.table objects, and each object has 4 columns.

> summary(ret)
     Length Class  Mode
[1,] 5      -none- list
[2,] 5      -none- list
[3,] 5      -none- list
[4,] 5      -none- list
[5,] 5      -none- list
> summary(ret[[1]])
     Length Class      Mode
[1,] 4      data.table list
[2,] 4      data.table list
[3,] 4      data.table list
[4,] 4      data.table list
[5,] 4      data.table list

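A quick way to check that the code runs and that the subsetting is correct is a simplified call that returns only the number of observations (rows) matching the subset condition: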

> unlist(lapply(cols, function(z){
+     lapply(data_list, function(x){
+         return(x[get(z) == 'speeding', .(nrows = .N)])
+     })
+ }))
nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows 
  113    82    24   112   185    97    63    22   110   193   103    78    35   115   197   110    74 
nrows nrows nrows nrows nrows nrows nrows nrows 
   26   103   194   107    84    25    97   191
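
As for the `map2()` error in the question: `map2()` walks its two inputs in parallel, pairing element 1 with element 1 and so on, so a list of 5 and a vector of 31 cannot be matched up. One possible workaround is to build every (data frame, column) pair first and then map over the pairs. The sketch below uses base R `expand.grid()` and `Map()`; the toy data and the names `make_toy`, `combos`, and `res` are assumptions made for illustration, not part of the original post.

```r
# Hypothetical toy data: two data frames, three speeder columns
set.seed(42)
make_toy <- function(n) {
  lv <- c("Non-Speeder", "Speeder")
  data.frame(
    direction   = sample(c("North", "South"), n, replace = TRUE),
    response    = sample(1:5, n, replace = TRUE),
    speeder_225 = factor(sample(lv, n, replace = TRUE), levels = lv),
    speeder_230 = factor(sample(lv, n, replace = TRUE), levels = lv),
    speeder_235 = factor(sample(lv, n, replace = TRUE), levels = lv)
  )
}
long_data_sets <- list(g1 = make_toy(20), g2 = make_toy(30))
speeder_var <- grep("^speeder_", names(long_data_sets[[1]]), value = TRUE)

create_speeder_data <- function(x, y) {
  x[x[[y]] == "Speeder", c("direction", "response"), drop = FALSE]
}

# All combinations of data-frame name and speeder column (2 x 3 = 6 rows here)
combos <- expand.grid(set = names(long_data_sets),
                      var = speeder_var,
                      stringsAsFactors = FALSE)

# Both inputs now have equal length, so a pairwise mapper works
res <- Map(function(s, v) create_speeder_data(long_data_sets[[s]], v),
           combos$set, combos$var)
names(res) <- paste(combos$set, combos$var, sep = "_")
names(res)
```

With the real data this would give a single named list of 5 * 31 = 155 data frames. Since `combos$set` and `combos$var` have the same length, `purrr::map2()` should accept them in place of `Map()`.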

How to subset multiple data frames using multiple variables
