I want to subset five data frames based on a series of 31 variables. The data frames are stored in a list:
long_data_sets <- list(scale_g1, scale_g2, scale_g3, scale_g4, scale_g5)
Each of the five data frames contains the same set of columns, including 31 factors named "speeder_225" through "speeder_375":
> str(scale_g1[53:83])
'data.frame': 5522 obs. of 31 variables:
$ speeder_225: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_230: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_235: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_240: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_245: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
...
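I want to subset each data frame on one of the 31 factor variables at a time, so that I end up with 5 * 31 new data frames.
I wrote a subsetting function that keeps only the two columns I need going forward ("direction" and "response"):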
create_speeder_data <- function(x, y){
df <- subset(x, x[,y] == "Speeder",
select = c("direction", "response"))
}
This lets me create one new data frame at a time:
create_speeder_data(scale_g1, "speeder_225")
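I tried applying the function with map2(), passing the list of 5 data frames and the vector of 31 factor names, but that does not work because the two inputs have different lengths: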
> speeder_var <- names(scale_g1[53:83])
> map2(long_data_sets, speeder_var, create_speeder_data)
Error: `.x` (5) and `.y` (31) are different lengths
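The closest I have come is to drop the y argument from the function, hard-code one of the 31 factors, and apply that function to the list of five data frames: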
#Create subsetting function for "speeder_225"
create_speeder_225_data <- function(x){
df <- subset(x, x$speeder_225 == "Speeder",
select = c("direction", "response"))
}
#Map function to list of data frames
z_speeder_225 <- map(long_data_sets, create_speeder_225_data)
#Change names of new data frames in list
names(long_data_sets) <- c("g1", "g2", "g3", "g4", "g5")
names(z_speeder_225) <- paste0(names(long_data_sets), "speeder_225")
#Get data frames from list
list2env(z_speeder_225, envir=.GlobalEnv)
I would have to repeat this 30 more times to get all 5 * 31 data frames. There must be an easier way to do this.
Thank you very much for your help!
Answer 0 (score: 0)
I agree with @Mako212: you may want to rethink what you are trying to do. That said, this should work.
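The following code subsets every data set in the list. The test data has 5 categorical variables, each with two levels. Because the output is based on only one level (speeding), the result is 5 x 5 = 25 data sets, organized as a list of lists (5 x 5):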
library(data.table)
# Creating some dummy data
k <- 100
directions <- as.vector(sapply(c('North', 'West', 'South', 'East'), function (z) return(rep(z, k))))
speeding <- as.vector(sapply(c('speeding', 'not-speeding'), function (z) return(rep(z, k))))
# Test data - number_of_observations <= 4*k
createDataTable <- function(number_of_observations = 50){
dt <- data.table(direction = sample(x = directions, size = number_of_observations, replace = T),
speeder1 = sample(x = speeding, size = number_of_observations, replace = T),
speeder2 = sample(x = speeding, size = number_of_observations, replace = T),
speeder3 = sample(x = speeding, size = number_of_observations, replace = T),
speeder4 = sample(x = speeding, size = number_of_observations, replace = T),
speeder5 = sample(x = speeding, size = number_of_observations, replace = T))
}
data_list <- lapply(X = floor(runif(n = 5, min = 50, max = 4*k)),
FUN = function(z){createDataTable(z)})
# Subset dummy data based on one column at a time and return
# the number of observations, direction, speeder2 and speeder3 from the subset
cols <- sapply(1:5, function(z) paste('speeder',z,sep = ""))
ret <- lapply(cols, function(z){
lapply(data_list, function(x){
return(x[get(z) == 'speeding', .(nrows = .N, direction, speeder2, speeder3)])
})
})
ret has the structure we expect: each item is a list of 5 data.table objects, and each object has 4 columns.
> summary(ret)
Length Class Mode
[1,] 5 -none- list
[2,] 5 -none- list
[3,] 5 -none- list
[4,] 5 -none- list
[5,] 5 -none- list
> summary(ret[[1]])
Length Class Mode
[1,] 4 data.table list
[2,] 4 data.table list
[3,] 4 data.table list
[4,] 4 data.table list
[5,] 4 data.table list
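A quick way to check that the code runs and that the subsetting is correct is a simplified call that returns only the number of observations (rows) matching the subset condition: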
> unlist(lapply(cols, function(z){
+ lapply(data_list, function(x){
+ return(x[get(z) == 'speeding', .(nrows = .N)])
+ })
+ }))
nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows
113 82 24 112 185 97 63 22 110 193 103 78 35 115 197 110 74
nrows nrows nrows nrows nrows nrows nrows nrows
26 103 194 107 84 25 97 191
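The same nested-lapply idea can be carried back to the data frames in the question without data.table. The following is only a minimal base R sketch: make_df() and the two small data frames are hypothetical stand-ins for scale_g1 through scale_g5, and create_speeder_data() is rewritten with bracket indexing instead of subset(), which is an assumption about what is acceptable rather than the original function.

```r
# Hypothetical stand-ins for the real data frames (assumption: the real ones
# have "direction", "response" and the speeder_* factor columns).
set.seed(1)
make_df <- function(n) {
  lv <- c("Non-Speeder", "Speeder")
  data.frame(
    direction   = sample(c("North", "South"), n, replace = TRUE),
    response    = sample(1:5, n, replace = TRUE),
    speeder_225 = factor(sample(lv, n, replace = TRUE), levels = lv),
    speeder_230 = factor(sample(lv, n, replace = TRUE), levels = lv),
    speeder_235 = factor(sample(lv, n, replace = TRUE), levels = lv)
  )
}
long_data_sets <- list(g1 = make_df(40), g2 = make_df(60))

# Pick up the factor columns by name instead of by position.
speeder_var <- grep("^speeder_", names(long_data_sets[[1]]), value = TRUE)

# Keep the rows where column y equals "Speeder"; which() drops NA matches.
create_speeder_data <- function(x, y) {
  x[which(x[[y]] == "Speeder"), c("direction", "response"), drop = FALSE]
}

# Outer loop over factor names, inner loop over data frames.
nested <- lapply(setNames(speeder_var, speeder_var), function(v) {
  lapply(long_data_sets, create_speeder_data, y = v)
})

# Flatten one level; the names combine both levels, e.g. "speeder_225.g1".
flat <- unlist(nested, recursive = FALSE)
names(flat)
```

Keeping the results in the named list flat (or in nested) is usually easier to work with than pushing 155 objects into the global environment with list2env(), although that call would still work on flat if separate objects are really needed.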