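I want to generate a list of data frames and apply the same function to each of them. I can't see how to do this elegantly without writing many lines of code.
Starting from the data frame df,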
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),
rep("silver", 4),
rep("bronze", 4))
df <- data.frame(id, x, y, z, type)
I create a bunch of other datasets using a simple threshold rule on one variable:
df_25 <- df[df$x < 25,]
df_20 <- df[df$x < 20,]
# and so on
Then I apply a function to each dataset; I can either do this for each dataset individually or for a list of datasets:
# individually (the pipe %>% needs dplyr or magrittr to be loaded)
library(dplyr)
df_18 <- df_18 %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(nb1 = sum(x),
                nb2 = sum(x != 25))
# to a list
ls1 <- list(df_25, df_20)
# takes one data frame and returns it with the per-id summaries added
func_1 <- function(d) {
  d %>%
    dplyr::group_by(id) %>%
    dplyr::mutate(nb1 = sum(x),
                  nb2 = sum(x != 25))
}
ls1 <- lapply(ls1, func_1)
df_25 <- ls1[[1]]
df_20 <- ls1[[2]]
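Either way, this takes a lot of lines and a lot of time, since I have very large datasets to process. How can I simplify and speed up both the generation of the datasets, with properly identifiable names, and the creation of the new variables via the function defined above?
I haven't found a proper answer to this two-part question, so any help is welcome!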
Answer 0 (score: 3)
You can define a threshold vector and lapply over it, doing the aggregation inside. In base R, it could look like this:
threshold <- c(22, 24, 26)
res <- setNames(lapply(threshold, function(s) {
  sst <- df[df$x < s, ]
  # per-id sums, merged back onto the subset by id
  merge(sst,
        with(sst, aggregate(list(nb1 = x, nb2 = x != 25),
                            by = list(id = id), FUN = sum)),
        by = "id")
}), threshold)
res
# $`22`
# id x y z type nb1 nb2
# 1 a 20.92786 37.61272 69976.23 gold 20.92786 1
# 2 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 3 c 18.58916 46.08353 69985.98 silver 18.58916 1
#
# $`24`
# id x y z type nb1 nb2
# 1 a 22.73948 44.29524 70002.81 gold 43.66734 2
# 2 a 20.92786 37.61272 69976.23 gold 43.66734 2
# 3 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 4 c 18.58916 46.08353 69985.98 silver 18.58916 1
#
# $`26`
# id x y z type nb1 nb2
# 1 a 22.73948 44.29524 70002.81 gold 43.66734 2
# 2 a 20.92786 37.61272 69976.23 gold 43.66734 2
# 3 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 4 c 18.58916 46.08353 69985.98 silver 44.24036 2
# 5 c 25.65120 44.85778 70008.81 bronze 44.24036 2
# 6 d 24.84056 49.22505 69993.87 bronze 24.84056 1
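To address the "identifiable names" part of the question, the list elements can be named after the thresholds, following the df_<threshold> scheme used in the question. The following is only a minimal sketch on made-up toy data: apply_rule is a hypothetical helper name, and it swaps the merge/aggregate combination above for ave(), which returns group-wise values already aligned with the rows.

```r
# Toy data (hypothetical values, chosen only for illustration)
df <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(19, 23, 21, 26, 24),
  stringsAsFactors = FALSE
)

# Hypothetical helper: subset by threshold s, then add per-id summaries
apply_rule <- function(s, data) {
  sst <- data[data$x < s, ]
  # ave() computes FUN within each id and repeats it on every row of the group
  sst$nb1 <- ave(sst$x, sst$id, FUN = sum)
  sst$nb2 <- ave(as.numeric(sst$x != 25), sst$id, FUN = sum)
  sst
}

threshold <- c(20, 25)
res <- lapply(threshold, apply_rule, data = df)
names(res) <- paste0("df_", threshold)   # "df_20", "df_25"

res[["df_25"]]   # access one result by its name
```

Keeping everything in one named list is usually easier than creating separate objects; if separate objects are really wanted, list2env(res, envir = globalenv()) would create them from the names.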
Data
df <- structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"),
x = c(22.7394803492982, 20.927856140076, 30.2395154764033,
26.6955462205898, 20.6427460111819, 18.589158456851, 25.6511987559726,
24.8405634272769, 28.8534602413068, 26.5376546472448), y = c(44.2952365501829,
37.6127198429065, 45.2842176546081, 40.3835729432985, 38.0205610647157,
46.083525703352, 44.8577760657779, 49.2250487481642, 40.2699166395278,
49.3740993403725), z = c(70002.8091832317, 69976.2314543058,
70000.9974233725, 70011.435897774, 69997.249180665, 69985.9786882474,
70008.8088326676, 69993.8665395223, 69998.7334115052, 70001.2935411788
), type = structure(c(2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L), .Label = c("bronze", "gold", "silver"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
Answer 1 (score: 2)
Use purrr::map to iterate over a vector of thresholds:
library(dplyr)
library(purrr)
# set_names() names each element after its threshold; .x is the threshold, x the column
map(c(18, 20, 25) %>% set_names(),
    ~ df %>%
      filter(x < .x) %>%
      group_by(id) %>%
      mutate(nb1 = sum(x),
             nb2 = sum(x != 25)))
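Because the list is named by threshold, the pieces can also be stacked into a single data frame with a column recording which threshold produced each row. This is a sketch under the assumption that one long table is acceptable for the downstream work; the toy data and the column name threshold are made up for illustration.

```r
library(dplyr)
library(purrr)

# Toy data (hypothetical values, chosen only for illustration)
df <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(19, 23, 21, 26, 24),
  stringsAsFactors = FALSE
)

thresholds <- c(20, 25)

# One data frame per threshold, named after the threshold
res <- thresholds %>%
  set_names() %>%
  map(~ df %>%
        filter(x < .x) %>%
        group_by(id) %>%
        mutate(nb1 = sum(x), nb2 = sum(x != 25)) %>%
        ungroup())

# Stack the pieces; the list names end up in the "threshold" column
combined <- bind_rows(res, .id = "threshold")
```

A single table with an identifier column avoids juggling many similarly named objects, at the cost of repeating rows that satisfy more than one threshold.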
Or use map_if to apply the computation only to those subsets of df with nrow() > 1; the other elements are replaced by NA.
map_if(c(18, 20, 25) %>% set_names(),
       # predicate: does the subset for this threshold have more than one row?
       ~ df %>% filter(x < .x) %>% nrow() > 1,
       ~ df %>%
         filter(x < .x) %>%
         group_by(id) %>%
         mutate(nb1 = sum(x),
                nb2 = sum(x != 25)),
       .else = ~ NA)
Answer 2 (score: 0)
Using the tidyverse, we can combine all of these operations in a single chain.
library(tidyverse)
df %>%
  # dplyr >= 1.0.0 spells this argument .keep; older versions used keep
  group_split(x > 25, .keep = FALSE) %>%
  map(. %>% group_by(id) %>% mutate(nb1 = sum(x), nb2 = sum(x != 25)))
#[[1]]
# A tibble: 6 x 7
# Groups: id [5]
# id x y z type nb1 nb2
# <fct> <dbl> <dbl> <dbl> <fct> <dbl> <int>
#1 a 21.4 42.9 70001. gold 21.4 1
#2 b 18.0 45.3 70005. silver 18.0 1
#3 c 23.3 42.7 70006. bronze 23.3 1
#4 d 23.4 40.9 69990. bronze 46.7 2
#5 d 23.3 41.2 70000. bronze 46.7 2
#6 e 22.3 55.9 69991. bronze 22.3 1
#[[2]]
# A tibble: 4 x 7
# Groups: id [3]
# id x y z type nb1 nb2
# <fct> <dbl> <dbl> <dbl> <fct> <dbl> <int>
#1 a 25.8 40.5 69995. gold 25.8 1
#2 b 28.3 41.5 69996. silver 54.5 2
#3 b 26.3 49.3 69993. silver 54.5 2
#4 c 26.5 44.5 69986. silver 26.5 1
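Here I split the data into two groups based on the value of x: the first group holds the rows where x is not greater than 25 and the second holds the rows where x is greater than 25. You can change the logic as needed.
This gives you a list of data frames as output, which you can access individually.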
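Note that group_split() returns an unnamed list. If identifiable names matter, one option is base R's split(), which names each piece after its grouping value. The sketch below assumes that disjoint bins are acceptable (unlike the nested x < 25 and x < 20 subsets in the question); the toy data and the labels "x_le_25" and "x_gt_25" are invented for illustration.

```r
library(dplyr)
library(purrr)

# Toy data (hypothetical values, chosen only for illustration)
df <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(19, 23, 21, 26, 27),
  stringsAsFactors = FALSE
)

# Label each row by the side of the cutoff it falls on
bin <- ifelse(df$x > 25, "x_gt_25", "x_le_25")

# split() names the pieces after the labels; map() keeps those names
res <- split(df, bin) %>%
  map(~ .x %>%
        group_by(id) %>%
        mutate(nb1 = sum(x), nb2 = sum(x != 25)) %>%
        ungroup())

res[["x_le_25"]]   # access one piece by its label
```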
Data
set.seed(1234)
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),rep("silver", 4),rep("bronze", 4))
df <- data.frame(id, x, y, z, type)