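Generate a list of data frames and apply a function to each

Time: 2019-05-09 09:03:50

Tags: r dataframe for-loop lapply

I want to generate a list of data frames and apply the same function to each of them. I don't know how to do this elegantly without a large number of lines of code.

Starting from the data frame df,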

id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),
            rep("silver", 4),
            rep("bronze", 4))
df <- data.frame(id, x, y, z, type)

I create a bunch of other datasets using a simple threshold rule on one variable:

df_25 <- df[df$x < 25,]
df_20 <- df[df$x < 20,] 
# and so on

Then I apply a function to each dataset; I can do this either individually for each dataset or on a list of datasets:

# individually (df_18 is one of the threshold subsets created as above)
library(dplyr)  # provides %>%
df_18 <- df_18 %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(nb1 = sum(x),
                nb2 = sum(x != 25))

# to a list 
ls1 <- list(df_25, df_20)

# group by id and add the per-group summaries as new columns
func_1 <- function(data) {
  data %>%
    dplyr::group_by(id) %>%
    dplyr::mutate(nb1 = sum(x),
                  nb2 = sum(x != 25))
}

# apply func_1 to each data frame in the list
ls1 <- lapply(ls1, func_1)


df_25 <- ls1[[1]]

df_20 <- ls1[[2]]

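Either way, since I have to handle very large datasets, this takes many lines and a lot of time. How can I simplify and speed up both the generation of datasets with properly identifiable names and the creation of the new variables through the function defined above?

I have not found a proper answer to this two-part question, so any help is welcome!

3 Answers:

Answer 0: (score: 3)

You can define a threshold vector and run the aggregation inside lapply. In base R, it could look like this: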

threshold <- c(22, 24, 26)

res <- setNames(lapply(threshold, function(s) {
  sst <- df[df$x < s, ]  # subset for this threshold
  # per-id sums, merged back onto the subset by id
  merge(sst, 
        with(sst, aggregate(list(nb1=x, nb2=x != 25), 
                            by=list(id=id), sum)), by="id")
}), threshold)

res
# $`22`
#   id        x        y        z   type      nb1 nb2
# 1  a 20.92786 37.61272 69976.23   gold 20.92786   1
# 2  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 3  c 18.58916 46.08353 69985.98 silver 18.58916   1
# 
# $`24`
#   id        x        y        z   type      nb1 nb2
# 1  a 22.73948 44.29524 70002.81   gold 43.66734   2
# 2  a 20.92786 37.61272 69976.23   gold 43.66734   2
# 3  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 4  c 18.58916 46.08353 69985.98 silver 18.58916   1
# 
# $`26`
#   id        x        y        z   type      nb1 nb2
# 1  a 22.73948 44.29524 70002.81   gold 43.66734   2
# 2  a 20.92786 37.61272 69976.23   gold 43.66734   2
# 3  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 4  c 18.58916 46.08353 69985.98 silver 44.24036   2
# 5  c 25.65120 44.85778 70008.81 bronze 44.24036   2
# 6  d 24.84056 49.22505 69993.87 bronze 24.84056   1
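
Because res is a named list, each subset can be pulled out by its threshold, and the whole list can be stacked into one data frame if that is easier to work with. The following is only a sketch of that idea: the toy data frame and the names toy and stacked are made up for illustration and are not part of the original answer.

```r
# Hypothetical toy data standing in for df (values invented for illustration)
toy <- data.frame(id = c("a", "a", "b", "c"),
                  x  = c(21, 23, 25, 27))

threshold <- c(22, 24, 26)

# Same pattern as above: subset by threshold, sum per id, merge back
res <- setNames(lapply(threshold, function(s) {
  sst <- toy[toy$x < s, ]
  agg <- aggregate(list(nb1 = sst$x, nb2 = sst$x != 25),
                   by = list(id = sst$id), FUN = sum)
  merge(sst, agg, by = "id")
}), threshold)

# Access a single subset by its threshold name
res[["24"]]

# Stack all subsets, recording the threshold each row came from
stacked <- do.call(rbind, Map(function(d, s) cbind(threshold = s, d),
                              res, names(res)))
```

Note that aggregate() stops with an error when a subset has no rows, so a threshold below the smallest x would need to be handled separately.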

Data

df <- structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 
4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"), 
    x = c(22.7394803492982, 20.927856140076, 30.2395154764033, 
    26.6955462205898, 20.6427460111819, 18.589158456851, 25.6511987559726, 
    24.8405634272769, 28.8534602413068, 26.5376546472448), y = c(44.2952365501829, 
    37.6127198429065, 45.2842176546081, 40.3835729432985, 38.0205610647157, 
    46.083525703352, 44.8577760657779, 49.2250487481642, 40.2699166395278, 
    49.3740993403725), z = c(70002.8091832317, 69976.2314543058, 
    70000.9974233725, 70011.435897774, 69997.249180665, 69985.9786882474, 
    70008.8088326676, 69993.8665395223, 69998.7334115052, 70001.2935411788
    ), type = structure(c(2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 
    1L), .Label = c("bronze", "gold", "silver"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))

Answer 1: (score: 2)

Use purrr::map to iterate over the vector of thresholds:

library(dplyr)
library(purrr)
map(c(18,20,25) %>%set_names() , ~ df %>% filter(x<.x) %>% 
                          group_by(id) %>%
                          mutate(nb1= sum(x),
                          nb2 = sum(x != 25)))

Or use map_if to apply the computation only to those subsets of df with nrow() > 1:

map_if(c(18,20,25) %>%set_names(), ~df %>% filter(x<.x) %>% nrow()>1,
                    ~df %>% filter(x<.x) %>% group_by(id) %>%
                            mutate(nb1= sum(x),
                            nb2 = sum(x != 25)), .else = ~NA)
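
The question also asked for identifiable names such as df_25. One possible way to get them, sketched here with a made-up toy data frame, is to pass explicit names to set_names() and, if useful, stack the results with bind_rows(). The names toy, thresholds, subsets, combined and source are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for df (values invented for illustration)
toy <- data.frame(id = c("a", "a", "b", "c"),
                  x  = c(17, 19, 24, 30))

thresholds <- c(18, 20, 25)

subsets <- thresholds %>%
  set_names(paste0("df_", thresholds)) %>%   # names: df_18, df_20, df_25
  map(function(s) {
    toy %>%
      filter(x < s) %>%
      group_by(id) %>%
      mutate(nb1 = sum(x), nb2 = sum(x != 25)) %>%
      ungroup()
  })

# Each subset is now reachable by name
subsets$df_20

# Optionally stack everything, keeping the list name in a "source" column
combined <- bind_rows(subsets, .id = "source")
```

Keeping the subsets in one named list, rather than assigning df_18, df_20 and so on to separate variables, means any later step can again be written as a single map() or lapply() call.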

Answer 2: (score: 0)

Using the tidyverse, we can combine all of these operations in a single chain.

library(tidyverse)

df %>%
  group_split(x > 25, .keep = FALSE) %>%  # dplyr >= 1.0; older versions spell it keep
  map(. %>% group_by(id) %>% mutate(nb1= sum(x),nb2 = sum(x != 25)))


#[[1]]
# A tibble: 6 x 7
# Groups:   id [5]
#  id        x     y      z type     nb1   nb2
#  <fct> <dbl> <dbl>  <dbl> <fct>  <dbl> <int>
#1 a      21.4  42.9 70001. gold    21.4     1
#2 b      18.0  45.3 70005. silver  18.0     1
#3 c      23.3  42.7 70006. bronze  23.3     1
#4 d      23.4  40.9 69990. bronze  46.7     2
#5 d      23.3  41.2 70000. bronze  46.7     2
#6 e      22.3  55.9 69991. bronze  22.3     1

#[[2]]
# A tibble: 4 x 7
# Groups:   id [3]
#  id        x     y      z type     nb1   nb2
#  <fct> <dbl> <dbl>  <dbl> <fct>  <dbl> <int>
#1 a      25.8  40.5 69995. gold    25.8     1
#2 b      28.3  41.5 69996. silver  54.5     2
#3 b      26.3  49.3 69993. silver  54.5     2
#4 c      26.5  44.5 69986. silver  26.5     1

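Here I split the data into two groups based on the value of x: the first group holds the rows where x is 25 or less, the second those where x is greater than 25. You can change the logic to suit your needs.

This gives you a list of data frames as output, which you can access individually.

Data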
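
group_split() returns its pieces without names, so they can only be accessed by position. If named pieces are preferred, one alternative is base R's split(), which uses the grouping labels as list names. The sketch below swaps in split() and ave() for the tidyverse calls; the toy data and the labels above_25 and at_most_25 are made up for illustration.

```r
# Hypothetical toy data standing in for df (values invented for illustration)
toy <- data.frame(id = c("a", "b", "b", "c"),
                  x  = c(21, 28, 26, 23))

# split() keeps the grouping labels as the names of the list elements
parts <- split(toy, ifelse(toy$x > 25, "above_25", "at_most_25"))

# Add the per-id summaries to every piece; ave() returns one value per row
parts <- lapply(parts, function(d) {
  d$nb1 <- ave(d$x, d$id, FUN = sum)
  d$nb2 <- ave(as.numeric(d$x != 25), d$id, FUN = sum)
  d
})

# Access a piece by name instead of by position
parts$above_25
```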

set.seed(1234)
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),rep("silver", 4),rep("bronze", 4))
df <- data.frame(id, x, y, z, type)