Question

I have a large data set stored in a long data frame. I want to extract the data for some of the variables and generate new data using a formula. All the necessary information should be taken from the formula itself. First, I want to use the information in the formula to filter the data set down to the relevant variables; for this I use the all.vars() function. I also rely on the formula.tools package from CRAN, which makes it easy to extract the left-hand and right-hand sides of the equation (lhs and rhs, respectively).

library(dplyr)
library(reshape2)
library(formula.tools)

set.seed(100)
the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
                       year = c(2000, 2010, 2020),
                       variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
                       value = rnorm(27, 100, 100))

add_variable <- function(df, equation){
  df <- filter(df, variable %in% all.vars(equation))
  df <- dcast(df, country + year ~ variable)
  df <- mutate_(df, rhs(equation))
  # code to keep only the newly generated column
  # ...
  df <- melt(df, id.vars = c("country", "year"))
}

result <- add_variable(the_data, GDPpC ~ GDP / Population)

The newly generated column should be named GDPpC, but at the moment it is called GDP/Population. How can this be improved? In the last step I would also like to filter the data so that only the newly generated data is included in the result, which can then be appended to the source data frame.
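
For readers unfamiliar with how a formula can be taken apart, here is a minimal base R sketch of the pieces the question relies on. It assumes a two-sided formula and uses plain list-style indexing instead of formula.tools; the `toy` data frame is a made-up illustration, not the data from the question.

```r
equation <- GDPpC ~ GDP / Population

# Every variable name that appears anywhere in the formula
vars <- all.vars(equation)

# A two-sided formula is a call to `~`:
# element 2 is the left-hand side, element 3 the right-hand side
lhs_expr <- equation[[2]]
rhs_expr <- equation[[3]]

# deparse() turns the expressions back into text; the deparsed right-hand
# side is the kind of default column name the question complains about
lhs_name <- deparse(lhs_expr)
rhs_name <- deparse(rhs_expr)

# The right-hand side can be evaluated with a data frame as its scope
toy <- data.frame(GDP = c(10, 20), Population = c(2, 4))  # hypothetical values
per_capita <- eval(rhs_expr, toy)

print(vars)
print(c(lhs_name, rhs_name))
print(per_capita)
```

The left-hand side gives the name for the new column and the right-hand side gives the expression to compute, which is the split the answers below build on.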

Answer 1

Would this work as a solution?

add_variable <- function(df, equation){
  df <- filter(df, variable %in% all.vars(equation))
  orig_vars <- unique(df$variable)
  df <- dcast(df, country + year ~ variable)

  df <- mutate_(df, rhs(equation))
  colnames(df)[ncol(df)] <- as.character(lhs(equation))

  df <- melt(df, id.vars = c("country", "year"))
  df <- filter(df, !variable %in% orig_vars)
}

result <- add_variable(the_data, GDPpC ~ GDP / Population)
result
  country year variable      value
1     CHN 2000    GDPpC 0.04885649
2     CHN 2010    GDPpC 2.62313658
3     CHN 2020    GDPpC 0.31685382
4     DEU 2000    GDPpC 0.80180998
5     DEU 2010    GDPpC 0.62642877
6     DEU 2020    GDPpC 0.97587188
7     USA 2000    GDPpC 0.26383912
8     USA 2010    GDPpC 1.01303516
9     USA 2020    GDPpC 0.69851501

Answer 2

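Many years later, I came here looking for information on using a formula inside dplyr::mutate, because I often find that more concise and clear. dplyr has of course grown and changed since 2016, including the fact that mutate_ is now deprecated. The good news is that if you are willing to use formula.tools, the solution is very compact, as shown below.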

library(dplyr)

# reproducible play_data

set.seed(2020)
play_data <- 
  data.frame(
    a = runif(20, 0.01, .5),
    b = runif(20, 0.02, .5),
    c = runif(20, 0.03, .5),
    d = runif(20, 0.04, .5),
    e = runif(20,1,5),
    f = runif(20,10,50)
  )

my_formula <- newvariable ~ a * b^c / d * log(e) - f

library(formula.tools)

mutate_by_formula <- function(df, equation){
  df %>% transmute( !!lhs(equation) := !!rhs(equation) )
}

mutate_by_formula(play_data, my_formula)
#>    newvariable
#> 1    -25.80405
#> 2    -20.48974
#> 3    -37.87361
#> 4    -46.52231
#> 5    -19.88420
#> 6    -16.49153
#> 7    -37.25498
#> 8    -41.02025
#> 9    -31.88338
#> 10   -42.17896
#> 11   -30.75905
#> 12   -10.42447
#> 13   -25.84538
#> 14   -46.08206
#> 15   -13.51940
#> 16   -25.30124
#> 17   -19.80536
#> 18   -26.42881
#> 19   -38.02190
#> 20   -30.51113

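For the OP's original example and specifics it is slightly more complicated, because the data has to be reshaped, but the basic idea is the same. The only twists are the dcast and the final select, which drops the variables used in the calculation.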

library(dplyr)
library(reshape2)
library(formula.tools)

set.seed(100)

the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
                       year    = c(2000, 2010, 2020),
                       variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
                       value = rnorm(27, 100, 100))

specific_function <- function(df, equation){
  df %>% 
    filter(variable %in% all.vars(equation)) %>%
    dcast(country + year ~ variable) %>%
    mutate(!!lhs(equation) := !!rhs(equation)) %>%
    select(-all.vars(equation)[2:length(all.vars(equation))])
}

specific_function(the_data, GDPpC ~ GDP / Population)
#>   country year      GDPpC
#> 1     CHN 2000 0.04885649
#> 2     CHN 2010 2.62313658
#> 3     CHN 2020 0.31685382
#> 4     DEU 2000 0.80180998
#> 5     DEU 2010 0.62642877
#> 6     DEU 2020 0.97587188
#> 7     USA 2000 0.26383912
#> 8     USA 2010 1.01303516
#> 9     USA 2020 0.69851501

Created on 2020-05-04 by the reprex package (v0.3.0)
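
Neither answer covers the question's last step: returning the new values in long form so they can be appended to the source data. Here is a rough base R sketch of the whole round trip under some assumptions: the long data has exactly the columns country, year, variable and value, and each country/year/variable combination occurs once. The name `add_variable_base` and the `toy` data are hypothetical, and rbind() is just one way to do the appending.

```r
add_variable_base <- function(df, equation) {
  new_name <- deparse(equation[[2]])  # name of the new variable (left-hand side)
  rhs_expr <- equation[[3]]           # expression to compute (right-hand side)
  inputs   <- all.vars(rhs_expr)      # variables the expression needs

  # Keep only the input variables, then spread them into one column each
  long <- df[df$variable %in% inputs, ]
  wide <- reshape(long, idvar = c("country", "year"),
                  timevar = "variable", direction = "wide")
  # reshape() names the new columns value.<variable>; strip the prefix
  names(wide) <- sub("^value\\.", "", names(wide))

  # Return only the newly computed variable, in the same long layout
  data.frame(country = wide$country, year = wide$year,
             variable = new_name, value = eval(rhs_expr, wide),
             stringsAsFactors = FALSE)
}

# Hypothetical toy data with two countries, two years, two variables
toy <- data.frame(country  = rep(c("USA", "DEU"), each = 4),
                  year     = rep(c(2000, 2010), times = 4),
                  variable = rep(rep(c("GDP", "Population"), each = 2), times = 2),
                  value    = c(10, 20, 2, 4, 30, 40, 3, 8),
                  stringsAsFactors = FALSE)

new_rows <- add_variable_base(toy, GDPpC ~ GDP / Population)
combined <- rbind(toy, new_rows)  # append the new variable to the source data
print(new_rows)
```

Because the result has the same four columns as the input, the append is a plain row bind and no column matching is needed.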

Make dplyr mutate work with a formula