使用dplyr

时间:2019-06-30 10:01:54

标签: r dplyr linear-regression lm

我正在尝试使用略有不同的公式一次运行许多多元回归。我在这里找到了一个很好的例子:https://rpubs.com/Marcelobn/many_regressions

但是,我无法完全为每个回归运行不同的公式...我正在寻求帮助来修复更新的代码或提供替代方法。预先谢谢你!

我正在使用R Studio,并在下面突出显示了我已经尝试过的内容(示例2)。


library(pwt)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(pander)


example <- pwt7.1

# This works great, and I still want an output like this:
multiple_growth <- example %>% select(country, openc, cg, cgdp) %>% 
  na.omit() %>%
  nest(-country) %>%
  mutate(model = map(data, ~lm(cgdp ~ openc + cg, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied) 

# BUT: it assumes each of the models for each country are the same
# I want to specify different formulas for each one
example2 <- example

# I have randomly assigned them for the purpose of this example
# In reality I get to this a more methodical way!
formula1 <- paste("cgdp", "~", "openc", "+", "cg", sep = " ")
formula2 <- paste("cgdp", "~", "openc", "+", "cg", "+", "currency", "+", "ppp", sep = " ")
formula3 <- paste("cgdp", "~", "pg", "+", "kg", "+", "openc", sep = " ")


randvar = sample(c(formula1,formula2,formula3), size = nrow(example2), replace = TRUE)
example2$regress = randvar




# Run model again with slight change to lm, and it kind of works
multiple_growth_2 <- example2 %>% select(country, openc, cg, cgdp, currency, ppp, pg, kg, regress) %>% 
  na.omit() %>%
  nest(-country, -regress) %>%
  mutate(model = map(data, ~lm(as.formula(regress), data = .)), # here is where i have tried to change it
         tidied = map(model, tidy)) %>%
  unnest(tidied) 

# This kind of works but it uses the first formula for ALL of the other countries... Any idea how to fix / an alternate method?

我想要一个类似的输出,但是对于每个变量,使用正确的公式进行回归,而不仅仅是所有列表中的第一个...

2 个答案:

答案 0 :(得分:1)

使用map2遍历公式和数据框:

multiple_growth_2 <- example2 %>%
    select(country, openc, cg, cgdp, currency, ppp, pg, kg, regress) %>% 
    na.omit() %>%
    nest(-country, -regress) %>% 
    mutate(model = map2(data, regress, ~ lm(as.formula(.y), data = .x)), 
           tidied = map(model, tidy)) %>%
    unnest(tidied)

您还应该从formula2中删除“货币”。您会嵌套在国家/地区上,因此大多数(如果不是全部)数据框将只包含一种货币,但是至少需要两个要素水平(即货币)才能形成对比。

答案 1 :(得分:1)

由于您正在对整个数据集进行模型训练,因此可以将公式(或模型)选择为单独的对象,并在以后使用tidyr::crossing添加它们:

library(pwt, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(tidyr)
library(purrr)
library(broom)

example <- as_tibble(pwt7.1)

formulas <- c(
        formula1 =  paste("cgdp", "~", "openc", "+", "cg", sep = " "),
        formula2 =  paste("cgdp", "~", "openc", "+", "cg", "+", "ppp", sep = " "),
        formula3 =  paste("cgdp", "~", "pg", "+", "kg", "+", "openc", sep = " ")
)

multiple_growth_2 <- example %>%
        select(country, openc, cg, cgdp, currency, ppp, pg, kg) %>% 
        na.omit() %>%
        nest(-country) %>%
        tidyr::crossing(. , formulas) %>% 
        mutate(model = pmap(list(x = data, y = formulas), function(x, y) lm( as.formula(y), data = x))
        )

# --- Use broom to

# evaluate models
multiple_growth_2 %>% 
        mutate(model_glance = map(model, glance) ) %>% 
        unnest(model_glance) %>% 
        select(-data, -model)
#> # A tibble: 570 x 13
#>    country formulas r.squared adj.r.squared sigma statistic  p.value    df
#>    <fct>   <chr>        <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>
#>  1 Afghan~ cgdp ~ ~    0.550         0.527   179.     23.2  2.56e- 7     3
#>  2 Afghan~ cgdp ~ ~    0.551         0.514   181.     15.1  1.39e- 6     4
#>  3 Afghan~ cgdp ~ ~    0.599         0.567   171.     18.5  1.74e- 7     4
#>  4 Albania cgdp ~ ~    0.519         0.494  1247.     20.5  9.17e- 7     3
#>  5 Albania cgdp ~ ~    0.746         0.726   917.     36.3  4.09e-11     4
#>  6 Albania cgdp ~ ~    0.626         0.596  1114.     20.7  4.93e- 8     4
#>  7 Algeria cgdp ~ ~    0.0754        0.0368 1916.      1.96 1.52e- 1     3
#>  8 Algeria cgdp ~ ~    0.824         0.813   844.     73.5  9.02e-18     4
#>  9 Algeria cgdp ~ ~    0.482         0.449  1449.     14.6  7.58e- 7     4
#> 10 Angola  cgdp ~ ~    0.581         0.559   971.     26.4  6.56e- 8     3
#> # ... with 560 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
#> #   BIC <dbl>, deviance <dbl>, df.residual <int>

# check coefficient
multiple_growth_2 %>%
        mutate(model_tidy = map(model, tidy) ) %>% 
        unnest(model_tidy)
#> # A tibble: 2,089 x 7
#>    country   formulas        term    estimate std.error statistic   p.value
#>    <fct>     <chr>           <chr>      <dbl>     <dbl>     <dbl>     <dbl>
#>  1 Afghanis~ cgdp ~ openc +~ (Inter~   255.       77.7      3.28    2.21e-3
#>  2 Afghanis~ cgdp ~ openc +~ openc      -5.03      1.09    -4.60    4.63e-5
#>  3 Afghanis~ cgdp ~ openc +~ cg         70.0      10.3      6.80    4.55e-8
#>  4 Afghanis~ cgdp ~ openc +~ (Inter~   230.      130.       1.78    8.38e-2
#>  5 Afghanis~ cgdp ~ openc +~ openc      -4.82      1.40    -3.45    1.41e-3
#>  6 Afghanis~ cgdp ~ openc +~ cg         72.7      15.3      4.76    2.92e-5
#>  7 Afghanis~ cgdp ~ openc +~ ppp        -1.88      7.79    -0.241   8.11e-1
#>  8 Afghanis~ cgdp ~ pg + kg~ (Inter~   452.      101.       4.46    7.38e-5
#>  9 Afghanis~ cgdp ~ pg + kg~ pg         -6.11      2.40    -2.54    1.53e-2
#> 10 Afghanis~ cgdp ~ pg + kg~ kg         64.2       9.67     6.63    8.76e-8
#> # ... with 2,079 more rows

# check individual prediction
multiple_growth_2 %>%
        mutate(model_augment = map(model, augment) ) %>% 
        unnest(model_augment)
#> # A tibble: 26,820 x 15
#>    country formulas  cgdp openc    cg .fitted .se.fit .resid   .hat .sigma
#>    <fct>   <chr>    <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#>  1 Afghan~ cgdp ~ ~  247.  21.7  5.28    515.    42.5  -267. 0.0562   176.
#>  2 Afghan~ cgdp ~ ~  241.  27.1  5.73    520.    39.3  -278. 0.0481   175.
#>  3 Afghan~ cgdp ~ ~  240.  32.9  6.11    517.    36.7  -277. 0.0419   176.
#>  4 Afghan~ cgdp ~ ~  273.  27.7  5.74    518.    39.1  -245. 0.0476   177.
#>  5 Afghan~ cgdp ~ ~  324.  28.9  5.36    485.    40.7  -160. 0.0517   180.
#>  6 Afghan~ cgdp ~ ~  363.  26.9  6.99    609.    36.2  -246. 0.0408   177.
#>  7 Afghan~ cgdp ~ ~  410.  28.1  6.60    576.    36.3  -167. 0.0409   179.
#>  8 Afghan~ cgdp ~ ~  441.  26.5  6.97    610.    36.4  -169. 0.0413   179.
#>  9 Afghan~ cgdp ~ ~  487.  24.7  7.08    626.    37.3  -139. 0.0434   180.
#> 10 Afghan~ cgdp ~ ~  505.  26.4  7.07    617.    36.4  -112. 0.0413   181.
#> # ... with 26,810 more rows, and 5 more variables: .cooksd <dbl>,
#> #   .std.resid <dbl>, ppp <dbl>, pg <dbl>, kg <dbl>

注意:我使用purrr::pmap是为了提供不同的答案(purrr::map2也可以完成!)。