modelr:使用重采样数据拟合多个模型

时间:2016-10-23 15:05:29

标签: r dplyr tidyr purrr

tidy model of data science (TM)中实施的modelr中,重新抽样数据使用list-columns进行整理:

bhathiya@bhathiya-x1:/$ curl -v -k -X OPTIONS https://localhost:8243/userinfo
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8243 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* found 697 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*    server certificate verification SKIPPED
*    server certificate status verification SKIPPED
*    common name: localhost (matched)
*    server certificate expiration date OK
*    server certificate activation date OK
*    certificate public key: RSA
*    certificate version: #3
*    subject: C=US,ST=CA,L=Mountain View,O=WSO2,CN=localhost
*    start date: Fri, 19 Feb 2010 07:02:26 GMT
*    expire date: Tue, 13 Feb 2035 07:02:26 GMT
*    issuer: C=US,ST=CA,L=Mountain View,O=WSO2,CN=localhost
*    compression: NULL
* ALPN, server did not agree to a protocol
> OPTIONS /userinfo HTTP/1.1
> Host: localhost:8243
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Accept: */*
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET
< Host: localhost:8243
< Access-Control-Allow-Headers: authorization,Access-Control-Allow-Origin,Content-Type,SOAPAction
< Date: Sun, 23 Oct 2016 14:43:27 GMT
< Transfer-Encoding: chunked
< 

可以为列表列library(modelr) library(tidyverse) # create the k-folds df_heights_resampled = heights %>% crossv_kfold(k = 10, id = "Resample ID") 中的每个训练数据集map建模,并通过train ping列表列{{}来计算性能指标1}}。

如果需要对多个模型进行此操作,则需要对每个模型重复此操作。

map

给出:

test

问题:

如果要探索的模型数量很大,这可能会非常麻烦。 # create a list of formulas formulas_heights = formulas( .response = ~ income, model1 = ~ height + weight + marital + sex, model2 = ~ height + weight + marital + sex + education ) # fit each of the models in the list of formulas df_heights_resampled = df_heights_resampled %>% mutate( model1 = map(train, function(train_data) { lm(formulas_heights[[1]], data = train_data) }), model2 = map(train, function(train_data) { lm(formulas_heights[[2]], data = train_data) }) ) # score the models on the test sets df_heights_resampled = df_heights_resampled %>% mutate( rmse1 = map2_dbl(.x = model1, .y = test, .f = rmse), rmse2 = map2_dbl(.x = model2, .y = test, .f = rmse) ) 提供> df_heights_resampled # A tibble: 10 × 7 train test `Resample ID` model1 model2 rmse1 rmse2 <list> <list> <chr> <list> <list> <dbl> <dbl> 1 <S3: resample> <S3: resample> 01 <S3: lm> <S3: lm> 58018.35 53903.99 2 <S3: resample> <S3: resample> 02 <S3: lm> <S3: lm> 55117.37 50279.38 3 <S3: resample> <S3: resample> 03 <S3: lm> <S3: lm> 49005.82 44613.93 4 <S3: resample> <S3: resample> 04 <S3: lm> <S3: lm> 55437.07 51068.90 5 <S3: resample> <S3: resample> 05 <S3: lm> <S3: lm> 48845.35 44673.88 6 <S3: resample> <S3: resample> 06 <S3: lm> <S3: lm> 58226.69 54010.50 7 <S3: resample> <S3: resample> 07 <S3: lm> <S3: lm> 56571.93 53322.41 8 <S3: resample> <S3: resample> 08 <S3: lm> <S3: lm> 46084.82 42294.50 9 <S3: resample> <S3: resample> 09 <S3: lm> <S3: lm> 59762.22 54814.55 10 <S3: resample> <S3: resample> 10 <S3: lm> <S3: lm> 45328.48 41882.79 函数,允许迭代多个模型(由多个公式表征),但似乎不允许模型中的modelr列表列以上。我假设fit_with函数族中的一个函数可以实现这一点(train?),但是无法弄清楚如何。

2 个答案:

答案 0 :(得分:3)

您可以使用maplazyeval::interp以编程方式构建呼叫。我很好奇是否有纯purrr解决方案,但问题是您要创建多个列,并且需要多次调用。也许purrr解决方案会创建另一个包含所有模型的列表列。

library(lazyeval)
model_calls <- map(formulas_heights, 
                   ~interp(~map(train, ~lm(form, data = .x)), form = .x))
score_calls <- map(names(model_calls), 
                   ~interp(~map2_dbl(.x = m, .y = test, .f = rmse), m = as.name(.x)))
names(score_calls) <- paste0("rmse", seq_along(score_calls))

df_heights_resampled %>% mutate_(.dots = c(model_calls, score_calls))
# A tibble: 10 × 7
            train           test `Resample ID`   model1   model2    rmse1    rmse2
           <list>         <list>         <chr>   <list>   <list>    <dbl>    <dbl>
1  <S3: resample> <S3: resample>            01 <S3: lm> <S3: lm> 44720.86 41452.07
2  <S3: resample> <S3: resample>            02 <S3: lm> <S3: lm> 54174.38 48823.03
3  <S3: resample> <S3: resample>            03 <S3: lm> <S3: lm> 56854.21 52725.62
4  <S3: resample> <S3: resample>            04 <S3: lm> <S3: lm> 53312.38 48797.48
5  <S3: resample> <S3: resample>            05 <S3: lm> <S3: lm> 61883.90 57469.17
6  <S3: resample> <S3: resample>            06 <S3: lm> <S3: lm> 55709.83 50867.26
7  <S3: resample> <S3: resample>            07 <S3: lm> <S3: lm> 53036.06 48698.07
8  <S3: resample> <S3: resample>            08 <S3: lm> <S3: lm> 55986.83 52717.94
9  <S3: resample> <S3: resample>            09 <S3: lm> <S3: lm> 51738.60 48006.74
10 <S3: resample> <S3: resample>            10 <S3: lm> <S3: lm> 45061.22 41480.35

答案 1 :(得分:0)

受到my own question的启发,我认为这个问题有类似的方法。

首先,定义一个函数,该函数可以在列表 - 列结构中获取数据和公式的参数,并使用输入估计模型:

est_model <- function(data, formula, ...) {
  map(list(data), formula, ...)[[1]]
}

然后直接估计每个CV折叠和公式对的多个模型:

library(gapminder)
library(tidyverse)
library(modelr)

cv_gm <- gapminder %>% 
  crossv_kfold(k = 10, id = "Resample ID")

# Assume 4 different formulae
formulae_tbl <- tibble::frame_data(
  ~model, ~fmla,
  1, ~lm(lifeExp ~ year, data = .),
  2, ~lm(lifeExp ~ year + pop, data = .),
  3, ~lm(lifeExp ~ year + gdpPercap, data = .),
  4, ~lm(lifeExp ~ year + pop + gdpPercap, data = .)
)

cv_gm_results <- cv_gm %>% 
  tidyr::crossing(formulae_tbl)

cv_gm_results <- cv_gm_results %>% 
  mutate(fit=map2(train, fmla, est_model),
         rmse=map2_dbl(fit, test, .f = rmse))

可以说,根据整洁的数据哲学,最好与cv_gm_results一起使用,但如果你想要它在原始问题中的形状(h / t this question):

cv_gm_results %>% 
  select(`Resample ID`, model, fit, rmse) %>% 
  gather(variable, value, fit, rmse) %>%
  unite(temp, variable, model, sep="") %>%
  spread(temp, value) %>% 
  mutate_at(.cols=vars(starts_with("rmse")), .funs=flatten_dbl)

# A tibble: 10 × 9
   `Resample ID`     fit1     fit2     fit3     fit4    rmse1    rmse2
           <chr>   <list>   <list>   <list>   <list>    <dbl>    <dbl>
1             01 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 11.32344 11.32201
2             02 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 11.34626 11.33175
3             03 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 11.62480 11.60221
4             04 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 10.80946 10.81421
5             05 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 11.52413 11.52384
6             06 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 12.10914 12.08134
7             07 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 11.97641 12.00809
8             08 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 12.30191 12.31489
9             09 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 11.96970 11.95617
10            10 <S3: lm> <S3: lm> <S3: lm> <S3: lm> 11.30289 11.30294
# ... with 2 more variables: rmse3 <dbl>, rmse4 <dbl>

<强>更新

事实证明,est_model()不是必需的,purrr提供符合​​我们目的的at_depth()

cv_gm_results <- cv_gm_results %>% 
  mutate(fit=map2(train, fmla, ~at_depth(.x, 0, .y)),
         rmse=map2_dbl(fit, test, .f = rmse))