在用Trycatch或安全包装时,扫帚无法处理嵌套的lm模型列表

时间:2019-08-05 10:42:50

标签: r dplyr tidyverse lm broom

我一直在尝试在大型数据集中拟合线性模型(每组约3千4百万个组)。由于我的数据集的绝对大小,在某些情况下,模型拟合将不可用。因此,我在lm模型拟合中遇到与NA相关的错误,例如:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  0 (non-NA) cases

当我尝试将lm包裹在TryCatch或Salely周围以在拟合过程中捕获并处理随机错误时,就会出现问题

到目前为止,我尝试了几种方法。在这里,我给您一个虚拟数据集:

为避免此错误,我尝试了几种方法。您可能会说,最简单的方法是对数据进行分组过滤,并排除所有充满NA的数据集(我也曾尝试这样做,但显然有些数据集存在其他问题,并导致我的模型拟合遇到错误)。

因此,由于我想拥有一个适用于maany类型数据的健壮流程,因此我试图编写一种相当通用且具有故障保护功能的东西。

此外,我希望使用multidplyr进行并行化,所以我采用了dplyr的方式。这是现在的问题。一旦绕过tryCatch或安全地绕过lm,即使以后我取消嵌套,扫帚:tidy也无法处理结果。 (这是一个很大的遗憾,因为它使我的生活变得很轻松,而不必一直都在编写代码)

情况1:简单的lm适合标准方式,再加上扫帚。一切都很好

library(tidyverse)

df.h= tibble(
  hour  =  factor(c("1_1","1_1","1_1")),
  price =  c(3.235536,3.205588, 3.235930),
  wind  =  c(-2.302585, 3.871201, 5.123964)
)

dfHour = df.h2 %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind, data = .))

dfHourCoef = broom::tidy(dfHour, fitHour)

引入故障数据集以在lm中产生错误:

library(tidyverse)

df.h2= tibble(
  hour  =  factor(c("1_1","1_1","1_1","1_2","1_2","1_2")),
  price =  c(3.235536,3.205588, 3.235930, 3.235536,3.205588, 3.235930),
  wind  =  c(-2.302585, 3.871201, 5.123964, NA, NA, NA)
)


dfHour2 = df.h2 %>% group_by(hour) %>%
  do(fitHour = tryCatch( lm(myy ~ myx, data = . ), error= function(e){return("FAILURE")} ) ) %>%
  filter(!is.character(fitHour)) # Exploit the fact that all good outputs are 
                                 # a list while faulty output is a character
                                 # to perform filtering

# get the coefficients by group in a tidy data_frame
dfHourCoef2 = broom::tidy(dfHour2, fitHour)

这会产生以下错误:

Error in .[[object]][[1]] : subscript out of bounds

方法三:安全包装以捕获错误消息


library(tidyverse)

df.h2= tibble(
  hour  =  factor(c("1_1","1_1","1_1","1_2","1_2","1_2")),
  price =  c(3.235536,3.205588, 3.235930, 3.235536,3.205588, 3.235930),
  wind  =  c(-2.302585, 3.871201, 5.123964, NA, NA, NA)
)

test_dataset_lm <- df.h2 %>%
  mutate_if(is.factor, droplevels) %>%    # this is used to exclude leftover
                                          # factor levels from previous 
                                          # processing in the flow
  group_by( hour ) %>%
  do(fitHour = safely(lm)(price ~ wind, data = .)) %>%
  unnest() %>%
  group_by(hour) %>%
  mutate(id = str_c("fitHour_", row_number() ) ) %>%  # Exploit that the 
                                                      # $error list is always 
                                                      # in the second position of every 
                                                      # output / modelfit trial
  spread(id, fitHour) %>%
  filter( fitHour_2 == "NULL" ) %>%
  rename(fitHour = fitHour_1) %>%
  select(-fitHour_2) %>% ungroup() %>%
  broom::tidy()

最新方法在计算上更加昂贵,并且在大型数据集中运行预计会更重

与tidy()结合使用时,此方法产生的错误是:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?

我怀疑所有错误消息都是由于对生成的嵌套列表的结构进行了整齐(和一目了然)而导致的,但是我只是想不出如何正确地使其等同于未包装的版本。

能否请您帮助我了解问题的根源?

谢谢!

1 个答案:

答案 0 :(得分:0)

这是修复safely方法的一种方法

library(tidyverse)
df.h2 %>%
  mutate_if(is.factor, droplevels) %>%    # this is used to exclude leftover
  # factor levels from previous 
  # processing in the flow
  group_by( hour ) %>%
  do(fitHour = safely(lm)(price ~ wind, data = .)) %>% 
  #Create a new column to check if 'result' in each fitHour element is missing/null
  mutate(Ind_null = map_lgl(fitHour['result'], is.null)) %>%
  filter(!Ind_null) %>%
  mutate(fit = list(tidy(fitHour[['result']]))) %>% 
  unnest(fit)

# A tibble: 2 x 6
   hour  term        estimate std.error statistic p.value
  <fct> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 1_1   (Intercept)  3.23      0.0162    200.    0.00319
2 1_1   wind        -0.00152   0.00411    -0.370 0.775  

更新 :(使用更短的路径)purrr::map_df

lm_safe <- safely(lm)
df.h2 %>%
mutate_if(is.factor, droplevels) %>%  
split( .$hour ) %>% 
map_df(~tidy(lm_safe(price ~ wind, data = .)[['result']]), .id = 'hour')