Question

我一直在尝试在大型数据集中拟合线性模型（每组约3千4百万个组）。由于我的数据集的绝对大小，在某些情况下，模型拟合将不可用。因此，我在lm模型拟合中遇到与NA相关的错误，例如：

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  0 (non-NA) cases

当我尝试将lm包裹在TryCatch或Salely周围以在拟合过程中捕获并处理随机错误时，就会出现问题

到目前为止，我尝试了几种方法。在这里，我给您一个虚拟数据集：

为避免此错误，我尝试了几种方法。您可能会说，最简单的方法是对数据进行分组过滤，并排除所有充满NA的数据集（我也曾尝试这样做，但显然有些数据集存在其他问题，并导致我的模型拟合遇到错误）。

因此，由于我想拥有一个适用于maany类型数据的健壮流程，因此我试图编写一种相当通用且具有故障保护功能的东西。

此外，我希望使用multidplyr进行并行化，所以我采用了dplyr的方式。这是现在的问题。一旦绕过tryCatch或安全地绕过lm，即使以后我取消嵌套，扫帚：tidy也无法处理结果。（这是一个很大的遗憾，因为它使我的生活变得很轻松，而不必一直都在编写代码）

情况1：简单的lm适合标准方式，再加上扫帚。一切都很好

library(tidyverse)

df.h= tibble(
  hour  =  factor(c("1_1","1_1","1_1")),
  price =  c(3.235536,3.205588, 3.235930),
  wind  =  c(-2.302585, 3.871201, 5.123964)
)

dfHour = df.h2 %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind, data = .))

dfHourCoef = broom::tidy(dfHour, fitHour)

引入故障数据集以在lm中产生错误：

library(tidyverse)

df.h2= tibble(
  hour  =  factor(c("1_1","1_1","1_1","1_2","1_2","1_2")),
  price =  c(3.235536,3.205588, 3.235930, 3.235536,3.205588, 3.235930),
  wind  =  c(-2.302585, 3.871201, 5.123964, NA, NA, NA)
)


dfHour2 = df.h2 %>% group_by(hour) %>%
  do(fitHour = tryCatch( lm(myy ~ myx, data = . ), error= function(e){return("FAILURE")} ) ) %>%
  filter(!is.character(fitHour)) # Exploit the fact that all good outputs are 
                                 # a list while faulty output is a character
                                 # to perform filtering

# get the coefficients by group in a tidy data_frame
dfHourCoef2 = broom::tidy(dfHour2, fitHour)

这会产生以下错误：

Error in .[[object]][[1]] : subscript out of bounds

方法三：安全包装以捕获错误消息


library(tidyverse)

df.h2= tibble(
  hour  =  factor(c("1_1","1_1","1_1","1_2","1_2","1_2")),
  price =  c(3.235536,3.205588, 3.235930, 3.235536,3.205588, 3.235930),
  wind  =  c(-2.302585, 3.871201, 5.123964, NA, NA, NA)
)

test_dataset_lm <- df.h2 %>%
  mutate_if(is.factor, droplevels) %>%    # this is used to exclude leftover
                                          # factor levels from previous 
                                          # processing in the flow
  group_by( hour ) %>%
  do(fitHour = safely(lm)(price ~ wind, data = .)) %>%
  unnest() %>%
  group_by(hour) %>%
  mutate(id = str_c("fitHour_", row_number() ) ) %>%  # Exploit that the 
                                                      # $error list is always 
                                                      # in the second position of every 
                                                      # output / modelfit trial
  spread(id, fitHour) %>%
  filter( fitHour_2 == "NULL" ) %>%
  rename(fitHour = fitHour_1) %>%
  select(-fitHour_2) %>% ungroup() %>%
  broom::tidy()

最新方法在计算上更加昂贵，并且在大型数据集中运行预计会更重

与tidy（）结合使用时，此方法产生的错误是：

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?

我怀疑所有错误消息都是由于对生成的嵌套列表的结构进行了整齐（和一目了然）而导致的，但是我只是想不出如何正确地使其等同于未包装的版本。

能否请您帮助我了解问题的根源？

谢谢！

Answer 1

这是修复safely方法的一种方法

library(tidyverse)
df.h2 %>%
  mutate_if(is.factor, droplevels) %>%    # this is used to exclude leftover
  # factor levels from previous 
  # processing in the flow
  group_by( hour ) %>%
  do(fitHour = safely(lm)(price ~ wind, data = .)) %>% 
  #Create a new column to check if 'result' in each fitHour element is missing/null
  mutate(Ind_null = map_lgl(fitHour['result'], is.null)) %>%
  filter(!Ind_null) %>%
  mutate(fit = list(tidy(fitHour[['result']]))) %>% 
  unnest(fit)

# A tibble: 2 x 6
   hour  term        estimate std.error statistic p.value
  <fct> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 1_1   (Intercept)  3.23      0.0162    200.    0.00319
2 1_1   wind        -0.00152   0.00411    -0.370 0.775

更新：（使用更短的路径）purrr::map_df

lm_safe <- safely(lm)
df.h2 %>%
mutate_if(is.factor, droplevels) %>%  
split( .$hour ) %>% 
map_df(~tidy(lm_safe(price ~ wind, data = .)[['result']]), .id = 'hour')

在用Trycatch或安全包装时，扫帚无法处理嵌套的lm模型列表

1 个答案: