我一直在尝试在大型数据集中拟合线性模型(每组约3千4百万个组)。由于我的数据集的绝对大小,在某些情况下,模型拟合将不可用。因此,我在lm模型拟合中遇到与NA相关的错误,例如:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
当我尝试将lm包裹在TryCatch或Salely周围以在拟合过程中捕获并处理随机错误时,就会出现问题
到目前为止,我尝试了几种方法。在这里,我给您一个虚拟数据集:
为避免此错误,我尝试了几种方法。您可能会说,最简单的方法是对数据进行分组过滤,并排除所有充满NA的数据集(我也曾尝试这样做,但显然有些数据集存在其他问题,并导致我的模型拟合遇到错误)。
因此,由于我想拥有一个适用于maany类型数据的健壮流程,因此我试图编写一种相当通用且具有故障保护功能的东西。
此外,我希望使用multidplyr进行并行化,所以我采用了dplyr的方式。这是现在的问题。一旦绕过tryCatch或安全地绕过lm,即使以后我取消嵌套,扫帚:tidy也无法处理结果。 (这是一个很大的遗憾,因为它使我的生活变得很轻松,而不必一直都在编写代码)
情况1:简单的lm适合标准方式,再加上扫帚。一切都很好
library(tidyverse)
df.h= tibble(
hour = factor(c("1_1","1_1","1_1")),
price = c(3.235536,3.205588, 3.235930),
wind = c(-2.302585, 3.871201, 5.123964)
)
dfHour = df.h2 %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind, data = .))
dfHourCoef = broom::tidy(dfHour, fitHour)
引入故障数据集以在lm中产生错误:
library(tidyverse)
df.h2= tibble(
hour = factor(c("1_1","1_1","1_1","1_2","1_2","1_2")),
price = c(3.235536,3.205588, 3.235930, 3.235536,3.205588, 3.235930),
wind = c(-2.302585, 3.871201, 5.123964, NA, NA, NA)
)
dfHour2 = df.h2 %>% group_by(hour) %>%
do(fitHour = tryCatch( lm(myy ~ myx, data = . ), error= function(e){return("FAILURE")} ) ) %>%
filter(!is.character(fitHour)) # Exploit the fact that all good outputs are
# a list while faulty output is a character
# to perform filtering
# get the coefficients by group in a tidy data_frame
dfHourCoef2 = broom::tidy(dfHour2, fitHour)
这会产生以下错误:
Error in .[[object]][[1]] : subscript out of bounds
方法三:安全包装以捕获错误消息
library(tidyverse)
df.h2= tibble(
hour = factor(c("1_1","1_1","1_1","1_2","1_2","1_2")),
price = c(3.235536,3.205588, 3.235930, 3.235536,3.205588, 3.235930),
wind = c(-2.302585, 3.871201, 5.123964, NA, NA, NA)
)
test_dataset_lm <- df.h2 %>%
mutate_if(is.factor, droplevels) %>% # this is used to exclude leftover
# factor levels from previous
# processing in the flow
group_by( hour ) %>%
do(fitHour = safely(lm)(price ~ wind, data = .)) %>%
unnest() %>%
group_by(hour) %>%
mutate(id = str_c("fitHour_", row_number() ) ) %>% # Exploit that the
# $error list is always
# in the second position of every
# output / modelfit trial
spread(id, fitHour) %>%
filter( fitHour_2 == "NULL" ) %>%
rename(fitHour = fitHour_1) %>%
select(-fitHour_2) %>% ungroup() %>%
broom::tidy()
最新方法在计算上更加昂贵,并且在大型数据集中运行预计会更重
与tidy()结合使用时,此方法产生的错误是:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?
我怀疑所有错误消息都是由于对生成的嵌套列表的结构进行了整齐(和一目了然)而导致的,但是我只是想不出如何正确地使其等同于未包装的版本。
能否请您帮助我了解问题的根源?
谢谢!
答案 0 :(得分:0)
这是修复safely
方法的一种方法
library(tidyverse)
df.h2 %>%
mutate_if(is.factor, droplevels) %>% # this is used to exclude leftover
# factor levels from previous
# processing in the flow
group_by( hour ) %>%
do(fitHour = safely(lm)(price ~ wind, data = .)) %>%
#Create a new column to check if 'result' in each fitHour element is missing/null
mutate(Ind_null = map_lgl(fitHour['result'], is.null)) %>%
filter(!Ind_null) %>%
mutate(fit = list(tidy(fitHour[['result']]))) %>%
unnest(fit)
# A tibble: 2 x 6
hour term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1_1 (Intercept) 3.23 0.0162 200. 0.00319
2 1_1 wind -0.00152 0.00411 -0.370 0.775
更新 :(使用更短的路径)purrr::map_df
lm_safe <- safely(lm)
df.h2 %>%
mutate_if(is.factor, droplevels) %>%
split( .$hour ) %>%
map_df(~tidy(lm_safe(price ~ wind, data = .)[['result']]), .id = 'hour')