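I have a complex dplyr statement that used to run successfully on a data frame with 5,200,000 rows. Since writing the code I have updated R from 3.1.2 to 3.2.0, and I am currently using Revolution R Open (RRO) 3.2.0.
Running the same statement on the same data now makes RStudio fail with the error: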
fatal error - R Session Aborted
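The error occurs under both RRO 3.2.0 and regular R 3.2.0.
I am also unsure whether the window functions I use (lag and row_number) are the culprit. I am mostly interested in finding out what makes the statement crash R / RStudio rather than in rewriting the dplyr statement, but I am happy to receive tips on better dplyr practice :-)
I have looked at the Stack Overflow questions "dplyr crash when using lagged difference computation" and "dplyr crashes when using summarise with segfault error", but I don't think they relate to my problem.
Using the slice operator, I can run the dplyr statement successfully on one half of the data and equally successfully on the other half, so I don't believe the data itself is the problem.
I have been able to reproduce the error on a data frame built from sample data.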
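Because slice(1:2600000) cuts by row position, a group whose rows straddle the midpoint can end up divided between the two halves. As a minimal sketch (not part of the original test, and using a small hypothetical stand-in data frame called DF_small), the data can instead be split on ID so that every group stays whole:

```r
library(dplyr)

# Hypothetical small stand-in for the real data frame
set.seed(1)
DF_small <- data.frame(ID = floor(runif(1000, 1, 600)))

# Pick the ID that sits in the middle of the sorted unique IDs
ids <- sort(unique(DF_small$ID))
cut_id <- ids[ceiling(length(ids) / 2)]

# Split on ID rather than on row position, so no group is cut in two
first_half  <- DF_small %>% filter(ID <= cut_id)
second_half <- DF_small %>% filter(ID > cut_id)

# Every ID lands in exactly one half and no rows are lost
stopifnot(length(intersect(first_half$ID, second_half$ID)) == 0)
stopifnot(nrow(first_half) + nrow(second_half) == nrow(DF_small))
```

If each ID-based half still runs cleanly while the full data crashes, that would point more towards a size-related issue than towards a particular group.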
Here is the code that generates the sample data frame DF:
library(dplyr)
# create an ID column with some containing duplicate values
set.seed(1)
DF <- data.frame(ID = floor(runif(5200000, 1,3000000)))
# Order data frame by ID, YEAR
DF <- tbl_df(DF) %>%
group_by(ID) %>%
mutate(YEAR = row_number()) %>%
arrange(ID, YEAR)
# create an event variable which is set to 0 80% of the time, 1 10% of the time, etc.
DF$EVENT <- sample(0:5,5200000, replace = TRUE, prob = c(0.8, 0.1, 0.05, 0.025, 0.015, 0.01))
# create a vector of unique IDs
unique_IDs <- unique(DF$ID)
# take a 10% sample of the unique IDs
init_set <- sample(unique_IDs, replace = FALSE, size = round(length(unique_IDs)*0.1) )
# create an index of the 10% sample IDs
init.idx <- DF$ID %in% init_set
# create an initialisation state variable with Y and N values
DF$INIT_STATE <- as.factor(ifelse(init.idx,"Y","N"))
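To narrow down the row count at which the crash starts, it may help to wrap the generator above in a function that takes the size as a parameter. The following is only a sketch: make_sample_df, n_rows, n_ids and seed are hypothetical names, and the default ID range simply keeps roughly the same ID-to-row ratio as the original (3,000,000 for 5,200,000 rows).

```r
library(dplyr)

# Hypothetical helper: same construction as above, with a configurable size
make_sample_df <- function(n_rows, n_ids = ceiling(n_rows * 3 / 5.2), seed = 1) {
  set.seed(seed)
  # ID column with some duplicate values, plus a per-ID year counter
  df <- data.frame(ID = floor(runif(n_rows, 1, n_ids))) %>%
    group_by(ID) %>%
    mutate(YEAR = row_number()) %>%
    ungroup() %>%
    arrange(ID, YEAR)
  # event variable: 0 about 80% of the time, 1 about 10%, and so on
  df$EVENT <- sample(0:5, n_rows, replace = TRUE,
                     prob = c(0.8, 0.1, 0.05, 0.025, 0.015, 0.01))
  # flag a 10% sample of the unique IDs with INIT_STATE "Y"
  ids <- unique(df$ID)
  init_set <- sample(ids, size = round(length(ids) * 0.1), replace = FALSE)
  df$INIT_STATE <- factor(ifelse(df$ID %in% init_set, "Y", "N"),
                          levels = c("N", "Y"))
  df
}

small <- make_sample_df(10000)
```

Calling make_sample_df with increasing sizes (for example 1e6, 2e6, 4e6) and running the statement on each result would give a rough threshold for the failure.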
The dplyr statement I am running is as follows:
tbl_df(DF) %>%
select(ID, YEAR, EVENT, INIT_STATE) %>%
# slice(1:2600000) %>%
group_by(ID) %>% # group by ID to control window functions
arrange(ID, YEAR) %>% # sort by ID, YEAR (just to be sure, may not be needed)
mutate(event_lag = lag(EVENT) # add attr which shifts the event number by a lag of 1 (YEAR_COUNTER is set to zero in the year after the event)
, event_lag = ifelse(is.na(event_lag), 0, event_lag) ) %>% # first lag in the ID group is NA, this sets it to 0
mutate(i = cumsum(ifelse(event_lag, 1, 0))) %>% # create cumulative count of lagged number of events (used for grouping)
group_by(i, add = TRUE) %>% # now add another clause to the group by
mutate(row_rank = row_number() # row_number is a counter that restarts in every group (ID and i)
, year_ini = ifelse(i == 0 & INIT_STATE == "N", 5, 0) # add attribute that determines whether YEAR_COUNTER starts at 5 yrs or 0 yrs
, YEAR_COUNTER = year_ini + row_rank - 1) %>% # YEAR_COUNTER is the initial value plus the row counter; the -1 starts the counter from 0
select(-(event_lag:year_ini))
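I have added a comment to each line of the dplyr statement to indicate what each step is for.
A successful run on half of the data looks like this: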
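Since tips on dplyr practice are welcome, here is a more compact sketch of the same logic on a tiny hypothetical data frame (toy). It is not a fix for the crash. It assumes that lag() accepts a default argument, which replaces the separate ifelse(is.na(...)) step, and it regroups with an explicit group_by(ID, i) instead of adding to the existing grouping.

```r
library(dplyr)

# Hypothetical toy data: two IDs, already in the same shape as DF
toy <- data.frame(
  ID = c(1, 1, 1, 1, 1, 3, 3, 3),
  YEAR = c(1, 2, 3, 4, 5, 1, 2, 3),
  EVENT = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 2L),
  INIT_STATE = factor(rep("N", 8), levels = c("N", "Y"))
)

res <- toy %>%
  arrange(ID, YEAR) %>%
  group_by(ID) %>%
  mutate(event_lag = lag(EVENT, default = 0L),  # first row of each ID gets 0 instead of NA
         i = cumsum(event_lag != 0)) %>%        # increments in the year after each event
  group_by(ID, i) %>%                           # counter restarts within each (ID, i) block
  mutate(YEAR_COUNTER = ifelse(i == 0 & INIT_STATE == "N", 5, 0) + row_number() - 1) %>%
  ungroup() %>%
  select(-event_lag)

res$YEAR_COUNTER
# 5 0 1 2 3 5 6 7
```

For ID 1 the event in year 1 resets the counter from year 2 onwards, while ID 3 counts up from 5 because it has no earlier event.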
Source: local data frame [2,600,000 x 6]
Groups: ID, i

   i ID YEAR EVENT INIT_STATE YEAR_COUNTER
1  0  1    1     1          N            5
2  1  1    2     0          N            0
3  1  1    3     0          N            1
4  1  1    4     0          N            2
5  1  1    5     0          N            3
6  0  2    1     0          N            5
7  0  3    1     0          N            5
8  0  3    2     0          N            6
9  0  3    3     2          N            7
10 0  4    1     0          N            5
.. . ..  ...   ...        ...          ...
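In addition to the session info below: the server has 192 GB of RAM, and I did not see any significant spike in memory usage while the dplyr statement was running.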
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.1
loaded via a namespace (and not attached):
[1] lazyeval_0.1.10 magrittr_1.5 assertthat_0.1 parallel_3.2.0 DBI_0.3.1 tools_3.2.0 Rcpp_0.11.6
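Answer 0 (score: 1)
I have just updated to RRO 3.2.1 and dplyr 0.4.2, and this seems to have resolved the problem, which is good news. Thanks to anyone who looked at this question.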
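A quick way to check whether a session already runs at least the versions mentioned above is sketched below. The variable names r_ok and dplyr_ok are only illustrative, and the check assumes nothing beyond the version numbers reported in this answer.

```r
# Compare the running R version and the installed dplyr version
# against the versions reported to work in this answer
r_ok <- getRversion() >= "3.2.1"
dplyr_ok <- requireNamespace("dplyr", quietly = TRUE) &&
  packageVersion("dplyr") >= "0.4.2"

cat("R >= 3.2.1:     ", r_ok, "\n")
cat("dplyr >= 0.4.2: ", dplyr_ok, "\n")
```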