On a large data frame

Time: 2015-08-25 16:29:45

Tags: r dplyr

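I have a complex dplyr code block that ran successfully on a data frame with 5,200,000 rows. Since writing the code, I have updated R from 3.1.2 to 3.2.0, and I am currently using Revolution R Open (RRO) 3.2.0.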

Running the code block on the same data now causes an RStudio error:

fatal error - R Session Aborted

The error occurs under both RRO 3.2.0 and standard R 3.2.0.

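I am also not sure whether the window functions I use (lag and row_number) are the culprit. I am most interested in finding out what causes the statement to crash R / RStudio, rather than in rewriting the dplyr statement, but I am happy to receive tips on better dplyr practice :-)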

I have looked at the following Stack Overflow questions: "dplyr crash when using lagged difference computation" and "dplyr crashes when using summarise with segfault error", but I don't think they are related to my question.

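Using the slice operator, I can run the dplyr statement successfully on one half of the data, and equally successfully on the other half, so I don't believe the data is the problem.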
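One caveat with halving by row position: a cut at a fixed row can split the rows of a single ID across the two halves, which changes what the grouped window functions see. The following is a minimal base R sketch of splitting by ID instead, so that every ID stays whole. The helper name `split_by_id` and the toy data are assumptions made for illustration and are not part of the original code.

```r
# Hypothetical helper: split a data frame into chunks without splitting any ID.
split_by_id <- function(df, n_chunks = 2) {
  ids <- sort(unique(df$ID))
  # assign each unique ID to a chunk, with roughly equal numbers of IDs per chunk
  chunk_of_id <- cut(seq_along(ids), breaks = n_chunks, labels = FALSE)
  # look up the chunk of every row through its ID
  split(df, chunk_of_id[match(df$ID, ids)])
}

# Toy data (an assumption, much smaller than the real DF)
toy <- data.frame(ID   = c(1, 1, 2, 3, 3, 3, 4),
                  YEAR = c(1, 2, 1, 1, 2, 3, 1))

chunks <- split_by_id(toy, n_chunks = 2)

# Each ID appears in exactly one chunk
lapply(chunks, function(ch) unique(ch$ID))
```

Each element of `chunks` could then be passed through the dplyr statement separately, which narrows down whether the crash depends on the total row count rather than on particular rows.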

I was able to reproduce the error on a data frame built from sample data.

Here is the code that generates the sample data frame DF:

library(dplyr)
# create an ID column with some containing duplicate values
set.seed(1)
DF <- data.frame(ID = floor(runif(5200000, 1,3000000)))

# Order data frame by ID, YEAR
DF <- tbl_df(DF) %>%
group_by(ID) %>%
mutate(YEAR = row_number()) %>%
arrange(ID, YEAR)     

# create an event variable which is set to 0 80% of the time, 1 10% of the time, etc.
DF$EVENT <- sample(0:5,5200000, replace = TRUE, prob = c(0.8, 0.1, 0.05, 0.025, 0.015, 0.01))

# create a vector of unique IDs
unique_IDs <- unique(DF$ID)
# take a 10% sample of the unique IDs
init_set <- sample(unique_IDs, replace = FALSE, size =  round(length(unique_IDs)*0.1) )
# create an index of the 10% sample IDs
init.idx <- DF$ID %in% init_set

# create an initialisation state variable with Y and N values
DF$INIT_STATE <- as.factor(ifelse(init.idx,"Y","N"))

The dplyr statement I am running looks like this:

tbl_df(DF) %>%     
    select(ID, YEAR, EVENT, INIT_STATE) %>%
    # slice(1:2600000) %>%
    group_by(ID) %>%                                                    # group by ID to control window functions
    arrange(ID, YEAR) %>%                                               # sort by ID, YEAR (just to be sure, may not be needed)
    mutate(event_lag = lag(EVENT)                                       # add attr which shifts the event number by a lag of 1 (YEAR_COUNTER is set to zero in the year after the event)
           , event_lag = ifelse(is.na(event_lag), 0, event_lag) ) %>%   # first lag in the ID group is NA, this sets it to 0
    mutate(i = cumsum(ifelse(event_lag, 1, 0))) %>%                     # create cumulative count of lagged number of events (used for grouping)
    group_by(i, add = TRUE) %>%                                         # now add another clause to the group by 
    mutate(row_rank = row_number()                                      # row_number is a counter that restarts in every group (ID and i)
           , year_ini = ifelse(i == 0 & INIT_STATE == "N", 5, 0)        # add attribute that determines if the YEAR_COUNTER starts at 5 yrs or 0 yrs
           , YEAR_COUNTER = year_ini + row_rank - 1) %>%                # YEAR_COUNTER is the sum of the initialisation value and the row counter; -1 starts the counter from 0
    select(-(event_lag:year_ini))  

I have added a comment to every line of the dplyr statement to indicate what each step is for.
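For cross-checking, the same window logic can be sketched in base R with `ave()`, which avoids dplyr entirely and so may help tell whether a crash is specific to dplyr. This is a re-expression under assumptions, not the code from the question: the toy data frame is invented, and it is assumed to be sorted by ID and YEAR already.

```r
# Toy data (an assumption): two IDs, already sorted by ID and YEAR
toy <- data.frame(
  ID         = c(1, 1, 1, 1, 2, 2),
  YEAR       = c(1, 2, 3, 4, 1, 2),
  EVENT      = c(1, 0, 0, 2, 0, 0),
  INIT_STATE = c("N", "N", "N", "N", "Y", "Y")
)

# EVENT shifted down by one row within each ID; the first row of each ID gets 0
toy$event_lag <- ave(toy$EVENT, toy$ID,
                     FUN = function(x) c(0, head(x, -1)))

# running count of lagged events within each ID (used as a sub-group key)
toy$i <- ave(as.integer(toy$event_lag != 0), toy$ID, FUN = cumsum)

# row counter that restarts for every (ID, i) combination
toy$row_rank <- ave(seq_len(nrow(toy)), toy$ID, toy$i, FUN = seq_along)

# start at 5 only in the first segment (i == 0) of an ID with INIT_STATE "N"
toy$year_ini <- ifelse(toy$i == 0 & toy$INIT_STATE == "N", 5, 0)
toy$YEAR_COUNTER <- toy$year_ini + toy$row_rank - 1

toy[, c("ID", "YEAR", "EVENT", "INIT_STATE", "i", "YEAR_COUNTER")]
# YEAR_COUNTER is 5 0 1 2 for ID 1 and 0 1 for ID 2
```

On ID 1 the counter starts at 5, resets to 0 in the year after the event, and then counts up, which matches the pattern described in the comments of the dplyr statement.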

A successful run on half of the data looks like this:

Source: local data frame [2,600,000 x 6]
Groups: ID, i

   i ID YEAR EVENT INIT_STATE YEAR_COUNTER
1  0  1    1     1          N            5
2  1  1    2     0          N            0
3  1  1    3     0          N            1
4  1  1    4     0          N            2
5  1  1    5     0          N            3
6  0  2    1     0          N            5
7  0  3    1     0          N            5
8  0  3    2     0          N            6
9  0  3    3     2          N            7
10 0  4    1     0          N            5
.. . ..  ...   ...        ...          ...

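In addition to the session info below: the server has 192 GB of RAM, and I do not see any significant spike in memory usage while the dplyr statement is running.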

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.1

loaded via a namespace (and not attached):
[1] lazyeval_0.1.10 magrittr_1.5    assertthat_0.1  parallel_3.2.0  DBI_0.3.1       tools_3.2.0     Rcpp_0.11.6 

1 Answer:

Answer 0: (score: 1)

I have just updated to RRO 3.2.1 and dplyr 0.4.2, and this seems to have solved the problem, which is good news. Thanks to anyone who looked at this question.
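To see whether a given installation is at or beyond the versions mentioned in this answer, a minimal sketch follows. The helper name `has_min_version` is an assumption made for illustration; `getRversion()` and `packageVersion()` are standard base and utils functions.

```r
# Hypothetical helper: TRUE when an installed package is at least `minimum`
has_min_version <- function(pkg, minimum) {
  utils::packageVersion(pkg) >= package_version(minimum)
}

# R itself: is it at least the 3.2.1 release mentioned above?
r_ok <- getRversion() >= "3.2.1"

# dplyr: checked only if it is installed, to avoid an error from packageVersion()
dplyr_ok <- requireNamespace("dplyr", quietly = TRUE) &&
  has_min_version("dplyr", "0.4.2")

cat("R >= 3.2.1:", r_ok, "\n")
cat("dplyr >= 0.4.2:", dplyr_ok, "\n")
```

Passing both checks only confirms the versions; it does not by itself show that the crash is gone on a particular machine.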