Segfault in R using the reshape2 package and dcast

Asked: 2013-03-05 18:31:48

Tags: r segmentation-fault reshape2

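When I try to reshape a particular data frame with dcast (from the reshape2 package), RStudio crashes. I found that the crash actually happens in R itself, so I ran my cast code in R.app and got the error type this site is named after: Error: segfault from C stack overflow. With help from Google and SO, I learned that this is a memory access error.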

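OK, I got that far, but I don't know where to go from here. I can't provide a truly reproducible example, because my data frame has about 558,000 rows and the problem does not occur with small toy examples. For instance, even when I take a 50,000-row subset of the data, dcast works fine. Could a specific row of data be causing the problem? If so, can anyone suggest what kinds of features to look for that might cause the type of error I'm getting?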

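Here is a subset of the data frame I'm working with (with dummy values for some variables), followed by the cast call I'm using. I've also included this small piece of data as dput output below, in case it helps to play with it. The actual data set contains about 700 prog values, 15 prog1 values, and 5 fa.type values.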

  id        term   yr    nslds acad.lev    prog            prog1 fa.type amount
1  1   Fall 2009 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
2  1 Spring 2010 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
3  2   Fall 2009 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
4  2 Spring 2010 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
5  3   Fall 2007 2008 Graduate Graduate  loan 3    Stafford Loan    Loan   4250
6  3   Fall 2007 2008 Graduate Graduate grant 1 University Grant   Grant   1707

fa.wide = dcast(id + term + yr + nslds + acad.lev ~ prog1 + fa.type , data=fa, value.var="amount", fun.aggregate=sum)

fa = structure(list(id = c(1, 1, 2, 2, 3, 3), term = structure(c(7L, 
8L, 7L, 8L, 1L, 1L), .Label = c("Fall 2007", "Spring 2008", "Summer 2008", 
"Fall 2008", "Spring 2009", "Summer 2009", "Fall 2009", "Spring 2010", 
"Summer 2010", "Fall 2010", "Spring 2011", "Summer 2011", "Fall 2011", 
"Spring 2012", "Summer 2012", "Fall 2012", "Spring 2013"), class = c("ordered", 
"factor")), yr = c(2010L, 2010L, 2010L, 2010L, 2008L, 2008L), 
    nslds = structure(c(7L, 7L, 7L, 7L, 7L, 7L), .Label = c("1st Year, Never Attended", 
    "1st Year, Previously Attended", "2nd Year", "3rd Year", 
    "4th Year", "5th Year+", "Graduate"), class = c("ordered", 
    "factor")), acad.lev = structure(c(6L, 6L, 6L, 6L, 6L, 6L
    ), .Label = c("Freshman", "Sophomore", "Junior", "Senior", 
    "PB Undergrad", "Graduate"), class = c("ordered", "factor"
    )), prog = c("loan 1", "loan 1", "loan 2", "loan 2", "loan 3", 
    "grant 1"), prog1 = c("Other Loans", "Other Loans", "Stafford Loan", 
    "Stafford Loan", "Stafford Loan", "University Grant"), fa.type = structure(c(3L, 
    3L, 3L, 3L, 3L, 2L), .Label = c("Athletic", "Grant", "Loan", 
    "Scholarship", "Waiver", "Work/Study"), class = "factor"), 
    amount = c(5000, 5000, 8781, 8781, 4250, 1707)), .Names = c("id", 
"term", "yr", "nslds", "acad.lev", "prog", "prog1", "fa.type", 
"amount"), row.names = c(NA, 6L), class = "data.frame")

3 answers:

Answer 0 (score: 7)

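This is not an answer, but a simple (nonsensical) reproducible example that would not fit in a comment. You can recreate the error with this simple example (it does so on my MacBook Pro).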

require(reshape2)
n = 1448
df <- data.frame( Student = rep( 1:n , each = 2 ) , Grade = sample( 100 , n*2 , repl = TRUE ) )
df2 <- dcast( df , Student ~ Student , value.var = "Grade" , sum )
Error: segfault from C stack overflow

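The error occurs at the boundary n = 1448; that is, it does not occur for n = 1447 or below. The error seems to come from split_indices in split-numeric.c of the plyr package. It may have something to do with the number of grouping levels being assigned to an (unsigned?) integer value, which would cause a memory access error if the number of groups exceeds 32767, but to be honest I'm clutching at straws now.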
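To get a feel for the sizes involved, the sketch below counts the groups on each side of the formula and the number of cells in the wide result for the example above. It uses base R only; `count_groups` is a hypothetical helper, not a reshape2 or plyr function, and the numbers it prints do not by themselves confirm the cause suggested above.

```r
# Hypothetical helper (not part of reshape2 or plyr): number of distinct
# combinations of the given columns, i.e. the number of groups they define.
count_groups <- function(data, vars) {
  nrow(unique(data[vars]))
}

set.seed(1)
n <- 1448
df <- data.frame(Student = rep(1:n, each = 2),
                 Grade = sample(100, n * 2, replace = TRUE))

row_groups <- count_groups(df, "Student")  # left-hand side of Student ~ Student
col_groups <- count_groups(df, "Student")  # right-hand side of the formula
cells <- row_groups * col_groups           # cells in the wide result

print(c(rows = row_groups, cols = col_groups, cells = cells))
```

The same helper could be pointed at the question's data, for example `count_groups(fa, c("id", "term", "yr", "nslds", "acad.lev"))` for the row side and `count_groups(fa, c("prog1", "fa.type"))` for the column side, to see how large the full cast would be.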

If anyone cannot recreate this error, my sessionInfo() is:

R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] reshape2_1.2.2

loaded via a namespace (and not attached):
[1] plyr_1.8      stringr_0.6.2

Interestingly, if I run the df2 <- command again after getting the first error, R crashes completely and I get an error report generated by the operating system.

Answer 1 (score: 1)

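I ran into the same problem when using dcast from the reshape2 package to convert a long table to a wide one. I found the solution in this post: "plyr split_indices function crashes for long vectors". Specifically, you can download split_numeric.c and loop-apply.c from https://github.com/hadley/plyr/tree/master/src and use them in a local copy of the plyr source. Then uninstall the plyr package from the R console, and finally reinstall the package locally: install.packages('/path/to/source', repos = NULL, type = 'source').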

This solved my problem; I hope it helps.
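The uninstall and reinstall steps above can be sketched as a small function. This is only an illustration: `reinstall_plyr` and `src_dir` are hypothetical names, the function assumes the patched C files are already in place in that directory's src folder, and installing from source needs a working compiler toolchain.

```r
# Sketch only: `src_dir` is a hypothetical path to a local copy of the plyr
# source whose src/ files have already been replaced with the patched ones.
reinstall_plyr <- function(src_dir) {
  # Fail early if the source directory is missing, before removing anything.
  stopifnot(dir.exists(src_dir))
  # Remove the currently installed plyr, if there is one.
  if ("plyr" %in% rownames(installed.packages())) {
    remove.packages("plyr")
  }
  # Build and install from the local source tree rather than from a repository.
  install.packages(src_dir, repos = NULL, type = "source")
}

# Not run here; the path is a placeholder:
# reinstall_plyr("/path/to/source")
```

Checking the directory first means a mistyped path stops the function before the installed package is removed.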

Answer 2 (score: 0)

To close out this old question: this was a bug, and it has been fixed as described in this GitHub issue.
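One rough way to check whether an installation predates the fix is to compare the installed plyr version with the one reported in the sessionInfo() output above (1.8). The thread does not say which release contains the fix, so the sketch below only tests "newer than 1.8"; `is_newer` is a hypothetical helper.

```r
# Hypothetical helper: TRUE if version string `installed` is strictly newer
# than `affected`. The default "1.8" is the plyr version from the thread.
is_newer <- function(installed, affected = "1.8") {
  utils::compareVersion(installed, affected) > 0
}

# Only query plyr if it is actually installed.
if (requireNamespace("plyr", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("plyr"))
  message("plyr ", installed, " newer than 1.8: ", is_newer(installed))
}
```

A TRUE result here only shows that the installed version is later than the one known to crash, not that it contains the fix.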