Question

Calling the read.table() function (on a CSV file) as follows:

  download.file(url, destfile = file, mode = "w")
  conn <- gzcon(bzfile(file, open = "r"))
  try(fileData <- read.table(conn, sep = ",", row.names = NULL), silent = FALSE)

produces the following error:

Error in pushBack(c(lines, lines), file) : 
  can only push back on text-mode connections

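I tried to explicitly "wrap" the connection via tConn <- textConnection(readLines(conn)) [and then, of course, pass tConn instead of conn to read.table()], but this makes the code run extremely slowly and eventually hangs the R process (I have to restart R).

UPDATE (which shows once again how useful it is to try to explain your problem to others!):

As I was writing this, I decided to go back to the documentation and read about gzcon() again. I had thought that it not only decompresses the bzip2 file, but also "labels" it as text. Then I realized that this is a ridiculous assumption: I know that it is a text (CSV) file inside a bzip2 archive, but R doesn't. Therefore, my initial attempt to use textConnection() was the right approach, but something is causing a problem. If (and this is a big IF) my logic is correct up to this point, the next question is whether the problem is due to textConnection() or to readLines().

Please advise. Thanks!

P.S. The CSV files that I'm trying to read are in an "almost" CSV format, so I can't use the standard R functions for CSV processing.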

===

UPDATE 1 (program output):

===

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectAuthors2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 514960 bytes (502 Kb)
opened URL
==================================================
downloaded 502 Kb

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDependencies2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 133295 bytes (130 Kb)
opened URL
==================================================
downloaded 130 Kb

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDescriptions2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 5404286 bytes (5.2 Mb)
opened URL
==================================================
downloaded 5.2 Mb

===

UPDATE 2 (program output):

===

After a very long time, I get the following message, and then the program continues processing the rest of the files:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 8 elements

Then the situation repeats: after processing several smaller (less than 1 MB) files, the program "freezes" while processing a larger (> 1 MB) file:

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectTags2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 1226391 bytes (1.2 Mb)
opened URL
==================================================
downloaded 1.2 Mb

===

UPDATE 3 (program output):

===

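After giving the program more time to run, I discovered the following:

*) My assumption that a file size of ~1 MB plays a role in the weird behavior was wrong. This follows from the fact that the program successfully processed files larger than 1 MB and failed to process files smaller than 1 MB. Here is a sample of output with errors: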

trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 826288 bytes (806 Kb)
opened URL
==================================================
downloaded 806 Kb

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 4 elements
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

An example of an error while processing a very small file:

trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 3092 bytes
opened URL
==================================================
downloaded 3092 bytes

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 2 elements

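As the examples above show, size is not the factor, but the file structure might be.

*) I misreported the maximum file size: it is 54.2 MB compressed. This is the file whose processing does not merely generate an error message and continue, but actually triggers an unrecoverable error and stops (exits):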

trying URL 'http://flossdata.syr.edu/data/gc/2012/2012-Nov/gcProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 56793796 bytes (54.2 Mb)
opened URL
=================================================
downloaded 54.2 Mb

Error in textConnection(readLines(conn)) : 
  cannot allocate memory for text connection

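*) After the emergency exit, five R processes use 51% of memory each, whereas after a manual R restart this number stays at 7% (figures as reported by htop).

Even considering the possibility of a "very bad" text/CSV format (suggested by the "Error in scan()" messages), the behavior of the standard R functions textConnection() and/or readLines() looks very strange, even "suspicious", to me. My understanding is that a good function should handle erroneous input data gracefully: allow a very limited amount of time or number of retries, and then either continue processing, if possible, or exit when further processing is impossible. In this case we see (via the defect ticket screenshot) that the R process is taxing both the memory and the processor of the virtual machine.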

Answer 1

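When this has happened to me in the past, I got better performance by not using 'textConnection'. Instead, if I have to do some preprocessing with 'readLines', I write the data out to a temporary file and then use that file as the input to 'read.table'.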
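
A minimal sketch of that temporary-file approach, assuming a small bzip2-compressed, tab-delimited file that is created locally for illustration. The sample rows and the preprocessing step (stripping a trailing tab from each line) are hypothetical and are not taken from the question's data:

```r
# Build a small bzip2-compressed sample file (hypothetical data in which
# every line ends with a stray trailing tab).
src <- tempfile(fileext = ".txt.bz2")
out <- bzfile(src, open = "w")
writeLines(c("id\tname\t", "1\talpha\t", "2\tbeta\t"), out)
close(out)

# Read the raw lines through a decompressing text-mode connection.
conn <- bzfile(src, open = "r")
lines <- readLines(conn)
close(conn)

# Hypothetical preprocessing step: remove the trailing tab from each line.
clean <- sub("\t$", "", lines)

# Write the cleaned lines to a temporary file instead of wrapping them
# in textConnection(), then let read.table() parse that file.
tmp <- tempfile(fileext = ".txt")
writeLines(clean, tmp)
fileData <- read.table(tmp, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
unlink(c(src, tmp))
```

The cleaned text still has to be held in memory once by readLines(), so this avoids only the textConnection() step, not the cost of reading the file.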

Answer 2

You don't have CSV files. I only looked at one of them (yes, actually looked at it in a text editor), but they appear to be tab-delimited.

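url <- 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
file <- "temp.txt.bz2"
# Binary mode ("wb") keeps the compressed bytes intact on every platform
download.file(url, destfile = file, mode = "wb")
# bzfile() decompresses and gives a text-mode connection; no gzcon() needed
dat <- bzfile(file, open = "r")
DF <- read.table(dat, header = TRUE, sep = "\t")
close(dat)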

head(DF)
#   proj_num proj_unixname               requirement       requirement_type      date_collected datasource_id
# 1       14          A2ps                    E-mail           Help,Support 2012-11-02 10:57:40           346
# 2       99          Acct                    E-mail           Bug Tracking 2012-11-02 10:57:40           346
# 3      128          Adns    VCS Repository Webview              Developer 2012-11-02 10:57:40           346
# 4      128          Adns                    E-mail                   Help 2012-11-02 10:57:40           346
# 5      196        AmaroK    VCS Repository Webview           Bug Tracking 2012-11-02 10:57:40           346
# 6      196        AmaroK Mailing List Info/Archive Bug Tracking,Developer 2012-11-02 10:57:40           346
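
The "line N did not have K elements" and "EOF within quoted string" messages in the question are commonly caused by stray quote characters or uneven field counts rather than by file size. The following is a hedged sketch of how one might check for that, using a small made-up tab-delimited sample. The sample rows are hypothetical; count.fields() and the quote and comment.char arguments of read.table() are standard R:

```r
# Hypothetical sample: the second data row contains an unmatched double quote.
tmp <- tempfile(fileext = ".txt")
writeLines(c("proj\tdescription",
             "a2ps\tAny to PostScript filter",
             "acct\t5\" display tool",
             "adns\tDNS resolver"), tmp)

# Count the fields on each line with quoting disabled; a well-formed
# two-column file should report 2 for every line.
n <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
print(table(n))

# Read with quoting and comment handling switched off, so that stray
# quotes and '#' characters are kept as ordinary text.
DF2 <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                  comment.char = "", stringsAsFactors = FALSE)
unlink(tmp)
```

If count.fields() reports differing counts, the lines with the unusual counts are the ones to inspect in a text editor before deciding how to preprocess the file.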

Very slow R code and hanging
