Question

I have a 53 GB file. Here is its head:

1   10  2873
1   100 22246
1   1000    28474
1   10000   35663
1   10001   35755
1   10002   35944
1   10003   36387
1   10004   36453
1   10005   36758
1   10006   37240

I am running R 3.3.2 on a 64-bit CentOS 7 server with 128 GB of RAM. I have already read 4098 similar files into R. However, I cannot read the largest file into R.

df <- read.table(f, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  too many items

It returns an error saying "too many items". I then followed this tip:

df5rows <- read.table(f, nrows=5, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
classes <- sapply(df5rows, class)
df <- read.table(f, nrows=3231959401, colClass=classes, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')

It still says "too many items", now together with "NAs introduced". I also tried without colClasses, with the same result:

df <- read.table(f, nrows=3231959401, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  too many items
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  NAs introduced by coercion to integer range

Memory usage never exceeded 90 GB (and without the nrows or colClasses arguments it never exceeded 60 GB). I don't understand why R cannot read the file.

I have also checked that no line contains 4 or more columns.
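
A side note on the "coercion to integer range" warning above: the nrows value used in the question is larger than the biggest integer R can represent, which is probably where that warning comes from. The sketch below only checks the arithmetic; the variable names are illustrative and not from the original post. It suggests, without proving it, that the row count rather than the available RAM may be the limit here, since as far as I know a data frame cannot hold more than 2^31 - 1 rows.

```r
# Row count passed as nrows in the question (stored as a double,
# because it does not fit in an R integer)
n_rows <- 3231959401

# Largest value an R integer can hold: 2^31 - 1 = 2147483647
int_max <- .Machine$integer.max

# TRUE: the requested row count is outside the integer range
exceeds_int <- n_rows > int_max

# Coercing it gives NA; without suppressWarnings() R reports
# "NAs introduced by coercion to integer range"
coerced <- suppressWarnings(as.integer(n_rows))

# Number of individual fields in a 3-column file with that many rows
n_items <- n_rows * 3

print(exceeds_int)
print(coerced)
print(n_items)
```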

Answer 1

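Have you tried cutting the file into pieces with a lightweight tool such as sed or vi? You would then only need to merge the two datasets. I ran into the same problem on a machine with a very similar large file. In my case the cause was a garbage line, and this error showed up with a file of about this size.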
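
The same idea of cutting the input into pieces can also be sketched inside R, without an external editor. This is a minimal sketch under stated assumptions, not the answerer's method: `read_in_chunks` is a hypothetical helper, the chunk size is arbitrary, and the demo runs on a small temporary file built from the head shown in the question. Each piece is kept as its own data frame, because combining pieces whose total exceeds 2^31 - 1 rows into a single data frame would most likely hit the same limit.

```r
# Hypothetical helper: read a tab-separated file piece by piece
# through one open connection, returning a list of data frames
read_in_chunks <- function(path, chunk_rows, col_names) {
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    # readLines() continues from where the previous call stopped
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break
    pieces[[length(pieces) + 1]] <- read.table(
      text = lines, header = FALSE, col.names = col_names,
      sep = "\t", quote = "", comment.char = ""
    )
  }
  pieces
}

# Demo on a small temporary file (the 10 head lines from the question)
demo_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "1\t10\t2873", "1\t100\t22246", "1\t1000\t28474", "1\t10000\t35663",
  "1\t10001\t35755", "1\t10002\t35944", "1\t10003\t36387",
  "1\t10004\t36453", "1\t10005\t36758", "1\t10006\t37240"
), demo_path)

pieces <- read_in_chunks(demo_path, chunk_rows = 4,
                         col_names = c("a", "b", "dist"))

# Total rows across all pieces, computed as a double to avoid overflow
total_rows <- sum(as.numeric(sapply(pieces, nrow)))
print(length(pieces))
print(total_rows)
```

For the real file a much larger chunk size would be needed, and each piece could be processed or summarised before the next one is read, so that the whole table never has to exist as one object.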
