Question

I am trying to download a large NYC taxi trip dataset, which is publicly available on the NYC TLC website.

library(data.table)
feb14 <- fread('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv', header = T)

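Running the code above downloads the data successfully (it takes a few minutes), but parsing then fails with an internal error. I also tried removing header = T.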

Is there a workaround for handling "unusual line endings" in fread?

Error in fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv",  : 
  Internal error. No eol2 immediately before line 3 after sep detection.
In addition: Warning message:
In fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv",  :
  Detected eol as \n\r, a highly unusual line ending. According to Wikipedia the Acorn BBC used this. If it is intended that the first column on the next row is a character column where the first character of the field value is \r (why?) then the first column should start with a quote (i.e. 'protected'). Proceeding with attempt to read the file.

Answer 1

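The problem appears to be caused by a blank line between the header and the data in the original .csv file. Deleting that line from the .csv with Notepad++ seemed to solve it for me.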
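The same cleanup can be scripted instead of done by hand in an editor. The following is a minimal sketch in base R, not something from the original answer: the helper name strip_blank_lines and the tiny stand-in file are made up for illustration, and it has not been tried on the real TLC file.

```r
# Hypothetical helper: copy a CSV while dropping blank lines
# and any stray carriage returns left at the end of a line.
strip_blank_lines <- function(in_path, out_path) {
  lines <- readLines(in_path, warn = FALSE)
  lines <- sub("\r$", "", lines)          # remove a trailing carriage return, if any
  lines <- lines[nzchar(trimws(lines))]   # keep only non-empty lines
  writeLines(lines, out_path)
  invisible(out_path)
}

# Made-up stand-in for the real file: header, blank line, then data
raw_path <- tempfile(fileext = ".csv")
writeLines(c("vendor_id,passenger_count", "", "CMT,1", "VTS,2"), raw_path)

clean_path <- strip_blank_lines(raw_path, tempfile(fileext = ".csv"))
clean_lines <- readLines(clean_path)
print(clean_lines)
```

The cleaned file can then be passed to fread(clean_path). Note that readLines holds the whole file in memory, so for a file of this size the approach may be slow.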

Answer 2

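Sometimes other options such as read.csv / read.table behave differently, so you can always try those. (Perhaps the source code explains why; I have not looked into it.)

Another option is to use readLines() to read such a file. As far as I know, it does no parsing or formatting, so it is the most basic way to read a file.

Finally, a quick fix: use the 'skip = ...' option in fread, or control where reading ends by specifying 'nrows = ...'.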
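The readLines and skip / nrows suggestions can be combined roughly as follows. This is a sketch under assumptions: the file contents are a made-up stand-in with a blank second line, and read.csv is used so that the example runs with base R alone.

```r
# Made-up stand-in: header, blank line, then three data rows
path <- tempfile(fileext = ".csv")
writeLines(c("vendor_id,passenger_count", "", "CMT,1", "VTS,2", "CMT,3"), path)

# Look at the first raw lines without any parsing
head_lines <- readLines(path, n = 3)
print(head_lines)

# Take the column names from line 1, then skip the header and the
# blank line and read only the first two data rows
col_names <- strsplit(head_lines[1], ",", fixed = TRUE)[[1]]
first_rows <- read.csv(path, skip = 2, header = FALSE,
                       col.names = col_names, nrows = 2)
print(first_rows)
```

fread accepts skip, nrows, header and col.names as well, so the same values could be tried there. Whether that avoids the internal error on the real file is not confirmed here.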

Answer 3

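Something is fishy with fread. data.table is faster and more performant for reading large files, but in this case the behavior is not optimal. You may want to raise this as an issue on GitHub.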

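I was able to reproduce the problem on the downloaded file even with nrows = 5 or nrows = 1, but only when sticking with the original file. If I copy-paste the first few rows into a new file and try again, the problem disappears. If I read directly from the web with a small nrows, the problem also disappears. This is not even an encoding problem, hence my recommendation to raise an issue.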
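One way to see why a copy-pasted excerpt behaves differently from the original file is to look at the raw bytes around the first line breaks, since copy-pasting may normalize them. The sketch below uses base R only; the byte sequence written to the temporary file is a made-up stand-in, not the actual contents of the TLC file.

```r
# Write bytes verbatim so the line endings are exactly the ones given here
path <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b\n\r\n1,2\r\n"), path)

# Read the first bytes back without any text-mode translation
bytes <- readBin(path, what = "raw", n = 64)

# Positions of carriage returns (0x0d) and line feeds (0x0a)
cr_pos <- which(bytes == as.raw(0x0d))
lf_pos <- which(bytes == as.raw(0x0a))
print(bytes)
```

On a downloaded copy of the real file, the same readBin call on the local path would show which bytes follow the header line.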

I tried reading the first 100,000 rows of the file with read.csv; it worked without any problems and took under 6 seconds.

feb14_2 <- read.csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", header = T, nrows = 100000)

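header = T is redundant for fread, which detects the header automatically, so it makes no difference there. read.csv also defaults to header = TRUE; only read.table needs it set explicitly.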

fread detects an unusual line ending, resulting in an error
