Question

My input file contains different EOL/newline characters. (The first line ends with CRLF, and it is followed by lines ending with CR.)

The base R functions (read.table(), scan(), ...) all handle my file correctly. From the help file for scan():

Whatever mode the connection is opened in, any of LF, CRLF or CR are accepted as the EOL marker for a line and so match sep = "\n".

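But I need data.table functionality (multiple big files), and fread() seems to infer the EOL from the first line only, so later lines lose their first character, as with "oobar" below.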

> myfile <- tempfile()
> writeLines("foo,1\r\nbar,2\rfoobar,3\r\ncoucou,4\rzut,5\nflute,6\n",
    sep = "", con = myfile)
> (with_readLines <- readLines(myfile))
[1] "foo,1"    "bar,2"    "foobar,3" "coucou,4" "zut,5"    "flute,6" 
> (with_read.table <- read.table(myfile, sep = ","))
      V1 V2
1    foo  1
2    bar  2
3 foobar  3
4 coucou  4
5    zut  5
6  flute  6
> (with_scan <- scan(myfile, sep = ",", what = list(character(), integer())))
Read 6 records
[[1]]
[1] "foo"    "bar"    "foobar" "coucou" "zut"    "flute" 

[[2]]
[1] 1 2 3 4 5 6

> (with_fread <- data.table::fread(myfile, sep = ","))
       V1 V2
1:    foo  1
2:    bar  2
3:  oobar  3
4: coucou  4
Warning message:
In data.table::fread(myfile, sep = ",") :
  Stopped reading at empty line 5 but text exists afterwards (discarded): ut,5
flute,6
> file.remove(myfile)
[1] TRUE

I currently use setDT(myDT <- read.table(myfile, sep = ",")), but I am not happy with it. Any ideas?
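
For reference, here is a minimal sketch of another workaround I could imagine: rewrite the file with a single, consistent EOL before handing it to fread(). It relies only on the fact shown above that readLines() accepts LF, CRLF and CR. The helper name normalize_eol is hypothetical (it is not part of data.table), and I have not checked how well this scales to very large files, since it holds all lines in memory.

```r
# Hypothetical helper, not part of data.table: rewrite a file so that
# every line ends with LF only.
normalize_eol <- function(path, out = tempfile(fileext = ".csv")) {
  lines <- readLines(path, warn = FALSE)  # accepts LF, CRLF and CR
  con <- file(out, open = "wb")           # binary mode: no platform EOL translation
  on.exit(close(con))
  writeLines(lines, con = con, sep = "\n")
  out
}

# Same mixed-EOL test file as in the transcript above.
myfile <- tempfile()
writeLines("foo,1\r\nbar,2\rfoobar,3\r\ncoucou,4\rzut,5\nflute,6\n",
           sep = "", con = myfile)
clean <- normalize_eol(myfile)

# Only call fread() if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  with_fread <- data.table::fread(clean, sep = ",", header = FALSE)
  print(with_fread)
}
```

This costs one extra pass over each file plus a temporary copy on disk, so it is not obviously better than the read.table() approach for my use case.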

> sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6

loaded via a namespace (and not attached):
[1] rsconnect_0.4.1.11 tools_3.2.4        chron_2.3-47

fread a file with different EOLs

0 answers: