My input file has mixed EOL (line-ending) characters: the first line ends with CRLF, and it is followed by a line that ends with a bare CR.
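To check which terminators a file really contains, a small base-R sketch like the one below might help. `count_eol` is a hypothetical helper name of my own, not something from base R or data.table; it simply reads the raw bytes and counts CRLF pairs, lone CRs and lone LFs.

```r
# Hypothetical helper: count each kind of line terminator in a file.
count_eol <- function(path) {
  # Read the file as raw bytes so that nothing gets translated on the way in.
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  n <- length(bytes)
  is_cr <- bytes == as.raw(0x0d)
  is_lf <- bytes == as.raw(0x0a)
  # For each byte: is the next byte an LF / is the previous byte a CR?
  next_is_lf <- c(is_lf[-1], FALSE)
  prev_is_cr <- c(FALSE, is_cr[-n])
  c(CRLF = sum(is_cr & next_is_lf),   # CR immediately followed by LF
    CR   = sum(is_cr & !next_is_lf),  # lone CR
    LF   = sum(is_lf & !prev_is_cr))  # lone LF
}

# Usage on a file with the same content as my example file.
eol_demo <- tempfile()
writeBin(charToRaw("foo,1\r\nbar,2\rfoobar,3\r\ncoucou,4\rzut,5\nflute,6\n"), eol_demo)
eol_counts <- count_eol(eol_demo)
print(eol_counts)
```

More than one non-zero count means the file mixes line endings.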
Any of the base R functions (read.table(), scan(), ...) handles my file correctly.
From the help file of scan():
Whatever mode the connection is opened in, any of LF, CRLF or CR are accepted as the EOL marker for a line and so will match sep = "\n".
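But I need data.table functionality (multiple large files), and fread() seems to infer the EOL from the first line only, so the first character of some later lines gets cut off: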
> myfile <- tempfile()
> writeLines("foo,1\r\nbar,2\rfoobar,3\r\ncoucou,4\rzut,5\nflute,6\n",
sep = "", con = myfile)
> (with_readLines <- readLines(myfile))
[1] "foo,1" "bar,2" "foobar,3" "coucou,4" "zut,5" "flute,6"
> (with_read.table <- read.table(myfile, sep = ","))
V1 V2
1 foo 1
2 bar 2
3 foobar 3
4 coucou 4
5 zut 5
6 flute 6
> (with_scan <- scan(myfile, sep = ",", what = list(character(), integer())))
Read 6 records
[[1]]
[1] "foo" "bar" "foobar" "coucou" "zut" "flute"
[[2]]
[1] 1 2 3 4 5 6
> (with_fread <- data.table::fread(myfile, sep = ","))
V1 V2
1: foo 1
2: bar 2
3: oobar 3
4: coucou 4
Warning message:
In data.table::fread(myfile, sep = ",") :
Stopped reading at empty line 5 but text exists afterwards (discarded): ut,5
flute,6
> file.remove(myfile)
[1] TRUE
For now I use setDT(myDT <- read.table(myfile, sep = ",")), but I am not happy with it. Any ideas?
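One alternative I can think of is to normalise the line endings first and then let fread() do the parsing. The sketch below relies on readLines() accepting LF, CRLF and CR (as in the transcript above); `normalize_eol` is a hypothetical helper name, and the whole file passes through memory, so it may not scale to very large inputs.

```r
# Hypothetical helper: write a copy of `path` in which every line ends with LF,
# and return the path of that copy.
normalize_eol <- function(path) {
  # readLines() accepts LF, CRLF or CR as a line terminator.
  lines <- readLines(path, warn = FALSE)
  out <- tempfile(fileext = ".csv")
  # Open in binary mode so that "\n" is written as-is on every platform.
  con <- file(out, open = "wb")
  on.exit(close(con))
  writeLines(lines, con, sep = "\n")
  out
}

# Recreate the example file with mixed line endings.
myfile <- tempfile()
writeLines("foo,1\r\nbar,2\rfoobar,3\r\ncoucou,4\rzut,5\nflute,6\n",
           sep = "", con = myfile)
clean <- normalize_eol(myfile)

# fread() is only called if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  with_fread <- data.table::fread(clean, sep = ",", header = FALSE)
  print(with_fread)
}
```

This still reads every file twice, so it is a stopgap and not a real fix.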
> sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6
loaded via a namespace (and not attached):
[1] rsconnect_0.4.1.11 tools_3.2.4 chron_2.3-47