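I always use fread from the data.table package to read large tables. However, it apparently does not support reading Unicode files on Windows (Windows 7 Professional, to be precise).
This is the file I tried: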
A,B
ą,ž
ū,į
ų,ė
š,ę
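If I read it on Mac OS X, or if I read it with read.csv and the option encoding="UTF-8", it works fine. Unfortunately, fread does not have this option.
Is there another fast way to read Unicode tables on Windows, or should I use a different operating system? Or am I missing something obvious?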
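For reproducibility, here is a minimal sketch that recreates the test file and reads it with read.csv. The tempfile() path and the variable names are assumptions made for illustration (my real file lives under F:/R/), and the \u escapes stand for the same eight letters shown in the file above.

```r
# Sketch: write the four-row test file as UTF-8 with CRLF line endings,
# then read it back with read.csv(encoding = "UTF-8").
# The path is a placeholder, not the original F:/R/ location.
path <- tempfile(fileext = ".csv")

# \u escapes are used so the strings do not depend on this script's own encoding.
lines <- c("A,B",
           "\u0105,\u017e",
           "\u016b,\u012f",
           "\u0173,\u0117",
           "\u0161,\u0119")

# Binary mode plus useBytes = TRUE writes the UTF-8 bytes unchanged,
# so no locale re-encoding happens on output.
con <- file(path, open = "wb")
writeLines(enc2utf8(lines), con, sep = "\r\n", useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings as UTF-8; it does not convert them.
bb <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)
print(Encoding(bb$A))
```

Replacing read.csv with fread(path) on the same file is what produces the garbled output on my Windows machine.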
Here is the sessionInfo():
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.4
loaded via a namespace (and not attached):
[1] chron_2.3-45 plyr_1.8.1 Rcpp_0.11.5 reshape2_1.4.1 stringr_0.6.2
Update: pasting the output as requested.
> aa<-fread("F:/R/unicode_test2.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.000000 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 5
Starting data input on line 1 (either column names or first row of data). First 10 characters: Ä„,B
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 5 (including 1 at the end)
Count of sep: 4
nrow = MIN( nsep [4] / ncol [2] -1, neol [5] - nblank [1] ) = 4
Type codes ( first 5 rows): 44
Type codes: 44 (after applying colClasses and integer64)
Type codes: 44 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 4 rows. Exactly what was estimated and allocated up front
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 4x2 result (xMB) in RAM
0.000s ( 0%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.001s Total
> aa
Ä„ B
1: ą ž
2: ū į
3: ų ė
4: Å¡ Ä™
> aa$A
[1] "ą" "ū" "ų" "š"
> aa$B
[1] "ž" "į" "ė" "ę"
> bb <- read.csv("F:/R/unicode_test.csv",encoding="UTF-8",strings=FALSE)
> bb
A B
1 a ž
2 u i
3 u e
4 š e
> bb$B
[1] "ž" "į" "ė" "ę"
> bb$A
[1] "ą" "ū" "ų" "š"
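Looking at the output, the fread columns seem to hold the correct UTF-8 bytes that are simply displayed as if they were Windows-1252. If that guess is right, marking the strings as UTF-8 with base R's Encoding<- might be enough. The sketch below only simulates the situation with a plain character vector (the vector x is an assumption standing in for aa$A); I have not confirmed that this is the recommended approach for data.table.

```r
# Sketch: simulate strings that carry UTF-8 bytes but no encoding mark,
# which is what the fread result above appears to contain.
x <- c("\u0105", "\u016b", "\u0173", "\u0161")
Encoding(x) <- "unknown"   # drop the mark; the bytes stay the same

# Declare the existing bytes to be UTF-8. Nothing is converted,
# only the encoding mark on each string changes.
Encoding(x) <- "UTF-8"
print(Encoding(x))

# With the real data this would presumably be applied per character column,
# for example: Encoding(aa$A) <- "UTF-8"
```

The column names would presumably need the same treatment, since the header is garbled as well.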