fread hangs on certain types of files

Time: 2018-12-17 22:18:29

Tags: r data.table fread

Since I updated R to version 3.5.1 and data.table to the latest version (1.11.18), fread() hangs on some files but not on others.

> test_1 <- fread("Dec_1_10.csv", verbose = TRUE)

omp_get_max_threads() = 4
omp_get_thread_limit() = 2147483647
DTthreads = 0
Input contains no \n. Taking this to be a filename to open

[01] Check arguments
  Using 4 threads (omp_get_max_threads()=4, nth=4)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer

[02] Opening the file
  Opening file Dec_1_10.csv
  File opened, size = 334.9MB (351129569 bytes).
  Memory mapped ok

[03] Detect and skip BOM

[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
'. Final end-of-line is missing. Using cow page to write 0 to the last byte.

[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<ID,NAME,GENDE>>

[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 1 lines of 26029650 fields using quote rule 0
  sep=','  with 9 lines of 31 fields using quote rule 2
  Detected 31 columns on line 2. This line is either column names or first data row. Line starts as: <<0126_V3","DSRI",>>
  Quote rule picked = 2
  fill=false and the most number of columns found is 31

[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 1 because (102965936 bytes from row 1 to eof) / (2 * 91563770 jump0size) == 0
  A line with too-many fields (31/31) was found on line 9 of sample jump 0. 
  Type codes (jump 000)    : AAAA2AAA52AAAAAAAA2AA22AAAAAA2A  Quote rule 2
Types in 1st data row match types in 2nd data row but previous row has 18402118 fields. Taking previous row as column names.  All rows were sampled since file is small so we know nrow=8 exactly

[08] Assign column names

[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : AAAA2AAA52AAAAAAAA2AA22AAAAAA2A2222222222222222222222222222222222222222222222222...2222222222

[10] Allocate memory for the datatable
  Allocating 18402118 column slots (18402118 - 0 dropped) with 8 rows

[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=102965936
  Too few rows allocated. Allocating additional 1024 rows (now nrows=1032) and continue reading from jump 0

...and then it hangs here until I force-quit R.

Calling fread() on other .csv files seems to work fine, but every file I have with this particular structure/size fails to parse.

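Edit: This time I let the R session run for a few hours instead of force-quitting after a few minutes. It eventually stopped with: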

Error: vector memory exhausted (limit reached?)
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In FUN(X[[i]], ...) :
  Detected 5471442 column names but the data has 31 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
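
The first warning mentions improper quoting and suggests quote="". As a rough sketch of how that could be checked and tried, the snippet below builds a tiny stand-in file (the file contents and column names are hypothetical, not taken from Dec_1_10.csv), flags lines with an odd number of double quotes, and then reads the file with quoting disabled. An odd count is only a hint: a quoted field that legitimately spans several lines would be flagged too.

```r
library(data.table)

# Hypothetical stand-in for the real file: line 2 has an unbalanced quote.
tmp <- tempfile(fileext = ".csv")
writeLines(c('id,name,value', '1,"Ann,3', '2,"Bob",4'), tmp)

# Count double-quote characters on each of the first 100 lines.
first_lines <- readLines(tmp, n = 100)
n_quotes <- lengths(regmatches(first_lines,
                               gregexpr('"', first_lines, fixed = TRUE)))

# Lines with an odd count are candidates for an unbalanced quote.
odd_lines <- which(n_quotes %% 2L == 1L)
print(odd_lines)

# Read with quoting disabled, as the warning suggests; quote characters
# are then kept as literal text inside the fields.
dt <- fread(tmp, quote = "")
print(dt)
```

Whether quote="" is appropriate for the real file depends on whether any field contains the separator; if one does, disabling quoting would split that field.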

I have tried skipping the first row of the data and specifying the column names myself. Neither seems to get around the problem.
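
For reference, an attempt of that kind might look like the sketch below. It runs against a small hypothetical stand-in file with made-up column names; on the real file the same arguments did not prevent the hang.

```r
library(data.table)

# Hypothetical stand-in file: one header row plus two data rows.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,value", "1,Ann,3", "2,Bob,4"), tmp)

# Skip the header line and supply the column names by hand,
# so fread() does not have to infer them from line 1.
dt <- fread(tmp, skip = 1, header = FALSE,
            col.names = c("id", "name", "value"))
print(dt)
```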

0 answers:

No answers