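I am trying to read a large file into R on an EC2 instance. However, the actual wall-clock time is far longer than the time fread reports after reading the data.
For example, below is fread's verbose = TRUE output when I read only the first row of data from my csv file. As you can see, the reported run time (about 26 seconds) is much shorter than the actual run time (about 9.7 minutes). Do you know why this happens? Is there any way to speed the process up so that it is closer to the run time fread reports after reading the data?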
> start_time <- Sys.time()
> fread(file_name_1, nrows=1, verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 68.770914 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 55 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: bank_num,b
All the fields on line 1 are character fields. Treating as the column names.
nrow set to nrows passed in (1)
Type codes (point 0): 1114434134111034444411333333333333333333333333333311111
Type codes: 1114434134111034444411333333333333333333333333333311111 (after applying colClasses and integer64)
Type codes: 1114434134111034444411333333333333333333333333333311111 (after applying drop or select (if supplied)
Allocating 55 column slots (55 - 0 dropped)
Read 1 rows and 55 (of 55) columns from 68.771 GB file in 00:00:27
Read 1 rows. Exactly what was estimated and allocated up front
26.480s (100%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Column type detection (100 rows at 10 points)
0.000s ( 0%) Allocation of 1x55 result (xMB) in RAM
0.000s ( 0%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
26.480s Total
> end_time <- Sys.time()
> end_time - start_time
Time difference of 9.695263 mins
Answer 0 (score: 1)
Please always state the version number, e.g. by including the output of sessionInfo(). But I can tell you are probably using the CRAN version.
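As a minimal sketch of how the version information could be collected before posting, the snippet below prints the installed data.table version and the full session details. The comparison against 1.10.5 is only there because that is the version this answer refers to; the variable names are illustrative.

```r
# Check whether data.table is installed without attaching it.
dt_installed <- requireNamespace("data.table", quietly = TRUE)

if (dt_installed) {
  # packageVersion() returns a package_version object that can be
  # compared directly against a version string.
  dt_version <- utils::packageVersion("data.table")
  cat("data.table version:", as.character(dt_version), "\n")
  cat("At least 1.10.5:", dt_version >= "1.10.5", "\n")
} else {
  cat("data.table is not installed\n")
}

# Full session details (R version, platform, attached packages),
# suitable for pasting into a question.
print(utils::sessionInfo())
```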
Before asking on Stack Overflow, please always check NEWS first.
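Item 3 (among many other fread improvements):
Memory maps lazily; e.g. reading just the first 10 rows with nrow=10 is down from 12s to 0.01s for a 9GB file. Large files close to the RAM limit may also work more reliably. The progress meter will start sooner and more consistently.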
It is easy to try the latest version from dev using this install command. You wrote EC2, so you are probably on Linux, but any Windows user can use the Windows.zip from dev without needing any tools.
Since you have a 68GB csv, you will definitely benefit a lot from data.table v1.10.5+. Please post an update here on how you get on with it.
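After upgrading, one way to check whether the gap has closed is to time the same one-row read with system.time() and compare the elapsed value with the total that verbose = TRUE prints. The sketch below assumes data.table is installed and uses a small temporary csv as a hypothetical stand-in for the 68GB file; tmp_csv and its two columns are made up for illustration.

```r
library(data.table)

# Hypothetical small stand-in for the real file; replace tmp_csv with
# the actual path (file_name_1 in the question) when testing for real.
tmp_csv <- tempfile(fileext = ".csv")
write.csv(
  data.frame(bank_num = 1:1000, value = rnorm(1000)),
  tmp_csv,
  row.names = FALSE
)

# system.time() measures wall-clock time around the whole call, which is
# the figure to compare against the total reported by verbose = TRUE.
timing <- system.time(
  first_row <- fread(tmp_csv, nrows = 1)
)

cat("Elapsed seconds:", timing[["elapsed"]], "\n")
print(first_row)

# Clean up the temporary file.
unlink(tmp_csv)
```

On a file this small both numbers will be close to zero; the comparison only becomes informative on the real file.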