Question

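I am trying to bring a large dataset into R, and I found that fread() from data.table can read it in a reasonable amount of time (read.csv was really slow for me).

I am running into a few issues that I would like some insight on. There is a "ï»¿" marker in front of my first column name, which I can quickly fix with a rename statement, but in addition the values in that column are quite different from the original file. The value should be a 16-digit identifier code (for example "1100110011001100"), but when it is read in it shows up as "3.598E-310".

I don't know whether this is due to the UTF-8 format my data is in, but I am having trouble pinning down what is going on. There is another variable with a similar purpose (a 12-digit numeric code) that also ends up in exponential notation. All of my other variables appear to be fine (although none of them is as long as these two).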

Answer 1

You should get a friendly warning:

library(data.table) #1.10.0

DT <- fread("1100110011001100
      1100110011001100")
#Warning message:
#In fread("1100110011001100\n      1100110011001100") :
#  Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again.

print(DT)
#              V1
#1: 5.435266e-309
#2: 5.435266e-309
#Warning message:
#In print.data.table(DT) :
#  Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again.

library(bit64)
print(DT)
#                 V1
#1: 1100110011001100
#2: 1100110011001100
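
If loading bit64 is not desirable, a possible alternative is fread()'s integer64 argument, which controls how columns detected as 64-bit integers are returned. The following is a minimal sketch, assuming the same two-line input as above; the names input, DT_chr and DT_dbl are illustrative and not part of the original post.

```r
library(data.table)

# Same two-row, single-column input as in the example above
input <- "1100110011001100\n1100110011001100"

# Return columns detected as integer64 as character strings
DT_chr <- fread(input, integer64 = "character")
class(DT_chr$V1)

# Or return them as ordinary doubles; this is exact only
# while the values stay below 2^53
DT_dbl <- fread(input, integer64 = "double")
class(DT_dbl$V1)
```

For identifier codes that are never used in arithmetic, the character variant is probably the safer of the two.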

Answer 2

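If I understand the OP correctly, the 16-digit identifier code should be of type character.

However, fread() determines the column types from a sample of rows (see ?fread for details). Apparently, it tries to read the data as integer64. The colClasses parameter can be used to override the guesses made by fread():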

DT <- fread("1100110011001100
      1100110011001100", colClasses = "character")
DT
#                 V1
#1: 1100110011001100
#2: 1100110011001100
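
In a file with several columns, the override can be limited to the identifier columns by naming them in colClasses, so that the remaining columns are still detected automatically. This is only a sketch: the column names id16, id12 and value and the sample rows are made up for illustration, since the OP's real column names are not shown.

```r
library(data.table)

# Hypothetical three-column input with a header row
csv_text <- paste(
  "id16,id12,value",
  "1100110011001100,110011001100,1.5",
  "1100110011001101,110011001101,2.5",
  sep = "\n"
)

# Force only the two identifier columns to character;
# 'value' is left to fread()'s own type detection
DT <- fread(csv_text, colClasses = c(id16 = "character", id12 = "character"))
sapply(DT, class)
```

Reading identifiers as character also keeps any leading zeros, which a numeric type would drop.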

If the verbose parameter is set to TRUE, fread() reveals its inner workings:

DT <- fread("1100110011001100
      1100110011001100", colClasses = "character", verbose = TRUE)
Input contains a \n (or is ""). Taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... Deducing this is a single column input.
Starting data input on line 1 (either column names or first row of data). First 10 characters: 1100110011
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol: 2 (including 0 at the end)
ncol==1 so sep count ignored
Type codes (point  0): 2
Column 1 ('V1') was detected as type 'integer64' but bumped to 'character' as requested by colClasses
Type codes: 4 (after applying colClasses and integer64)
Type codes: 4 (after applying drop or select (if supplied)
Allocating 1 column slots (1 - 0 dropped)
Read 2 rows. Exactly what was estimated and allocated up front
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.000s (  0%) Column type detection (100 rows at 10 points)
   0.000s (  0%) Allocation of 2x1 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.001s        Total

This might help to analyse the problem with reading the variable that holds the 12-digit numeric code.
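
A likely reason why the 12-digit code behaves like the 16-digit one is that both exceed the range of R's native 32-bit integer type, so both would be detected as integer64. The check below is a base R sketch of that reasoning; the value of id12 is a hypothetical example, as the OP's actual 12-digit codes are not shown.

```r
# Largest value R's native integer type can hold (2^31 - 1)
int_max <- .Machine$integer.max

id12 <- 110011001100      # hypothetical 12-digit code
id16 <- 1100110011001100  # 16-digit code from the question

# Both codes are too large for a 32-bit integer
id12 > int_max
id16 > int_max

# Both are still below 2^53, so a double can hold them exactly
id16 < 2^53
```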
