Data loss with `fread()` due to coercion of NA to character

Time: 2016-03-19 06:57:13

Tags: r data.table

I am using fread() to read data files. For some files, I run into the following situation:

dt1 <- fread('colA colB colC
             A01 NA NA
             A02 NA NA
             A03 NA NA
             A04 NA NA
             A05 NA NA
             A06 NA NA
             A07 bbb NA
             A08 NA ccc
             A09 NA NA
             A10 NA NA
             A11 NA NA
             A12 NA NA
             A13 NA NA
             A14 NA NA
             A15 NA NA
             A16 NA NA
             A17 NA NA
             A18 NA NA
             ')

Bumped column 2 to type character on data row 7, field contains 'bbb'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

dt1
#     colA colB colC
# 1:   A01        NA
# 2:   A02        NA
# 3:   A03        NA
# 4:   A04        NA
# 5:   A05        NA
# 6:   A06        NA
# 7:   A07  bbb   NA
# 8:   A08   NA  ccc
# 9:   A09   NA   NA
# 10:  A10   NA   NA

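In the resulting data.table, the colB values that come before the first character value are empty strings rather than NA. I do not know the column names or column numbers in advance, so I cannot use the colClasses argument to target that column. Is there a way around this (other than using read.table() instead of fread())?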

2 answers:

Answer 0: (score: 4)

Following up on a comment on my first answer:

fread(DT, colClasses="character")

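reads all columns as character: the single value is recycled across every column, following standard R recycling. In this case it is not known in advance which column (by name or by number) has the problem, so all columns can simply be read as character.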
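A minimal sketch of this approach, using a shortened version of the question's data. The sample text, the explicit `header = TRUE` and `sep = " "`, and the clean-up loop are assumptions added for illustration. The loop converts any remaining empty strings or literal "NA" strings to `NA_character_`, so the result should not depend on how a particular data.table version treats NA in character columns.

```r
library(data.table)

# Shortened, hypothetical version of the data from the question
txt <- "colA colB colC
A01 NA NA
A02 NA NA
A03 bbb NA
A04 NA ccc
"

# Read every column as character; the single value is recycled over all columns
dt <- fread(txt, colClasses = "character", header = TRUE, sep = " ")

# Normalise missing values: turn "" and literal "NA" into NA_character_
for (col in names(dt)) {
  idx <- which(dt[[col]] %in% c("", "NA"))
  if (length(idx)) set(dt, i = idx, j = col, value = NA_character_)
}

print(dt)
```

Because every column is now character, numeric columns would need converting back afterwards; base R's `type.convert(x, as.is = TRUE)` is one option for re-inferring the type of each column.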

Answer 1: (score: 1)

Columns can be passed to colClasses, by name or by number.

See the many examples documented at the bottom of ?fread:

# colClasses
data = "A,B,C,D\n1,3,5,7\n2,4,6,8\n"
fread(data, colClasses=c(B="character",C="character",D="character"))  # as read.csv
fread(data, colClasses=list(character=c("B","C","D")))    # saves typing
fread(data, colClasses=list(character=2:4))     # same using column numbers

# drop
fread(data, colClasses=c("B"="NULL","C"="NULL"))   # as read.csv
fread(data, colClasses=list(NULL=c("B","C")))      # 
fread(data, drop=c("B","C"))      # same but less typing, easier to read
fread(data, drop=2:3)             # same using column numbers

# select
# (in read.csv you need to work out which to drop)
fread(data, select=c("A","D"))    # less typing, easier to read
fread(data, select=c(1,4))        # same using column numbers
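When the column names are not known in advance, as in the question, one possible way to still use these forms is to read only the header first and build `colClasses` from it. The sketch below assumes the same `data` string as the examples above. The two-pass approach and the choice of which columns to force to character are illustrative, not something taken from ?fread.

```r
library(data.table)

data <- "A,B,C,D\n1,3,5,7\n2,4,6,8\n"

# First pass: nrows = 0 returns an empty data.table carrying only the column names
cols <- names(fread(data, nrows = 0, header = TRUE))

# Second pass: force every column except the first to character
# (which columns to force is an assumption made for this example)
dt <- fread(data, colClasses = list(character = cols[-1]))

str(dt)
```

This reads the input twice, so for very large files reading everything with `colClasses = "character"` may be the simpler choice.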