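Removing factor levels in a data.table

Asked: 2014-05-04 00:10:56

Tags: r data.table

A public dataset contains factor levels (for example, "(0) Omitted") that I want to recode as NA. Ideally, I would like to scrub an entire subset of columns at once. I am using the data.table package, and I am wondering whether there is a better or faster way to do this than converting the values to character, replacing the unwanted strings, and then converting the data back to factor.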

library(data.table)
DT <- data.table(V1=factor(sample(LETTERS,size = 2000000,replace=TRUE)),
                V2 = factor(sample(LETTERS,size = 2000000,replace=TRUE)),
                V3 = factor(sample(LETTERS,size = 2000000,replace=TRUE)))

# Convert to character
DT1 <- DT[, lapply(.SD, as.character)]
DT2 <- copy(DT1)
DT3 <- copy(DT) # Needs to be factor

# Scrub all 'B' values
DT1$V1[DT1$V1=="B"] <- NA
# Works!

DT2[V1 == "B", V1 := NA]
# Warning message:
#   In `[.data.table`(DT, V1 == "B", `:=`(V1, NA)) :
#   Coerced 'logical' RHS to 'character' to match the column's type. Either change the target column to 'logical' first (by creating a new 'logical' vector length 26 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.

identical(DT1,DT2)
# [1] TRUE

# First attempt at looping over data.table
cnames <- colnames(DT3)
system.time(for(cname in cnames) {
  # Parenthesised LHS replaces the deprecated with=FALSE form of :=
  DT3[, (cname) := gsub("B", NA, DT3[[cname]])]
})
# user  system elapsed 
# 4.258   0.128   4.478 

identical(DT1$V1,DT3$V1)
# [1] TRUE

# Back to factors
DT3 <- DT3[, lapply(.SD, as.factor)]

2 Answers:

Answer 0 (score: 2)

Set the factor level to NA:

levels(DT$V1)[levels(DT$V1) == 'B'] <- NA

Example:

> d <- data.table(l=factor(LETTERS[1:3]))
> d
   l
1: A
2: B
3: C
> levels(d$l)[levels(d$l) == 'B'] <- NA
> d
    l
1:  A
2: NA
3:  C
> levels(d$l)
[1] "A" "C"
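
To scrub every column at once, a rough sketch along the same lines might look like the following. The toy table and the loop variable `col` are illustrative assumptions, not part of the answer above, and it assumes every column is a factor. Each column's levels are edited on a copy, which is then written back with `set()`:

```r
library(data.table)

# Toy table for illustration only
DT <- data.table(V1 = factor(c("A", "B", "C")),
                 V2 = factor(c("B", "B", "D")))

for (col in names(DT)) {
  x <- DT[[col]]
  # Setting a level to NA drops it and turns the matching values into NA
  levels(x)[levels(x) == "B"] <- NA
  # Write the modified column back into DT by reference
  set(DT, j = col, value = x)
}
```

This only edits the levels vector instead of converting the values to character and back, although `levels<-` will likely still build a new factor for each column.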

Answer 1 (score: 2)

You can change the levels as follows:

for (j in seq_along(DT)) {
    x  = DT[[j]]
    lx = levels(x)
    lx[lx == "B"] = NA
    setattr(x, 'levels', lx)      ## reset levels by reference
}
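
One thing worth checking with this approach: overwriting the `levels` attribute directly keeps `NA` as one of the levels, so the underlying integer codes are unchanged and `is.na()` on the column does not flag those values. Assigning through `levels<-` drops the level instead. A small sketch on standalone toy factors (the names `x` and `y` are illustrative, not from the question):

```r
library(data.table)

# Attribute-level replacement, as in the loop above
x <- factor(c("A", "B", "C"))
lx <- levels(x)
lx[lx == "B"] <- NA
setattr(x, "levels", lx)            # NA is now one of the three levels

# Replacement through levels<-
y <- factor(c("A", "B", "C"))
levels(y)[levels(y) == "B"] <- NA   # level is dropped, value becomes a true NA

chr_x <- as.character(x)            # the former "B" entry reads back as NA
na_x  <- is.na(x)                   # based on the integer codes, which are not NA
na_y  <- is.na(y)
```

If downstream code relies on `is.na()`, comparing `na_x` and `na_y` on your own data before choosing between the two forms is probably worthwhile.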