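A public dataset contains factor levels (e.g., "(0) Omitted") that I want to recode as NA. Ideally, I'd like to scrub an entire subset of columns at once. I'm using the data.table
package and wonder whether there is a better or faster way to do this than converting the values to character, removing the unwanted strings, and then converting the data back to factor.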
library(data.table)
DT <- data.table(V1 = factor(sample(LETTERS, size = 2000000, replace = TRUE)),
                 V2 = factor(sample(LETTERS, size = 2000000, replace = TRUE)),
                 V3 = factor(sample(LETTERS, size = 2000000, replace = TRUE)))
# Convert to character
DT1 <- DT[, lapply(.SD, as.character)]
DT2 <- copy(DT1)
DT3 <- copy(DT) # Needs to be factor
# Scrub all 'B' values
DT1$V1[DT1$V1=="B"] <- NA
# Works!
DT2[V1 == "B", V1 := NA]
# Warning message:
# In `[.data.table`(DT, V1 == "B", `:=`(V1, NA)) :
# Coerced 'logical' RHS to 'character' to match the column's type. Either change the target column to 'logical' first (by creating a new 'logical' vector length 26 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
identical(DT1,DT2)
# [1] TRUE
# First attempt at looping over data.table
cnames <- colnames(DT3)
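system.time(for (cname in cnames) {
  # (cname) := ... assigns to the column named by cname;
  # this replaces the deprecated `cname := ..., with=FALSE` form
  DT3[, (cname) := gsub("B", NA, DT3[[cname]])]
})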
# user system elapsed
# 4.258 0.128 4.478
identical(DT1$V1,DT3$V1)
# [1] TRUE
# Back to factors
DT3 <- DT3[, lapply(.SD, as.factor)]
Answer 0 (score: 2)
Set the factor level to NA:
levels(DT$V1)[levels(DT$V1) == 'B'] <- NA
Example:
> d <- data.table(l = factor(LETTERS[1:3]))
> d
   l
1: A
2: B
3: C
> levels(d$l)[levels(d$l) == 'B'] <- NA
> d
      l
1:    A
2: <NA>
3:    C
> levels(d$l)
[1] "A" "C"
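To scrub several columns at once, the same level replacement can be repeated per column. The following is a minimal sketch, not part of the original answer: the small table and the target value "B" are illustrative assumptions, and `set()` is used only to write each modified column back without copying the whole table.

```r
library(data.table)

# Small illustrative table (hypothetical data, not the 2M-row table from the question)
DT <- data.table(V1 = factor(c("A", "B", "C")),
                 V2 = factor(c("B", "B", "D")))

for (cname in names(DT)) {
  x  <- DT[[cname]]
  lv <- levels(x)
  lv[lv == "B"] <- NA            # mark the unwanted level
  levels(x) <- lv                # `levels<-` drops the NA level and recodes matching entries to NA
  set(DT, j = cname, value = x)  # write the column back by reference
}

levels(DT$V1)   # "B" is no longer a level
is.na(DT$V2)    # entries that were "B" are now missing
```

Because only the levels are touched, the columns never pass through character, which is the conversion the question wanted to avoid.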
Answer 1 (score: 2)
You can change the levels as follows:
for (j in seq_along(DT)) {
  x = DT[[j]]
  lx = levels(x)
  lx[lx == "B"] = NA
  setattr(x, 'levels', lx) ## reset levels by reference
}
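One caveat worth checking before relying on the loop above: overwriting the `levels` attribute with `setattr()` keeps NA as a level instead of dropping it, so the underlying integer codes are unchanged. The sketch below compares the two approaches on a small hypothetical vector; it is an illustration of base R factor semantics rather than anything stated in the answer.

```r
library(data.table)

x <- factor(c("A", "B", "C"))
y <- factor(c("A", "B", "C"))

# Approach from the loop above: overwrite the levels attribute by reference
lx <- levels(x)
lx[lx == "B"] <- NA
setattr(x, "levels", lx)

# Approach from the other answer: use `levels<-`, which drops the NA level
levels(y)[levels(y) == "B"] <- NA

as.character(x)  # the "B" entry reads as NA once converted to character
is.na(x)         # but is.na() looks at the integer codes, which are still non-missing
is.na(y)         # here the "B" entry is a genuine missing value
```

If downstream code uses `is.na()` on the factor itself, the `levels<-` form is the safer choice; the `setattr()` form is fine when the column is only printed, tabulated, or converted to character.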