My dataset has 6 columns and 4 rows, and contains some NA values.
a=c(5,6,7,12)
b=c(7,2,3,4)
c=c(8,8,21)
d=c(1,1)
e=c(1,2,5,9)
f=c(20,3,11)
length(c)=4
length(d)=4
length(f)=4
z=data.frame(a,b,c,d,e,f)
a b c d e f
5 7 8 1 1 20
6 2 8 1 2 3
7 3 21 NA 5 11
12 4 NA NA 9 NA
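Here is what I need to do: for columns a, c, d and f, if a value is less than 6 or greater than 12, I need to set it to NA. If the value is already NA, nothing changes (it stays NA).
I could do this with ifelse on each column, but my data contains dozens of columns, so I am wondering whether there is a more efficient way to do it.
The final data is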
a b c d e f
NA 7 8 NA 1 NA
6 2 8 NA 2 NA
7 3 NA NA 5 11
12 4 NA NA 9 NA
Answer 0 (score: 5)
Using a vector of column names ('v1'), we subset the dataset 'z' and set the elements for which the logical condition is TRUE to NA with the replacement form of is.na.
v1 <- c('a', 'c', 'd', 'f')
is.na(z[v1]) <- z[v1] < 6 | z[v1] >12
z
# a b c d e f
#1 NA 7 8 NA 1 NA
#2 6 2 8 NA 2 NA
#3 7 3 NA NA 5 11
#4 12 4 NA NA 9 NA
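To see what the is.na assignment does in isolation, here is a minimal sketch on a single vector. The vector 'x' and the name 'out_of_range' are made up for illustration and are not part of the question's data.

```r
# Toy vector (illustrative only), including one value that is already NA
x <- c(5, 6, 7, 12, NA, 20)

# Logical mask: TRUE where the value lies outside [6, 12].
# Comparing NA gives NA, so the mask is NA at position 5.
out_of_range <- x < 6 | x > 12

# Replacement form of is.na: positions where the mask is TRUE become NA.
# The existing NA is left as it is.
is.na(x) <- out_of_range

x
```

The same assignment works column by column when the left-hand side is z[v1] and the mask is the logical matrix z[v1] < 6 | z[v1] > 12.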
Or a faster approach as suggested by @DavidArenburg is
z[v1][z[v1] < 6 | z[v1] > 12] <- NA
Or a data.table option by @DavidArenburg. We convert the 'data.frame' to a 'data.table' (setDT(z)), loop through the columns specified in 'v1' and set the elements that meet the condition to NA. This would be much faster as the overhead of [.data.table is avoided.
library(data.table)
setDT(z)
for(j in v1){
set(z, i = which(z[[j]] < 6 | z[[j]] > 12), j = j, value = NA_integer_)
}
z
# a b c d e f
#1: NA 7 8 NA 1 NA
#2: 6 2 8 NA 2 NA
#3: 7 3 NA NA 5 11
#4: 12 4 NA NA 9 NA
Answer 1 (score: 2)
I think another alternative can simplify the syntax without giving up speed:
z[v1] <- replace(z, z < 6 | z > 12, NA)[v1]
A more efficient variant, suggested by @akrun, is to combine lapply with replace:
z[v1] <- lapply(z[v1], function(x) replace(x, x < 6 | x > 12, NA))
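If the bounds need to vary, the lapply and replace pattern can be wrapped in a small helper. This is only a sketch: 'na_outside', 'lower' and 'upper' are hypothetical names introduced here, and the data frame is rebuilt from the question's values so the block runs on its own.

```r
# Hypothetical helper (not from the original answers): set values outside
# [lower, upper] to NA. which() drops NA comparisons, so existing NAs stay NA.
na_outside <- function(x, lower, upper) {
  replace(x, which(x < lower | x > upper), NA)
}

# The question's data, written out with explicit NAs
z <- data.frame(
  a = c(5, 6, 7, 12),
  b = c(7, 2, 3, 4),
  c = c(8, 8, 21, NA),
  d = c(1, 1, NA, NA),
  e = c(1, 2, 5, 9),
  f = c(20, 3, 11, NA)
)

# Columns to clean; b and e are left untouched
v1 <- c("a", "c", "d", "f")
z[v1] <- lapply(z[v1], na_outside, lower = 6, upper = 12)

z
```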
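Some benchmarks with 5000 columns, 10000 rows and 2500 columns to replace seem to indicate that in most cases this won't break the bank, and that the lapply solution is very competitive with other packages such as data.table.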
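A comparison of this kind could be set up roughly as follows, using only base R. This is a sketch under assumptions: the dimensions are much smaller than the 10000 x 5000 case so it finishes quickly, the names 'big', 'cols', 'res_mask' and 'res_lapply' are made up, and the timings depend on the machine and R version.

```r
set.seed(1)

# Assumed sizes, far smaller than the benchmark described above
n_row <- 1000
n_col <- 200

# Random integers in 1..20, so some values fall outside [6, 12]
big <- as.data.frame(
  matrix(sample(1:20, n_row * n_col, replace = TRUE), nrow = n_row)
)

# Replace values in the first half of the columns only
cols <- names(big)[seq_len(n_col / 2)]

# Approach 1: logical-matrix indexing
res_mask <- big
t_mask <- system.time({
  res_mask[cols][res_mask[cols] < 6 | res_mask[cols] > 12] <- NA
})

# Approach 2: lapply + replace, column by column
res_lapply <- big
t_lapply <- system.time({
  res_lapply[cols] <- lapply(res_lapply[cols],
                             function(x) replace(x, x < 6 | x > 12, NA))
})

# Both approaches should produce the same data frame
stopifnot(isTRUE(all.equal(res_mask, res_lapply)))

# Elapsed times in seconds; values vary between runs
rbind(mask = t_mask, lapply = t_lapply)[, "elapsed"]
```

For more stable numbers, each approach would normally be repeated several times rather than timed once.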
Answer 2 (score: 1)
Here is another option
library(reshape2)
library(data.table)
# reshape2::melt on a matrix names the index columns Var1 and Var2
df = setDT(reshape2::melt(as.matrix(z)))
dcast(df[df[, .I[(value<6|value>12) & !Var2 %in% c('b', 'e')], by = 1:nrow(df)]$V1,
value := NA], Var1 ~ Var2, value.var = "value")[, -1, with = FALSE]
# a b c d e f
#1: NA 7 8 NA 1 NA
#2: 6 2 8 NA 2 NA
#3: 7 3 NA NA 5 11
#4: 12 4 NA NA 9 NA