Replacing missing data in categorical variables in R

Time: 2014-08-11 10:36:40

Tags: r function na

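I'm not sure how to write a function that replaces NA values across a set of categorical vectors.

Consider the following: I have a categorical vector containing NA values, and I want to replace those NAs according to the proportions of the observed data.

For example,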

a<-factor(c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA, NA))

I wrote the following code:

a[is.na(a)]<-sample(c("yes","no"),sum(is.na(a)),replace=TRUE,
prob=c(sum(na.omit(a=="yes"))/sum(!is.na(a)),sum(na.omit(a=="no"))/sum(!is.na(a)))) 

## replace NA with yes or no according to the proportion of yes/no in the non-NA data
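
The probabilities in the call above are written out by hand for two categories. As a rough sketch (not part of the original code), the same proportions can be taken from `prop.table(table(a))`, which works for any number of levels because `table()` leaves NA out by default. The variable name `probs` is introduced here only for illustration.

```r
# Same example vector as above: 9 "yes", 3 "no", 2 NA
a <- factor(c("yes", "no", "no", "yes", "yes", "yes", "no",
              "yes", "yes", "yes", "yes", "yes", NA, NA))

# Observed proportion of each level, in the order of levels(a); NA is excluded
probs <- prop.table(table(a))

# Draw one replacement per NA, weighted by the observed proportions
a[is.na(a)] <- sample(levels(a), sum(is.na(a)), replace = TRUE,
                      prob = as.numeric(probs))
```

Because the weights follow the order of `levels(a)`, the category names never have to be typed in.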

The code above works fine, but I now have a data frame containing many categorical variables. For example:

a<-c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA, NA)
b<-c("red","blue","white","red","blue","red","blue","red","blue","red","blue",NA,NA,NA)
c<-c(1,3,2,1,2,3,1,2,3,1,2,3,NA,NA)

a<-as.factor(a)   ## ensure the vectors are treated as categorical variable
b<-as.factor(b)
c<-as.factor(c)

df<-data.frame(a=a,b=b,c=c)

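I'm struggling to write a function that would let me replace the NA values in all categorical variables of a data frame like this one. Note that each variable may have more than two categories.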

1 answer:

Answer 0: (score: 1)

I would create a small helper function and do the following:

helperFunc <- function(x){
  sample(levels(x), sum(is.na(x)), replace = TRUE,
         prob = as.numeric(table(x))/sum(!is.na(x)))   
}

df[sapply(df, is.na)]  <- unlist(sapply(df, helperFunc))
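
The one-line assignment relies on two things: `sapply(df, is.na)` returns a logical matrix with the same shape as the data frame, and assigning through such a matrix fills the selected cells column by column, which is the order in which `unlist(sapply(df, helperFunc))` lists its draws. A small sketch of that behaviour, using a toy data frame (`toy` and `mask` are illustrative names, not from the original post):

```r
toy <- data.frame(x = c("a", NA, "b"),
                  y = c(NA, "c", NA),
                  stringsAsFactors = FALSE)

mask <- sapply(toy, is.na)   # 3 x 2 logical matrix, TRUE where a cell is NA

# Replacement values are consumed column by column: first the NA in x,
# then the two NAs in y
toy[mask] <- c("x2", "y1", "y3")
toy
```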

Testing this on the original df with a random seed (for example, 123):

set.seed(123)
df[sapply(df, is.na)]  <- unlist(sapply(df, helperFunc))
df
#      a     b c
# 1  yes   red 1
# 2   no  blue 3
# 3   no white 2
# 4  yes   red 1
# 5  yes  blue 2
# 6  yes   red 3
# 7   no  blue 1
# 8  yes   red 2
# 9  yes  blue 3
# 10 yes   red 1
# 11 yes  blue 2
# 12 yes   red 3
# 13 yes  blue 2
# 14  no white 3
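
If the data frame also holds non-categorical columns, one possible variation is to impute column by column and skip anything that is not a factor. The following is only a sketch under that assumption: `impute_col` and the small data frame `df2` are hypothetical names that do not appear in the original answer.

```r
# Hypothetical helper: fill NAs in a factor by sampling from its observed levels.
# Non-factor columns, columns without NA, and all-NA columns are returned unchanged.
impute_col <- function(x) {
  if (!is.factor(x) || !anyNA(x) || all(is.na(x))) return(x)
  missing <- is.na(x)
  # Counts work as weights; sample() does not need them to sum to one
  x[missing] <- sample(levels(x), sum(missing), replace = TRUE,
                       prob = as.numeric(table(x)))
  x
}

df2 <- data.frame(
  a = factor(c("yes", "no", "no", "yes", NA)),
  b = factor(c("red", "blue", "white", NA, NA)),
  n = c(1.5, NA, 3, 4, 5)   # numeric column, deliberately left alone
)

set.seed(123)
df2[] <- lapply(df2, impute_col)   # df2[] keeps the data frame structure
colSums(is.na(df2))                # remaining NA count per column
```

Assigning into `df2[]` keeps the result a data frame, and each column is handled independently of how many NAs the other columns contain.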