Replace missing values with the mode for factor columns and the mean for numeric columns in R

Asked: 2018-04-28 18:17:16

Tags: r replace mean missing-data mode

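I have the following data frame named "train". The columns bflag and zfactor are factors, and the other 2 columns are numeric. I want to replace the missing values in the factor columns with the mode, and the missing values in the numeric columns with the mean, within the same data frame. How can I do this in R?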

ID   bflag  vcount zfactor vnumber
1     0       12      1       12
2     1       NA      0       8
3     0       3       0       9
4     1       13      0       NA
5     1       2       1       2
6     NA      10      NA      NA

2 answers:

Answer 0 (score: 2)

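In base R, you can iterate over the columns and use a simple if statement. We have to define a function for the mode ourselves, because base R does not provide one (the built-in mode() returns an object's storage mode, not its most frequent value).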

df[-1] <- lapply(df[-1], function(x) {
    if(is.factor(x)) replace(x, is.na(x), Mode(na.omit(x)))
    else if(is.numeric(x)) replace(x, is.na(x), mean(x, na.rm=TRUE))
    else x
})

df
#   ID bflag vcount zfactor vnumber
# 1  1     0     12       1   12.00
# 2  2     1      8       0    8.00
# 3  3     0      3       0    9.00
# 4  4     1     13       0    7.75
# 5  5     1      2       1    2.00
# 6  6     1     10       0    7.75
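The same column-wise logic can be wrapped in a small function, so that it can be applied to the "train" data frame from the question without modifying it in place. This is only a sketch: the name impute_simple and its skip argument are hypothetical, and the data frame is rebuilt by hand from the question's table.

```r
# Statistical mode: the most frequent value (the first one wins on ties).
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: returns a copy of `data` in which NAs are replaced by
# the mode (factor columns) or the mean (numeric columns).
# Columns named in `skip` are left untouched.
impute_simple <- function(data, skip = character()) {
  cols <- setdiff(names(data), skip)
  data[cols] <- lapply(data[cols], function(x) {
    if (is.factor(x)) {
      replace(x, is.na(x), Mode(na.omit(x)))
    } else if (is.numeric(x)) {
      replace(x, is.na(x), mean(x, na.rm = TRUE))
    } else {
      x
    }
  })
  data
}

# The question's data, rebuilt by hand.
train <- data.frame(
  ID      = 1:6,
  bflag   = factor(c(0, 1, 0, 1, 1, NA)),
  vcount  = c(12, NA, 3, 13, 2, 10),
  zfactor = factor(c(1, 0, 0, 0, 1, NA)),
  vnumber = c(12, 8, 9, NA, 2, NA)
)

train_imputed <- impute_simple(train, skip = "ID")

# Number of NAs left in each column after imputation.
colSums(is.na(train_imputed))
```

Skipping ID by name plays the same role as df[-1] above, but does not depend on the column's position.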

Data and the Mode function:

df <- read.table(text = "ID   bflag  vcount zfactor vnumber
1     0       12      1       12
2     1       NA      0       8
3     0       3       0       9
4     1       13      0       NA
5     1       2       1       2
6     NA      10      NA      NA", 
colClasses = rep(c("numeric", "factor"), length.out=5), 
header = TRUE)

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

Mode is borrowed from "Is there a built-in function for finding the mode?"
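To see why Mode works, it can help to run its steps one at a time on a small vector. The example vector below is made up for illustration; the last lines show two behaviours worth knowing about: ties go to the value seen first, and NA is counted like any other value, which is why the answer passes na.omit(x) to Mode.

```r
x <- c("a", "b", "b", "c", "b", "a")

ux     <- unique(x)       # distinct values, in order of first appearance
idx    <- match(x, ux)    # position of each element of x within ux
counts <- tabulate(idx)   # number of occurrences of each distinct value
most_common <- ux[which.max(counts)]  # value with the highest count
most_common

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# which.max() returns the first maximum, so on a tie the value that
# appears first in the input wins.
Mode(c(2, 1, 1, 2))

# NA is treated as a regular value by unique() and match(), so it can
# itself come out as the mode unless it is removed first.
Mode(c(NA, NA, 1))
Mode(na.omit(c(NA, NA, 1)))
```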

Answer 1 (score: 2)

dplyr::mutate_if selects columns by type and applies the appropriate operation (mode or mean) to each of them. The solution would be:

library(dplyr)
df %>% mutate_if(is.numeric, funs(replace(.,is.na(.), mean(., na.rm = TRUE)))) %>%
  mutate_if(is.factor, funs(replace(.,is.na(.), Mode(na.omit(.)))))

#   ID bflag vcount zfactor vnumber
# 1  1     0     12       1   12.00
# 2  2     1      8       0    8.00
# 3  3     0      3       0    9.00
# 4  4     1     13       0    7.75
# 5  5     1      2       1    2.00
# 6  6     1     10       0    7.75

Note: the Mode function is taken from @RichScriven's link to "Is there a built-in function for finding the mode?":

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
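funs() has been deprecated since dplyr 0.8.0, and mutate_if() has been superseded by across() since dplyr 1.0.0. The sketch below is one possible way to write the same two steps with across(); it assumes dplyr 1.0.0 or later is installed and rebuilds the data by hand rather than with read.table.

```r
library(dplyr)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# The question's data, rebuilt by hand.
df <- data.frame(
  ID      = c(1, 2, 3, 4, 5, 6),
  bflag   = factor(c(0, 1, 0, 1, 1, NA)),
  vcount  = c(12, NA, 3, 13, 2, 10),
  zfactor = factor(c(1, 0, 0, 0, 1, NA)),
  vnumber = c(12, 8, 9, NA, 2, NA)
)

result <- df %>%
  mutate(
    # Numeric columns: fill NAs with the column mean.
    across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))),
    # Factor columns: fill NAs with the most frequent level.
    across(where(is.factor), ~ replace(.x, is.na(.x), Mode(na.omit(.x))))
  )

result
```

ID is numeric and therefore also passes through the first across() call, but it has no missing values, so it comes out unchanged.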