How to impute missing values with the group mean and replace existing values

Time: 2019-08-13 08:54:45

Tags: r

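I have a longitudinal dataset that contains a person's height at the first visit. The other rows are empty. Sometimes, however, a person has two values, and the two values differ. I want to replace the missing values with the group mean, and also replace the existing values with the mean. I tried: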

data$variable <- ave(data$variable, data$group, 
                     FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

This code replaces the missing values with the mean height, but it leaves the existing heights unchanged.

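2 answers:

Answer 0: (score: 1)

My understanding is that the missing values should be replaced by the group mean, and that any ID with duplicate rows within a group should get the mean of its two values.

So you need two steps: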

data$variable <- ave(data$variable, data$group, 
                     FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

data$variable <- ave(data$variable, data$group, data$ID,
                     FUN = mean)
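
As a quick check of the two steps above, here is a minimal sketch on a made-up data frame. The values are hypothetical and only the column names `group`, `ID` and `variable` are taken from the answer:

```r
# Hypothetical toy data: ID 1 has two differing heights, IDs 2 and 5 are missing
data <- data.frame(
  group    = c("a", "a", "a", "a", "b", "b"),
  ID       = c(1, 1, 2, 3, 4, 5),
  variable = c(170, 174, NA, 178, 160, NA)
)

# Step 1: fill each NA with the mean of its group
data$variable <- ave(data$variable, data$group,
                     FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# Step 2: within each group/ID combination, replace the values by their mean
data$variable <- ave(data$variable, data$group, data$ID,
                     FUN = mean)

print(data)
```

In this sketch the two rows of ID 1 both become 172, the missing row of ID 2 receives the group "a" mean of 174, and ID 3 keeps its single value of 178.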

With dplyr syntax, you can do this:

library(dplyr)

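data <- data %>%
  group_by(group) %>%
  # fill each NA with the mean of its group
  mutate(variable = coalesce(variable, mean(variable, na.rm = TRUE))) %>%
  # add ID to the existing grouping (`.add` replaced `add` in dplyr 1.0.0)
  group_by(ID, .add = TRUE) %>%
  # average the values within each group/ID combination
  mutate(variable = mean(variable)) %>%
  ungroup()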

And with data.table:

library(data.table)

setDT(data)
data[, variable := ifelse(is.na(variable), mean(variable, na.rm = TRUE), variable), by = group]
data[, variable := mean(variable), by = .(ID, group)]

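Answer 1: (score: 0)

Here is an example of replacing missing values with the mean of each group (species, in this case). Admittedly, it is not the most elegant solution.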

library(tidyverse)

# create example data: set 20 randomly chosen values of Petal.Length to NA
set.seed(4)
row_with_na <- sample(1:nrow(iris), 20)
iris[row_with_na, "Petal.Length"] <- NA

# compute the mean of Petal.Length for each Species
ref <- iris %>% group_by(Species) %>% summarise(mean_petal_length = mean(Petal.Length, na.rm = TRUE))

# replace each NA with the mean of its Species
# ref is a tibble, so ref[rows, "col"] would return a 1x1 tibble rather than a number;
# index the column vector instead to get a plain numeric value
iris %>% mutate(Petal.Length = ifelse(is.na(Petal.Length) & Species == "setosa", ref$mean_petal_length[ref$Species == "setosa"],
                                      ifelse(is.na(Petal.Length) & Species == "versicolor", ref$mean_petal_length[ref$Species == "versicolor"],
                                             ifelse(is.na(Petal.Length) & Species == "virginica", ref$mean_petal_length[ref$Species == "virginica"], Petal.Length))))
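
The same replacement can probably be written without hard-coding each species name by computing the mean inside a grouped `mutate()`. This is a minimal sketch, assuming dplyr is installed. The names `iris_na` and `iris_filled` are introduced here only for illustration:

```r
library(dplyr)

# Hypothetical example data: a copy of iris with 20 Petal.Length values set to NA
set.seed(4)
iris_na <- iris
iris_na[sample(nrow(iris_na), 20), "Petal.Length"] <- NA

# Within each Species, fill NA with that species' mean and keep observed values
iris_filled <- iris_na %>%
  group_by(Species) %>%
  mutate(Petal.Length = ifelse(is.na(Petal.Length),
                               mean(Petal.Length, na.rm = TRUE),
                               Petal.Length)) %>%
  ungroup()

# Number of missing values before and after the replacement
sum(is.na(iris_na$Petal.Length))
sum(is.na(iris_filled$Petal.Length))
```

Because the mean is evaluated per group, this version needs no separate lookup table and does not have to be edited if the set of groups changes.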