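I have a longitudinal dataset that records a person's height at the first visit; the remaining rows for that person are empty. Sometimes, however, a person has two height values, and the two differ. I want to replace the missing values with the group mean and also replace the existing values with the mean. I tried: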
data$variable <- ave(data$variable, data$group,
FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
This code replaces the missing values with the mean height, but it leaves the existing heights unchanged.
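Answer 0: (score: 1)
My understanding is that missing values should be replaced by the group mean, and that any ID with more than one value within a group should get the mean of its values.
So you need two steps: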
data$variable <- ave(data$variable, data$group,
FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
data$variable <- ave(data$variable, data$group, data$ID,
FUN = mean)
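To see what the two ave() calls above do, here is a minimal sketch on made-up data. The column names follow the question; the values, group labels, and IDs are hypothetical.

```r
# Hypothetical toy data: names follow the question, values are invented.
data <- data.frame(
  group    = c("a", "a", "a", "a", "b", "b"),
  ID       = c(1, 1, 2, 2, 3, 3),
  variable = c(170, 174, 160, NA, NA, 180)
)

# Step 1: fill each NA with the mean of the observed values in its group.
data$variable <- ave(data$variable, data$group,
                     FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# Step 2: replace every value with the mean of its (group, ID) combination.
data$variable <- ave(data$variable, data$group, data$ID, FUN = mean)

print(data)
```

Note that the value imputed in step 1 takes part in the per-ID mean in step 2: ID 2 ends up at 164, the mean of its observed 160 and the imputed group mean 168. If only observed values should count for an ID that has at least one, the order of the steps would need to change.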
Using dplyr syntax, you can do this:
library(dplyr)
data <- data%>%
group_by(group)%>%
mutate(variable = coalesce(variable, mean(variable, na.rm = TRUE)))%>%
group_by(ID, .add = TRUE)%>%  # .add replaces the deprecated add argument (dplyr >= 1.0.0)
mutate(variable = mean(variable))%>%
ungroup()
And with data.table:
library(data.table)
setDT(data)
data[, variable := ifelse(is.na(variable), mean(variable, na.rm = T), variable), by = group]
data[, variable := mean(variable), by = .(ID, group)]
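Answer 1: (score: 0)
Here is an example of replacing missing values with the mean of each group (the species, in this case). Admittedly, it is not the most elegant solution.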
library(tidyverse)
# creating an example data with NA inserted randomly for 20 values of Petal.Length
set.seed(4)
row_with_na <- sample(1:nrow(iris), 20)
iris[row_with_na, "Petal.Length"] <- NA
# generate the mean of Petal.Length by the Species
ref <- iris %>% group_by(Species) %>% summarise(mean_petal_length = mean(Petal.Length, na.rm=TRUE))
# replace the NA based on Species
# ref is a tibble, so index the column vector to get a plain number rather than a 1x1 tibble
iris %>% mutate(Petal.Length = ifelse(is.na(Petal.Length) & Species == "setosa", ref$mean_petal_length[ref$Species == "setosa"],
ifelse(is.na(Petal.Length) & Species == "versicolor", ref$mean_petal_length[ref$Species == "versicolor"],
ifelse(is.na(Petal.Length) & Species == "virginica", ref$mean_petal_length[ref$Species == "virginica"], Petal.Length))))
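The nested ifelse() needs one branch per species. As an alternative sketch, the per-group means can be kept in a named lookup vector and indexed by species, which works for any number of groups. This uses base R only; `iris_na`, `species_means`, and `missing` are illustrative names, not part of the answer above.

```r
# Work on a copy of iris with 20 randomly chosen Petal.Length values set to NA.
iris_na <- iris
set.seed(4)
iris_na[sample(nrow(iris_na), 20), "Petal.Length"] <- NA

# Per-species means of the observed values, named by species level.
species_means <- tapply(iris_na$Petal.Length, iris_na$Species, mean, na.rm = TRUE)

# Look up the mean for each missing row by its species name and fill it in.
missing <- is.na(iris_na$Petal.Length)
iris_na$Petal.Length[missing] <- species_means[as.character(iris_na$Species[missing])]
```

Only the rows flagged in `missing` are changed; observed values are left as they were.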