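I have a dataset with several columns, one of which is missing chunks of the data I need.
The column with missing data, df$Variable, always belongs to a specific person, df$Name. Whenever data is missing in df$Variable, is there a way to impute the mean for each person rather than the mean of the whole dataset?
I have been using the imputeTS library.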
Answer 0 (score: 1)
Without a reproducible example it is hard to give a definitive answer, but based on what you describe, something like this should work:
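library('tidyverse')
# Example data: two people, with NAs scattered at random
df <- data.frame(Name = c(rep("A", 5), rep("B", 5)),
                 Variable = sample(c(1, 2, 3, NA), 10, replace = TRUE))
df %>%
  group_by(Name) %>%
  # Per-person mean, ignoring missing values
  mutate(non_na_mean = mean(Variable, na.rm = TRUE)) %>%
  ungroup() %>%
  # Keep observed values; fill NAs with that person's mean
  mutate(newVariable = ifelse(is.na(Variable), non_na_mean, Variable))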
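Because sample() gives different data on every run, it can help to check the same idea against a small fixed input. The sketch below is only an illustration: the values in df and the name imputed are made up for the example, and it folds the two mutate() steps into one.

```r
library(dplyr)

# Fixed, illustrative input so the result can be checked by hand
df <- data.frame(
  Name     = c("A", "A", "A", "B", "B", "B"),
  Variable = c(1, NA, 3, 10, 20, NA)
)

imputed <- df %>%
  group_by(Name) %>%
  # Inside a grouped mutate(), mean() is computed separately for each Name
  mutate(newVariable = ifelse(is.na(Variable),
                              mean(Variable, na.rm = TRUE),
                              Variable)) %>%
  ungroup()

print(imputed)
```

Here the missing value for A should be filled with 2 (the mean of 1 and 3) and the one for B with 15 (the mean of 10 and 20), not with the overall mean.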
Answer 1 (score: 0)
Without seeing your data frame, I believe this will work.
set.seed(7)
# make some fake data
df <- data.frame(Name = rep(as.character(c("A", "B", "C", "D")), 10), Variable = sample(1:100, 40))
# change some to NA
df[which(df$Variable > 40),"Variable"] <- NA
# Fill in NA's for D with the mean of D
df[which(df$Name == "D" & is.na(df$Variable)), "Variable"] <-
  mean(df[which(df$Name == "D"), "Variable"], na.rm = TRUE)
You can also loop over the other names (the values of df$Name):
variable_vec <- c("A", "B", "C", "D")
for (name in variable_vec) {
  # Fill this person's NAs with the mean of their observed values
  df[which(df$Name == name & is.na(df$Variable)), "Variable"] <-
    mean(df[which(df$Name == name), "Variable"], na.rm = TRUE)
}
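If you would rather avoid the explicit loop, base R's ave() can compute the per-person mean for every row in one call. This is a minimal sketch on made-up data (the values and the helper names group_mean and missing are assumptions for illustration, not part of your data):

```r
# Illustrative data: person "C" has no observed values at all
df <- data.frame(
  Name     = c("A", "B", "A", "B", "C", "C"),
  Variable = c(4, NA, NA, 8, NA, NA)
)

# ave() applies FUN within each level of Name and returns a vector
# aligned with the original rows
group_mean <- ave(df$Variable, df$Name,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Overwrite only the rows that were missing
missing <- is.na(df$Variable)
df$Variable[missing] <- group_mean[missing]

print(df)
```

One thing to watch for with any of these approaches: if a person has no observed values, mean(..., na.rm = TRUE) returns NaN, so those rows stay unfilled (as with "C" here) and need a separate fallback.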