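I have a data frame named final_project_data with the following structure. It has 17 columns holding data by county/state and year. For example, Baldwin County, Alabama had a population of 69162 and an unemployment rate of 4.2% in 2006.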
ID     County          State    Population  Year   Ump.Rate  Fertility
<dbl>  <chr>           <chr>    <dbl>       <dbl>  <dbl>     <dbl>
1003   Baldwin County  Alabama  69162       2006   4.2       88
1015   Calhoun County  Alabama  112903      2006   2.4       na
1043   Baldwin County  Alabama  na          2007   1.9       71
1049   Calhoun County  Alabama  68014       2007   na        90
1050   CountyY         Alaska   2757        2006   3.9       na
1070   CountyZ         Alaska   11000       2006   7.8       95
1081   CountyY         Alaska   na          2007   6.5       70
1082   CountyZ         Alaska   67514       2007   4.5       60
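Many of the columns have missing values, which I am trying to replace with the mean for the given State and Year. I am having trouble looping over each column with missing values, and then over each subset of years and rows, to fill the missing values with the mean. My code so far is as follows: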
#get list of unique states
states <- unique(final_project_data$State)
#get list of columns with na in them - we will use this to impute missing values
list_na <- colnames(final_project_data)[ apply(final_project_data, 2, anyNA) ]
list_na
#create a place to hold the missing values
average_missing <- c()
#Loop through each state to impute the missing values with the mean
for(i in 1:length(states)){
average_missing <- apply(final_project_data[which(final_project_data$State == states[i]),colnames(final_project_data) %in% list_na], 2, mean, na.rm = TRUE)
}
average_missing
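However, when I run the code above, I only get one set of values for each column with missing values, rather than a different set for each state. I am also not sure how to extend this to include years. Any help or suggestions would be greatly appreciated!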
Answer 0 (score: 0)
With a for loop:
dt <- data.frame(
  ID = c(1003, 1015, 1043, 1049, 1050, 1070, 1081, 1082, NA, NA),
  State = c(rep("Alabama", 4), rep("Alaska", 4), "Alabama", "Alaska"),
  Population = c(sample(10000:100000, 8, replace = TRUE), NA, NA),
  Year = c(2006, 2006, 2007, 2007, 2006, 2006, 2007, 2007, 2007, 2006),
  Unemployment = c(sample(1:5, 8, replace = TRUE), NA, NA)
)
# loop over each row of the data frame
for (i in seq_len(nrow(dt))) {
  # if Population is NA in this row
  if (is.na(dt$Population[i])) {
    # replace it with the mean Population of all rows with the same State and Year
    dt$Population[i] <- mean(dt$Population[which(dt$State == dt$State[i] & dt$Year == dt$Year[i])], na.rm = TRUE)
  }
  # repeat for the Unemployment variable
  if (is.na(dt$Unemployment[i])) {
    dt$Unemployment[i] <- mean(dt$Unemployment[which(dt$State == dt$State[i] & dt$Year == dt$Year[i])], na.rm = TRUE)
  }
}
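The row-by-row loop above needs one if block per column. As a rough sketch of how the same idea could cover every column with missing values at once, base R's ave() can apply a function within each State/Year group. The toy data frame df and the helper fill_with_mean below are made up for illustration and are not part of the original answer.

```r
# Toy data with the question's column names (values are invented for illustration).
df <- data.frame(
  State      = c("Alabama", "Alabama", "Alabama", "Alaska", "Alaska", "Alaska"),
  Year       = c(2006, 2006, 2006, 2007, 2007, 2007),
  Population = c(100, NA, 300, 10, 30, NA),
  Ump.Rate   = c(4, 2, NA, NA, 6, 8)
)

# Hypothetical helper: replace the NAs in x with the mean of its non-missing values.
fill_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Names of the numeric columns that contain at least one NA.
cols_na <- names(df)[vapply(df, function(x) is.numeric(x) && anyNA(x), logical(1))]

# ave() runs the helper inside each State/Year group and keeps the original row order.
for (col in cols_na) {
  df[[col]] <- ave(df[[col]], df$State, df$Year, FUN = fill_with_mean)
}

df
```

One caveat with this approach: if every value in a State/Year group is missing, the group mean is NaN, so those cells are filled with NaN rather than a number and would need a separate fallback.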
Answer 1 (score: 0)
Here is a dplyr version, with no loops. Just add every column you want to transform inside vars():
library(dplyr)

your_data %>%
  group_by(State, Year) %>%
  mutate_at(vars(Population, Ump.Rate, Fertility),
            ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
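In dplyr 1.0.0 and later, mutate_at() is superseded by across(), so the same grouped imputation can probably be written as below. This is a sketch on an invented data frame named toy, not a run against the asker's data; it assumes dplyr 1.0.0 or newer is installed.

```r
library(dplyr)

# Toy data with the question's column names (values are invented for illustration).
toy <- data.frame(
  State      = c("Alabama", "Alabama", "Alabama", "Alaska", "Alaska", "Alaska"),
  Year       = c(2006, 2006, 2006, 2007, 2007, 2007),
  Population = c(100, NA, 300, 10, 30, NA),
  Ump.Rate   = c(4, 2, NA, NA, 6, 8),
  Fertility  = c(88, NA, 90, 70, NA, 60)
)

filled <- toy %>%
  group_by(State, Year) %>%
  # Within each State/Year group, swap each NA for that group's column mean.
  mutate(across(c(Population, Ump.Rate, Fertility),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  # Drop the grouping so later operations work on the whole table again.
  ungroup()

filled
```

To cover all numeric columns without naming them, the column selection could be swapped for where(is.numeric), although that would also touch ID-like columns, so listing the columns explicitly may be the safer choice.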