Replace missing values with the mean of a data frame subset

Time: 2019-04-28 17:51:19

Tags: r

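I have a data frame named final_project_data with the following structure. It has 17 columns holding data for each county/state and year. For example, in 2006 Baldwin County, Alabama had a population of 69162, an unemployment rate of 4.2%, and so on.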

ID          County       State       Population   Year    Ump.Rate Fertility  
<dbl>       <chr>        <chr>       <dbl>        <dbl>   <dbl>    <dbl>
1003    Baldwin County   Alabama     69162        2006     4.2     88
1015    Calhoun County   Alabama     112903       2006     2.4     na
1043    Baldwin County   Alabama     na           2007     1.9     71
1049    Calhoun County   Alabama     68014        2007     na      90
1050    CountyY          Alaska      2757         2006     3.9     na
1070    CountyZ          Alaska      11000        2006     7.8     95
1081    CountyY          Alaska      na           2007     6.5     70
1082    CountyZ          Alaska      67514        2007     4.5     60

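Many of the columns have missing values, which I am trying to replace with the mean for the given State and Year. I am having trouble looping over each column that has missing values, and then over each subset of years and rows, to fill the missing values with the mean. My code so far is below: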

#get list of unique states
states <- unique(final_project_data$State)
#get list of columns with na in them - we will use this to impute missing values
list_na <- colnames(final_project_data)[ apply(final_project_data, 2, anyNA) ]

list_na
#create a place to hold the missing values
average_missing <- c()

#Loop through each state to impute the missing values with the mean
for(i in 1:length(states)){
 average_missing <- apply(final_project_data[which(final_project_data$State == states[i]),colnames(final_project_data) %in% list_na], 2, mean, na.rm =  TRUE) 
 }
average_missing

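However, when I run the code above, I get only one set of values for the columns with missing values, rather than a different set for each state. I am also not sure how to extend this to include years. Any help or advice would be greatly appreciated!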

2 answers:

Answer 0 (score: 0)

With a for loop:

# example data; Population and Unemployment are random, so values differ between runs
dt <- data.frame(
  ID = c(1003, 1015, 1043, 1049, 1050, 1070, 1081, 1082, NA, NA),
  State = c(rep("Alabama", 4), rep("Alaska", 4), "Alabama", "Alaska"),
  Population = c(sample(10000:100000, 8, replace = TRUE), NA, NA),
  Year = c(2006, 2006, 2007, 2007, 2006, 2006, 2007, 2007, 2007, 2006),
  Unemployment = c(sample(1:5, 8, replace = TRUE), NA, NA)
)

# loop over each row of the data frame
for (i in seq_len(nrow(dt))) {

  # rows that share this row's State and Year
  same_group <- dt$State == dt$State[i] & dt$Year == dt$Year[i]

  # if Population is NA, fill it with the mean of the non-missing values in the group
  if (is.na(dt$Population[i])) {
    dt$Population[i] <- mean(dt$Population[same_group], na.rm = TRUE)
  }

  # repeat for Unemployment
  if (is.na(dt$Unemployment[i])) {
    dt$Unemployment[i] <- mean(dt$Unemployment[same_group], na.rm = TRUE)
  }
}
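
The same per-group fill can probably be done without an explicit row loop by using base R's ave(), which applies a function within each group and returns a vector aligned with the original rows. The sketch below is only an illustration: the toy values are made up, and fill_with_mean is a hypothetical helper name that does not come from the answer above.

```r
# Toy data with made-up values (not the asker's real data).
dt <- data.frame(
  State = c("Alabama", "Alabama", "Alabama", "Alaska", "Alaska", "Alaska"),
  Year = c(2006, 2006, 2006, 2007, 2007, 2007),
  Population = c(100, 300, NA, 10, NA, 30),
  Unemployment = c(2, NA, 4, NA, 5, 7)
)

# Hypothetical helper: replace the NAs in x with the mean of its non-missing values.
fill_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() runs the helper separately for every State/Year combination.
for (col in c("Population", "Unemployment")) {
  dt[[col]] <- ave(dt[[col]], dt$State, dt$Year, FUN = fill_with_mean)
}

dt
```

If a State/Year group has no observed values at all for a column, mean() returns NaN, so those cells stay effectively missing and would need a fallback (for example the state-wide mean).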

Answer 1 (score: 0)

Here is a dplyr version with no loops. Just add every column you want to transform inside vars():

library(dplyr)

# within each State/Year group, replace NA with the group mean
your_data %>%
  group_by(State, Year) %>%
  mutate_at(vars(Population, Ump.Rate, Fertility),
            ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
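
In dplyr 1.0.0 and later, mutate_at() is superseded by across(), so the same idea can be sketched as below. This is only an illustration under stated assumptions: the toy values are made up, the column names follow the question, and filled is a hypothetical result name.

```r
library(dplyr)

# Toy data with made-up values; column names follow the question.
your_data <- data.frame(
  State = c("Alabama", "Alabama", "Alabama", "Alaska", "Alaska", "Alaska"),
  Year = c(2006, 2006, 2006, 2007, 2007, 2007),
  Population = c(100, 300, NA, 10, NA, 30),
  Ump.Rate = c(2, NA, 4, NA, 5, 7),
  Fertility = c(80, 90, NA, 60, NA, 70)
)

filled <- your_data %>%
  group_by(State, Year) %>%
  # For each listed column, swap NAs for the mean of that State/Year group.
  mutate(across(
    c(Population, Ump.Rate, Fertility),
    ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))
  )) %>%
  # Drop the grouping so later steps operate on the whole table again.
  ungroup()

filled
```

replace() is used here instead of ifelse() mainly because it keeps the column's attributes. Either should give the same numbers for plain numeric columns.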