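Replace NA with the mean computed over the subset of rows matching another column?

Time: 2016-07-31 19:30:28

Tags: r dataframe

I have data where each row contains a person's gender and weight (in pounds):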

genders <- c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE", "MALE")
weights <- c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)

df <- data.frame(gender=genders, weight=weights)
df
#   gender weight
# 1 FEMALE    110
# 2 FEMALE    120
# 3 FEMALE    112
# 4 FEMALE     NA
# 5 FEMALE     NA
# 6   MALE    190
# 7   MALE    202
# 8   MALE    195
# 9   MALE     NA

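For each row with NA in the weight column, I want to replace (impute) the NA with the mean weight, but the mean should be computed using only the rows whose gender value matches that of the row with the NA.

Specifically, rows 4 and 5 have gender FEMALE and weight NA. I want to replace those NAs with the mean weight computed over the subset of rows whose gender is FEMALE. In this case, the mean would be (110 + 120 + 112) / 3 = 114.0, computed from rows 1, 2, and 3.

Similarly, I want to replace the NA in row 9 with the mean weight for the MALE gender.

I tried the following command, but it replaces the NAs with the mean weight across all people of both genders, which is not what I want.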

df$weight[is.na(df$weight)] <- mean(subset(df, gender=df$gender)$weight, na.rm=T)
df
#   gender   weight
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 154.8333
# 5 FEMALE 154.8333
# 6   MALE 190.0000
# 7   MALE 202.0000
# 8   MALE 195.0000
# 9   MALE 154.8333
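
A likely reason the attempt above produces the overall mean: in subset(df, gender=df$gender), `gender=` is a named argument rather than a logical condition (which would use `==`), so no rows are filtered out. The sketch below illustrates this under that reading; the names unfiltered, overall_mean, and group_means are introduced here only for illustration, and suppressWarnings() is used because some R versions may warn about the unused argument.

```r
df <- data.frame(
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE"),
  weight = c(110, 120, 112, NA, NA, 190, 202, 195, NA)
)

# `gender = ...` is passed as a named argument, not as a filter condition,
# so subset() keeps every row of df.
unfiltered <- suppressWarnings(subset(df, gender = df$gender))
nrow(unfiltered) == nrow(df)

# Mean over all non-missing weights, regardless of gender.
overall_mean <- mean(unfiltered$weight, na.rm = TRUE)

# Per-gender means, which is what the question actually asks for.
group_means <- tapply(df$weight, df$gender, mean, na.rm = TRUE)
group_means
```

Even with a proper condition such as gender == "FEMALE", a single call to mean() would still yield one number for all NA rows, so the computation has to be done per group.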

我搜索了其他问题,但它们与我的问题并不完全相同:

Replace NA with mean matching the same ID

How to replace NA with mean by subset in R (impute with plyr?)

How to replace NA values in a table for selected columns? data.frame, data.table

4 Answers:

Answer 0 (score: 7)

You can use ave() together with replace() (or a standard manual replacement).

df$weight <- with(df, ave(weight, gender,
    FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))))

which gives

  gender   weight
1 FEMALE 110.0000
2 FEMALE 120.0000
3 FEMALE 112.0000
4 FEMALE 114.0000
5 FEMALE 114.0000
6   MALE 190.0000
7   MALE 202.0000
8   MALE 195.0000
9   MALE 195.6667
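
The "standard manual replacement" mentioned above could be sketched as follows. The helper name fill_with_mean is hypothetical and not part of the answer; it does the same job as replace(), using plain indexed assignment.

```r
df <- data.frame(
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE"),
  weight = c(110, 120, 112, NA, NA, 190, 202, 195, NA)
)

# Hypothetical helper: fill the NAs in a vector with that vector's own mean.
fill_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() applies the helper within each gender group and returns the
# results in the original row order.
df$weight <- ave(df$weight, df$gender, FUN = fill_with_mean)
df
```

Note that if every weight in a group is NA, mean(x, na.rm = TRUE) returns NaN, so that group's values would become NaN rather than a number.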

Answer 1 (score: 4)

You can group the data frame by gender, compute the mean weight within each group, and replace the NAs with an ifelse statement. In dplyr, this could be:

library(dplyr)
df %>% 
      group_by(gender) %>% 
      mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = TRUE), weight))

# Source: local data frame [9 x 2]
# Groups: gender [2]

#  gender   weight
#  <fctr>    <dbl>
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 114.0000
# 5 FEMALE 114.0000
# 6   MALE 190.0000
# 7   MALE 202.0000
# 8   MALE 195.0000
# 9   MALE 195.6667
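
The pipeline above only prints its result and leaves it grouped. A possible variant, assuming dplyr is installed, stores the result and drops the grouping afterwards; it also swaps ifelse() for dplyr's coalesce(), which is a substitution made here and not part of the answer. The name df_filled is introduced for illustration.

```r
library(dplyr)

df <- data.frame(
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE"),
  weight = c(110, 120, 112, NA, NA, 190, 202, 195, NA)
)

df_filled <- df %>%
  group_by(gender) %>%
  # coalesce() keeps weight where present and falls back to the group mean.
  mutate(weight = coalesce(weight, mean(weight, na.rm = TRUE))) %>%
  # Drop the grouping so later operations act on the whole table.
  ungroup()

df_filled
```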

Answer 2 (score: 2)

Using base R, this seems to be what you are looking for:

df$weight[df$gender=="FEMALE" & is.na(df$weight)] <- mean(df$weight[df$gender=="FEMALE"], na.rm=TRUE)
df$weight[df$gender=="MALE" & is.na(df$weight)] <- mean(df$weight[df$gender=="MALE"], na.rm=TRUE)

> df
  gender   weight
1 FEMALE 110.0000
2 FEMALE 120.0000
3 FEMALE 112.0000
4 FEMALE 114.0000
5 FEMALE 114.0000
6   MALE 190.0000
7   MALE 202.0000
8   MALE 195.0000
9   MALE 195.6667
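
The two assignments above hard-code the gender labels. As a rough generalization, the same idea can be written as a loop over whatever values the gender column contains; the variable names g, in_group, and group_mean are made up for this sketch.

```r
df <- data.frame(
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE"),
  weight = c(110, 120, 112, NA, NA, 190, 202, 195, NA)
)

# Repeat the same assignment once per distinct gender value.
for (g in as.character(unique(df$gender))) {
  in_group <- df$gender == g
  # Mean of the observed (non-NA) weights in this group.
  group_mean <- mean(df$weight[in_group], na.rm = TRUE)
  # Fill only the rows of this group whose weight is missing.
  df$weight[in_group & is.na(df$weight)] <- group_mean
}
df
```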

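Answer 3 (score: 1)

This can be done easily with na.aggregate from zoo. Convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'gender', apply na.aggregate to 'weight' to replace the NA elements with the mean value. By default, na.aggregate uses mean, but the FUN argument can be changed to median, sum, etc.

library(data.table)
library(zoo)
setDT(df)[, weight := na.aggregate(weight), by = gender]

Alternatively, this can be done with ave from base R.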
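
A minimal sketch of how ave and na.aggregate might be combined, assuming the zoo package is installed. This is an illustration, not necessarily the answer's own code.

```r
library(zoo)

df <- data.frame(
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE"),
  weight = c(110, 120, 112, NA, NA, 190, 202, 195, NA)
)

# ave() splits weight by gender; na.aggregate() then fills the NAs in each
# piece with that piece's mean (its default FUN).
df$weight <- ave(df$weight, df$gender, FUN = na.aggregate)
df
```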