Looping through aggregated data in R

Date: 2017-02-28 14:02:34

Tags: r dataframe aggregate na

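I am trying to impute the missing values in a specific column of a data frame.

My intent is to replace each missing value with the mean of its group, where the groups are defined by another column.

I have used aggregate to save the summarized results: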

# Replace LotFrontage missing values by Neighborhood mean
lot_frontage_by_neighborhood = aggregate(LotFrontage ~ Neighborhood, combined, mean)

Now I would like to implement something like this:

for key, group in lot_frontage_by_neighborhood:
    idx = (combined["Neighborhood"] == key) & (combined["LotFrontage"].isnull())
    combined[idx, "LotFrontage"] = group.median() 

This is Python code, of course.

I am not sure how to do this in R. Can anyone help?

For example:

Neighborhood  LotFrontage
     A            20
     A            30
     B            20
     B            50
     A           <NA>

The NA record should be replaced with 25 (the mean LotFrontage of all records in Neighborhood A).

Thanks

1 answer:

Answer 0 (score: 1)

Is this the idea you are after? You will probably need the which() function to determine which rows have NA values.

set.seed(1)
Neighborhood = sample(letters[1:4], 10, TRUE)
LotFrontage = rnorm(10,0,1)
LotFrontage[sample(10, 2)] = NA

# This data frame has 2 columns. The LotFrontage column has 2 missing values.
df = data.frame(Neighborhood = Neighborhood, LotFrontage = LotFrontage)

# Sets the missing values in the LotFrontage column to the mean of the LotFrontage values from the rows with the same Neighborhood
missing <- which(is.na(df$LotFrontage))
x <- as.character(df$Neighborhood[missing])
f <- function(n) mean(df$LotFrontage[df$Neighborhood == n], na.rm = TRUE)
# vapply returns a numeric vector; lapply would return a list and turn the column into a list
df$LotFrontage[missing] <- vapply(x, f, numeric(1))
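
The table already built with aggregate() in the question can also be used directly as a lookup, which avoids an explicit loop. Below is a minimal sketch under that assumption, using the five-row example from the question; the names `missing`, `idx` and `group_mean` are made up for illustration and do not come from the original post.

```r
# Example data from the question
combined <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "A"),
  LotFrontage  = c(20, 30, 20, 50, NA),
  stringsAsFactors = FALSE
)

# Per-neighborhood means; the formula interface omits NA rows by default
lot_frontage_by_neighborhood <- aggregate(LotFrontage ~ Neighborhood, combined, mean)

# For each row with a missing value, find its neighborhood in the aggregated table
missing <- is.na(combined$LotFrontage)
idx <- match(combined$Neighborhood[missing],
             lot_frontage_by_neighborhood$Neighborhood)

# Fill the missing values with the matching group means
combined$LotFrontage[missing] <- lot_frontage_by_neighborhood$LotFrontage[idx]

# Alternative without the intermediate table: ave() returns each row's group mean
group_mean <- ave(c(20, 30, 20, 50, NA), combined$Neighborhood,
                  FUN = function(v) mean(v, na.rm = TRUE))
```

With this data the fifth row receives the Neighborhood A mean. If a neighborhood has no non-missing LotFrontage at all, it will not appear in the aggregated table, so match() returns NA and that row stays missing.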