I currently have a pandas DataFrame containing property information from this Kaggle dataset. Here is a sample DataFrame from that set:
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Annadale | 5 | 5425 | 2015 | ... |
| Woodside | 4 | 2327 | 1966 | ... |
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 405 | 1996 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
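What I'd like to do is take every row where the value in the "year built" column equals zero, and replace the "year built" value in those rows with the mean of the "year built" values in the rows that share the same neighborhood, borough, and block. In some cases a {neighborhood, borough, block} set contains several rows with zero in the "year built" column. The sample DataFrame above shows this.
To illustrate the problem, I've put those two rows in a sample DataFrame.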
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
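To solve this, I want to fill the "year built" value in the rows that have zero with the mean of the "year built" values from all other rows with the same neighborhood, borough, and block. For the example rows, the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would use the following matching rows from the sample DataFrame to compute the mean: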
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
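I would take the mean of the "year built" column over those rows (which is 1987.4) and replace the zeros with that mean. The rows that originally had zeros would end up looking like this: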
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1987.4 | ... |
| Alphabet City | 1 | 396 | 1987.4 | ... |
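What I've done so far is drop the rows with zero in the "year built" column and find the mean year for each {neighborhood, borough, block} set. The original DataFrame is stored in raw_data, and it looks like the sample DataFrame at the very top of this post. The code looks like this: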
# create a copy of the data
temp_data = raw_data.copy()
# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]
# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD","BOROUGH","BLOCK"], as_index = False)["YEAR BUILT"].mean()
The output looks like this:
| neighborhood | borough | block | year built |
------------------------------------------------
| .... | ... | ... | ... |
| Alphabet City | 1 | 390 | 1985.342 |
| Alphabet City | 1 | 391 | 1986.76 |
| Alphabet City | 1 | 392 | 1992.8473 |
| Alphabet City | 1 | 393 | 1990.096 |
| Alphabet City | 1 | 394 | 1984.45 |
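So how can I take those averaged "year built" values from the mean_year_by_location DataFrame and use them to replace the zeros in the original raw_data DataFrame?
I apologize for the long post. I just wanted to be really clear.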
Answer 0 (score: 4)
Use set_index + replace, then fillna with the per-group mean.
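import numpy as np
v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)
# Series.mean(level=...) was removed in pandas 2.0, so group on the index levels instead;
# sort=False keeps the groups in order of first appearance.
df = v.fillna(v.groupby(level=[0, 1, 2], sort=False).mean()).reset_index()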
df
neighborhood borough block year built
0 Annadale 5 5425 2015.0
1 Woodside 4 2327 1966.0
2 Alphabet City 1 396 1985.0
3 Alphabet City 1 405 1996.0
4 Alphabet City 1 396 1986.0
5 Alphabet City 1 396 1992.0
6 Alphabet City 1 396 1987.4
7 Alphabet City 1 396 1990.0
8 Alphabet City 1 396 1984.0
9 Alphabet City 1 396 1987.4
Details
First, set the index and replace 0 with NaN, so that the upcoming mean calculation is not affected by those values:
v = df.set_index(
['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)
v
neighborhood borough block
Annadale 5 5425 2015.0
Woodside 4 2327 1966.0
Alphabet City 1 396 1985.0
405 1996.0
396 1986.0
396 1992.0
396 NaN
396 1990.0
396 1984.0
396 NaN
Name: year built, dtype: float64
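To see why the zeros have to become NaN before averaging, here is a small sketch using only the block 396 values from the sample data. The variable names are mine and purely illustrative.

```python
import numpy as np
import pandas as pd

# "year built" values for Alphabet City, borough 1, block 396 (zeros included)
years = pd.Series([1985, 1986, 1992, 0, 1990, 1984, 0])

# Zeros count as real observations and drag the average down
mean_with_zeros = years.mean()

# NaN is skipped by mean() by default, so only the five real years are averaged
mean_without_zeros = years.replace(0, np.nan).mean()

print(mean_with_zeros)
print(mean_without_zeros)
```

With the zeros left in, the average comes out at roughly 1419.6, which is not a plausible year; with NaN it is the 1987.4 the question asks for.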
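Next, compute the mean for each group:
# groupby(level=...) replaces the older v.mean(level=[0, 1, 2]), which pandas 2.0 removed
m = v.groupby(level=[0, 1, 2], sort=False).mean()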
m
neighborhood borough block
Annadale 5 5425 2015.0
Woodside 4 2327 1966.0
Alphabet City 1 396 1987.4
405 1996.0
Name: year built, dtype: float64
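This serves as a mapping, which we pass to fillna. fillna then replaces the NaNs introduced earlier with the mean that matches each row's index. Once that is done, just reset the index to get the original structure back.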
v.fillna(m).reset_index()
neighborhood borough block year built
0 Annadale 5 5425 2015.0
1 Woodside 4 2327 1966.0
2 Alphabet City 1 396 1985.0
3 Alphabet City 1 405 1996.0
4 Alphabet City 1 396 1986.0
5 Alphabet City 1 396 1992.0
6 Alphabet City 1 396 1987.4
7 Alphabet City 1 396 1990.0
8 Alphabet City 1 396 1984.0
9 Alphabet City 1 396 1987.4
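For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It rebuilds the ten sample rows by hand (the `...` columns are left out) and takes the group mean with `groupby(level=...)`. The names `keys` and `filled` are my own and not part of the original data.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, without the elided columns
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

keys = ["neighborhood", "borough", "block"]

# Step 1: index by location and turn the zero placeholders into NaN
v = df.set_index(keys)["year built"].replace(0, np.nan)

# Step 2: one mean per {neighborhood, borough, block}; NaN is ignored
m = v.groupby(level=[0, 1, 2], sort=False).mean()

# Step 3: fill each NaN with the mean that shares its index entry
filled = v.fillna(m).reset_index()

print(filled)
```

One thing to keep in mind: if every row in a group has zero, that group has nothing to average, so its mean is NaN and those rows stay NaN after the fill.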
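Answer 1 (score: 2)
I'd use mask inside a groupby.apply. I do it this way only because I like how it flows. I make no claim that it is particularly fast. Even so, this answer may offer some perspective on the possible alternatives.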
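gidx = ['neighborhood', 'borough', 'block']
def fill_with_mask(s):
    # mean of the non-zero years in this group, written over the zeros
    mean = s.loc[lambda x: x != 0].mean()
    return s.mask(s.eq(0), mean)
# group_keys=False keeps the original row index on the result;
# recent pandas versions otherwise prepend the group keys to it
df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)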
0 2015.0
1 1966.0
2 1985.0
3 1996.0
4 1986.0
5 1992.0
6 1987.4
7 1990.0
8 1984.0
9 1987.4
Name: year built, dtype: float64
Then we can use pd.DataFrame.assign:
df.assign(**{'year built': df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)})
neighborhood borough block year built
0 Annadale 5 5425 2015.0
1 Woodside 4 2327 1966.0
2 Alphabet City 1 396 1985.0
3 Alphabet City 1 405 1996.0
4 Alphabet City 1 396 1986.0
5 Alphabet City 1 396 1992.0
6 Alphabet City 1 396 1987.4
7 Alphabet City 1 396 1990.0
8 Alphabet City 1 396 1984.0
9 Alphabet City 1 396 1987.4
The same task could be accomplished with column assignment:
df['year built'] = df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)
Or
df.update(df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask))
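If calling a Python function once per group becomes a concern, one possible alternative (not part of the answer above) is to let pandas compute the group mean with its built-in "mean" aggregation through transform, which returns one value per original row. This is only a sketch: it assumes zero is the only placeholder for a missing year, and `out` and `group_mean` are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, without the elided columns
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

gidx = ["neighborhood", "borough", "block"]

# Work on a copy so the original frame keeps its zeros
out = df.copy()
out["year built"] = out["year built"].replace(0, np.nan)

# transform("mean") broadcasts each group's mean back onto its own rows
group_mean = out.groupby(gidx)["year built"].transform("mean")

# Both Series share the same row index, so the fill lines up row by row
out["year built"] = out["year built"].fillna(group_mean)

print(out)
```

Because `group_mean` has the same index as the frame, no index juggling is needed, and the result should match the output shown above.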