I have a pandas DataFrame:
City State
0 Cambridge MA
1 NaN DC
2 Boston MA
3 Washignton DC
4 NaN MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 NaN FL
11 Washington DC
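I want to fill each NaN with the most frequent city in its state, provided the state has appeared before with a known city, so I group by State and apply the following code: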
df['City'] = df.groupby('State').transform(lambda x:x.fillna(x.value_counts().idxmax()))
The above code works if every state has occurred before, and the output is
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
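But I want to add a condition so that if a state has never occurred with a known city, its city becomes the most frequent value in the entire City column. That is, if the DataFrame is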
City State
0 Cambridge MA
1 NaN DC
2 Boston MA
3 Washignton DC
4 NaN MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 NaN FL
11 Washington DC
12 NaN NY
where NY has never occurred before, I want the output to be
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
12 Cambridge NY
The code above raises ValueError: ('attempt to get argmax of an empty sequence') because "NY" has never occurred before.
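The error can be reproduced in isolation: `value_counts()` drops NaN by default, so a group that contains only NaN produces an empty result, and `idxmax()` has nothing to pick from. A minimal sketch of a guard follows; the helper name `most_common` and its `default` parameter are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd


def most_common(series, default=None):
    """Return the most frequent non-null value, or `default` if there is none."""
    counts = series.value_counts()  # NaN values are excluded by default
    if counts.empty:
        # This is the case that makes idxmax() raise on an all-NaN group
        return default
    return counts.idxmax()


ny_cities = pd.Series([np.nan], dtype=object)       # a state with no known city
ma_cities = pd.Series(["Cambridge", "Boston", "Cambridge"])

print(most_common(ny_cities, default="fallback"))
print(most_common(ma_cities))
```

Checking for an empty count before calling `idxmax()` is what allows a fallback value to be substituted for states like NY.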
Answer 0 (score: 2)
IIUC:
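import numpy as np
def f(x):
    # An all-NaN group (e.g. NY) has nothing to count
    if x.count() <= 0:
        return np.nan
    return x.value_counts().index[0]
# Most frequent city per state, broadcast to every row of the group
df['City'] = df.groupby('State')['City'].transform(f)
# Any remaining NaN gets the most frequent city overall
df['City'] = df['City'].fillna(df['City'].value_counts().idxmax())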
Output:
City State
0 Cambridge MA
1 Washignton DC
2 Cambridge MA
3 Washignton DC
4 Cambridge MA
5 Miami FL
6 Cambridge MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washignton DC
12 Cambridge NY
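Note that the output above replaces every city with its state's most frequent city (for example, Boston in row 2 becomes Cambridge). If only the missing entries should change, one possible variant is sketched below. It assumes the question's sample data; the names `state_mode`, `overall_mode`, and `result` are made up for this illustration. It relies on `Series.mode()`, which, to my knowledge, returns tied values in sorted order, so ties such as Cambridge vs. Miami are broken alphabetically rather than by order of appearance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Cambridge", np.nan, "Boston", "Washignton", np.nan, "Tampa",
             "Danvers", "Miami", "Cambridge", "Miami", np.nan, "Washington",
             np.nan],
    "State": ["MA", "DC", "MA", "DC", "MA", "FL",
              "MA", "FL", "MA", "FL", "FL", "DC", "NY"],
})

# Only rows with a known city contribute to the counts
known = df.dropna(subset=["City"])

# Most frequent city per state; states with no known city (NY) are absent
state_mode = known.groupby("State")["City"].agg(lambda s: s.mode().iloc[0])

# Fallback for states that never occurred with a city
overall_mode = known["City"].mode().iloc[0]

# Fill NaN from the per-state lookup first, then from the overall fallback
filled = df["City"].fillna(df["State"].map(state_mode)).fillna(overall_mode)
result = df.assign(City=filled)
print(result)
```

Because `fillna` only touches missing entries, existing values such as Boston, Tampa, and Danvers are left as they are.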
Answer 1 (score: 0)
You can solve this with the following code:
mode = df['City'].mode()[0]
# group_keys=False keeps the original row index so the result aligns with df
df['City'] = df.groupby('State', group_keys=False)['City'].apply(lambda x: x.fillna(x.value_counts().idxmax() if x.value_counts().max() >= 1 else mode))
df['City'] = df['City'].fillna(df['City'].value_counts().idxmax())
Output:
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
12 Cambridge NY