I am working with the bigmart dataset, and I want to replace the missing values in one column based on the values in another column:
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 NaN 0-1000
4 High 0-1000
... ... ...
8518 High 2000-3000
8519 NaN 0-1000
8520 Small 1000-2000
8521 Medium 1000-2000
8522 Small 0-1000
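So if train["Outlet_Size"] is NaN and train["sales_bin"] is "0-1000",
train["Outlet_Size"] should become "Small";
otherwise it should become "Medium".
But I really don't know how to write this, and everything I have found so far has only confused me.
Is it possible? If so, how?
Thank you very much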
Answer 0: (score: 1)
Use Series.isna to create a boolean mask of the missing values, then use np.where + Series.eq to choose between Small and Medium for those rows, depending on whether sales_bin equals 0-1000:
m = df['Outlet_Size'].isna()
df.loc[m, 'Outlet_Size'] = np.where(df.loc[m, 'sales_bin'].eq('0-1000'), 'Small', 'Medium')
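For anyone who wants to try this without the full dataset, here is a minimal self-contained sketch of the same approach. The small DataFrame is made-up toy data standing in for the real frame, and the variable name `filled` is only there for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame (values are illustrative only).
df = pd.DataFrame({
    "Outlet_Size": ["Medium", np.nan, "High", np.nan, "Small"],
    "sales_bin": ["3000-4000", "0-1000", "0-1000", "2000-3000", "1000-2000"],
})

# Boolean mask: True where Outlet_Size is missing.
m = df["Outlet_Size"].isna()

# Only for the missing rows: "Small" if sales_bin is "0-1000", else "Medium".
df.loc[m, "Outlet_Size"] = np.where(
    df.loc[m, "sales_bin"].eq("0-1000"), "Small", "Medium"
)

filled = df["Outlet_Size"].tolist()
print(filled)
```

Rows that already had a value (including the `High` row whose sales_bin is 0-1000) are left alone, because the assignment is restricted to the mask.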
Answer 1: (score: 1)
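You can use pandas.Series.map instead of numpy.where.
pandas.Series.map seems more convenient for simple cases like this one, and it makes multiple imputations easier and more explicit, because they can be written as a dictionary (e.g. {'0-1000': 'Small', '2000-3000': 'High'}).
numpy.where is designed for more general logic (e.g. if a < 5 then a^2), which is not really needed in the OP's use case, and that generality comes at a cost: multiple imputations become harder to handle, since they require nested if-else expressions.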
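Steps: build a mask of the missing Outlet_Size values, map sales_bin through a dictionary for those rows only, then fill whatever is still missing with the default value Medium.
Example: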
import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
mapping_dict = dict({'0-1000': 'Small'})
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 Small 0-1000
4 High 0-1000
5 High 2000-3000
6 Small 0-1000
7 Small 1000-2000
8 Medium 1000-2000
9 Small 0-1000
10 Medium 2000-3000
11 Medium 1000-2000
Example with multiple imputations:
import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
mapping_dict = dict({'0-1000': 'Small', '2000-3000': 'High'})
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 Small 0-1000
4 High 0-1000
5 High 2000-3000
6 Small 0-1000
7 Small 1000-2000
8 Medium 1000-2000
9 Small 0-1000
10 High 2000-3000
11 Medium 1000-2000
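To make the comparison with numpy.where concrete, here is a small sketch that performs the same two-rule imputation both ways on toy data. The frame and the names `nested`, `mapped` and `same` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): three missing sizes and one known size.
df = pd.DataFrame({
    "Outlet_Size": [np.nan, np.nan, np.nan, "High"],
    "sales_bin": ["0-1000", "2000-3000", "1000-2000", "0-1000"],
})

missing = df["Outlet_Size"].isna()
bins = df.loc[missing, "sales_bin"]

# numpy.where: every extra rule adds another level of nesting.
nested = np.where(
    bins.eq("0-1000"), "Small",
    np.where(bins.eq("2000-3000"), "High", "Medium"),
)

# Series.map: every extra rule is one more dictionary entry;
# bins not in the dictionary become NaN and then get the default.
mapped = bins.map({"0-1000": "Small", "2000-3000": "High"}).fillna("Medium")

same = nested.tolist() == mapped.tolist()
print(nested.tolist(), mapped.tolist(), same)
```

Both versions give the same values here; the difference is only in how readable the rule set stays as it grows.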
Answer 2: (score: 0)
Following Shubham Sharma's suggestion (using np.select), and using the feature "Item_Outlet_Sales" instead of "sales_bin".
So:
Outlet_Size Item_Outlet_Sales
0 Medium 3735.1380
1 Medium 443.4228
2 Medium 2097.2700
3 NaN 732.3800
4 High 994.7052
... ... ...
8518 High 2778.3834
8519 NaN 549.2850
8520 Small 1193.1136
8521 Medium 1845.5976
8522 Small 765.6700
import numpy as np
# Rows where Outlet_Size is missing
missing = train["Outlet_Size"].isna()
# Numeric sales for those rows only (Item_Outlet_Sales, not the string column sales_bin)
sales = train.loc[missing, "Item_Outlet_Sales"]
condlist = [sales <= 1000, sales > 1000]
# PS: if I understood correctly, you can add as many conditions as you want,
# as long as condlist and choicelist have the same length
choicelist = ["Small", "Medium"]
# A string default keeps the result dtype consistent with the string choices
train.loc[missing, "Outlet_Size"] = np.select(condlist, choicelist, default="Medium")
train["Outlet_Size"].value_counts(dropna=False)
Small 4798
Medium 2793
High 932
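As a sketch of the "add as many conditions as you want" remark, here is a self-contained np.select example on toy data. The second threshold (3000) and the use of "High" as the default are hypothetical choices made only for this illustration; they are not taken from the dataset or from the answers above.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only).
train = pd.DataFrame({
    "Outlet_Size": ["Medium", np.nan, "High", np.nan, np.nan],
    "Item_Outlet_Sales": [3735.138, 732.38, 994.7052, 549.285, 3500.0],
})

missing = train["Outlet_Size"].isna()
sales = train.loc[missing, "Item_Outlet_Sales"]

# Conditions are checked in order and the first match wins,
# so the second one effectively means "between 1000 and 3000".
condlist = [sales <= 1000, sales <= 3000]
choicelist = ["Small", "Medium"]

# Rows matching no condition (here: sales above the hypothetical 3000) get the default.
train.loc[missing, "Outlet_Size"] = np.select(condlist, choicelist, default="High")

result = train["Outlet_Size"].tolist()
print(result)
```

Passing a string `default` matters: np.select falls back to 0 when no condition matches, which does not mix cleanly with string choices.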
Thank you very much for your suggestions, and for this great forum :)