Fill missing values with probabilities conditional on another attribute

Time: 2019-05-10 22:44:45

Tags: python pandas

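I want to fill missing values according to the probability distribution of the known instances, conditional on another attribute. Specifically: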

Weather_Conditions         | Road_Surface | Date_Month
---------------------------|--------------|-----------
Fine without high winds    | NaN          | 9
Fine without high winds    | NaN          | 1
Raining without high winds | Wet/Damp     | 6
Fine without high winds    | Wet/Damp     | 1
Fine without high winds    | NaN          | 2
Fine without high winds    | NaN          | 1
Raining without high winds | Wet/Damp     | 7
Raining without high winds | Wet/Damp     | 1

If the month is January, all missing Road_Surface values should be filled at a 1:3 Frost:Wet ratio.

So far I have managed to create the array of values to fill in:

road_values_jan = np.random.choice(["Frost/Ice", "Wet/Damp"], random_data["Road_Surface_Conditions"][random_data['Date_Month'].isin(["01"])].isnull().sum(), p=[0.25, 0.75])

# which outputs:
array(['Wet/Damp', 'Frost/Ice'], dtype='<U9')

The problem arises when I try to bind it back to the original dataframe. I tried

null_road = random_data["Road_Surface_Conditions"][random_data['Date_Month'].isin(["01"])].isnull()

random_data.loc['null_road'] = np.random.choice(road_values_jan, road_values_jan.size)

(taken from this thread), but it raises: ValueError: cannot set a row with mismatched columns

I have also played around with

random_data["Road_Surface_Conditions"][random_data['Date_Month'].isin(["01"])] = random_data["Road_Surface_Conditions"][random_data['Date_Month'].isin(["01"])].fillna(pandas.Series(road_values_jan, index=random_data.index))

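but this one gives me ValueError: Length of passed values is 2, index implies 8

How can I attach this array of two values to the NaN values, conditional on the month?

Please find the data below in .csv style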

Weather_Conditions,Road_Surface_Conditions,Date_Month
Fine without high winds,NaN,9
Fine without high winds,NaN,1
Raining without high winds,Wet/Damp,6
Fine without high winds,Wet/Damp,1
Fine without high winds,NaN,2
Fine without high winds,NaN,1
Raining without high winds,Wet/Damp,7
Raining without high winds,Wet/Damp,1

1 Answer:

Answer 0: (score: 0)

If I understand you correctly, you can first create an array with a 25:75 distribution whose size matches the number of NaN values you want to fill, then select the NaN entries in the Road_Surface_Conditions column and fill them with the created array:

m = (df['Road_Surface_Conditions'].isnull() & df['Date_Month'].eq(1)).sum()
s = np.random.choice(['Frost/Ice', 'Wet/Damp'], p=[0.25, 0.75], size=m)
print(s)

['Wet/Damp' 'Frost/Ice']

df.loc[df['Road_Surface_Conditions'].isnull() & df['Date_Month'].eq(1), 'Road_Surface_Conditions'] = s
print(df)

           Weather_Conditions  Road_Surface_Conditions  Date_Month
0     Fine without high winds                      NaN           9
1     Fine without high winds                 Wet/Damp           1
2  Raining without high winds                 Wet/Damp           6
3     Fine without high winds                 Wet/Damp           1
4     Fine without high winds                      NaN           2
5     Fine without high winds                Frost/Ice           1
6  Raining without high winds                 Wet/Damp           7
7  Raining without high winds                 Wet/Damp           1

Note that my dataframe is called df instead of random_data.
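For anyone who wants to run this end to end, here is a minimal sketch built on the sample data from the question. It assumes Date_Month is parsed as an integer (as in the CSV sample; if your column holds strings such as "01", compare against that string instead). The names csv_text, mask, rng and fill, and the seed 0, are my own choices for illustration and not part of the original code.

```python
import io

import numpy as np
import pandas as pd

# Sample data copied from the question.
csv_text = """Weather_Conditions,Road_Surface_Conditions,Date_Month
Fine without high winds,NaN,9
Fine without high winds,NaN,1
Raining without high winds,Wet/Damp,6
Fine without high winds,Wet/Damp,1
Fine without high winds,NaN,2
Fine without high winds,NaN,1
Raining without high winds,Wet/Damp,7
Raining without high winds,Wet/Damp,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Rows that are missing a road surface AND were recorded in January.
# Assumption: Date_Month is an integer column, so January is 1.
mask = df["Road_Surface_Conditions"].isna() & df["Date_Month"].eq(1)

# Seeded generator so the draw is repeatable (seed 0 is arbitrary).
rng = np.random.default_rng(0)

# One draw per masked row, with a 1:3 Frost/Ice to Wet/Damp ratio.
fill = rng.choice(["Frost/Ice", "Wet/Damp"], size=int(mask.sum()), p=[0.25, 0.75])

# The array length equals the number of True entries in the mask,
# so the values are assigned to exactly those rows, in order.
df.loc[mask, "Road_Surface_Conditions"] = fill

print(df)
```

The key point is that the same boolean mask is used both to count the rows and to select them in df.loc, so the array of drawn values always has the right length. With only two rows to fill, the observed split will not necessarily look like 1:3; the ratio only shows up on average over many rows.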