Question

Consider the following Pandas DataFrame:

In [1]: df
Out[1]: 
   Col 1     Col 2     Col 3 
0  A         10        64 
1  B          3        17
2  A         12        50
3  A         15        NaN
4  B          2        NaN
5  B          1        22
6  A          9        47
7  B          6        15

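Col 1 and Col 2 have no missing values, but two values are missing in Col 3. We want to replace the NaNs in Col 3 using an imputer with strategy="most_frequent". However, the imputation should not be based on the "global" most frequent value in Col 3. It should depend on the value in Col 1 (because we observe that the values in Col 3 for rows with "A" in Col 1 all tend to be larger, and vice versa).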

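This process involves two elements:

First, we only want to fit the imputer on one column (since Col 1 and Col 2 have no missing values).
Second, we want to train as many imputers as there are unique values in Col 1, then apply each imputer to the matching subset of the DataFrame so that every NaN is replaced with the correct most_frequent value.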
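
For illustration, the two elements above can be sketched with pandas alone, without scikit-learn. This is a minimal sketch, not a definitive solution: it assumes the sample data shown earlier, the helper name fill_with_group_mode is hypothetical, and ties between equally frequent values are broken by taking the smallest one (in the sample data every value in Col 3 occurs once, so each group is a tie).

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Col 1": ["A", "B", "A", "A", "B", "B", "A", "B"],
    "Col 2": [10, 3, 12, 15, 2, 1, 9, 6],
    "Col 3": [64, 17, 50, np.nan, np.nan, 22, 47, 15],
})

def fill_with_group_mode(s):
    """Hypothetical helper: fill NaNs in one group with that group's mode."""
    modes = s.mode()  # ignores NaN; results are sorted ascending
    if modes.empty:   # group contains only NaN, nothing to fill with
        return s
    return s.fillna(modes.iloc[0])  # tie-break: smallest of the modes

# transform keeps the original index, so values land in the correct rows
df["Col 3"] = df.groupby("Col 1")["Col 3"].transform(fill_with_group_mode)
print(df)
```

Because transform returns a result aligned to the original index, no manual write-back step is needed.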

I tried the following (which may be very wrong):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # successor to the old preprocessing.Imputer

for i in df['Col 1'].unique():  # get all unique values in Col 1
    imp = SimpleImputer(strategy='most_frequent')
    mask = df['Col 1'] == i  # rows whose Col 1 value equals i
    df_tmp = df.loc[mask, 'Col 3']  # get all values, missing and non-missing, in Col 3 with i as value in Col 1
    print(df_tmp.head())  # This will show that there are some missing values
    new_vals = imp.fit_transform(df_tmp.to_numpy().reshape(-1, 1))  # for some reason, I have to change the dimensions of my series. How come?
    df_new = pd.DataFrame(new_vals)
    print(df_new.head())  # This will show the same values as the first print, but the NaNs have been replaced correctly. The index and column names, however, have been reset.

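From here, my problem is getting the new values back into the initial DataFrame in the correct rows. Moreover, the loop-based approach above does not strike me as the smartest way to do this. Also, since the imputers are not saved, they cannot be applied to test data to fill in any NaNs there (of course, you could save all the imputers in a dict (or could you?), but again, that does not seem very efficient).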
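
One possible way to keep the fitted values around for test data is sketched below. It is only a sketch under stated assumptions: it stores one fill value per Col 1 category in a pandas Series rather than saving scikit-learn imputer objects, the helper name group_mode and the test rows are made up for illustration, and ties are again broken by taking the smallest mode.

```python
import numpy as np
import pandas as pd

# Training data: the sample from the question
train = pd.DataFrame({
    "Col 1": ["A", "B", "A", "A", "B", "B", "A", "B"],
    "Col 2": [10, 3, 12, 15, 2, 1, 9, 6],
    "Col 3": [64, 17, 50, np.nan, np.nan, 22, 47, 15],
})

def group_mode(s):
    """Hypothetical helper: most frequent non-NaN value, or NaN if none."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# "Fit": one fill value per category of Col 1, indexed by that category
fill_values = train.groupby("Col 1")["Col 3"].agg(group_mode)

# "Transform": made-up test rows, filled with the values learned on train
test = pd.DataFrame({
    "Col 1": ["A", "B", "A"],
    "Col 2": [11, 4, 8],
    "Col 3": [np.nan, np.nan, 55],
})
test["Col 3"] = test["Col 3"].fillna(test["Col 1"].map(fill_values))
print(test)
```

The fit step runs once on the training data, and the lookup via map applies the same per-category values to any later DataFrame. A category that never appeared in training maps to NaN and is left unfilled.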

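I have not been able to find any built-in functionality for this in scikit-learn's imputer, and I have searched the web in vain for similar questions.

Hopefully there is a smart way to do this. Please let me know if you need any further details.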

Impute values in a specific column, where the imputed value depends on other values in the DataFrame

0 answers: