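I want to write a class that replaces the missing values in one target column of a DataFrame, based on the k nearest neighbors identified from the other columns. The class would "fit" a KNN on the train set and later "predict" the missing values of both the train and test sets.
This class has to be usable in an sklearn Pipeline, which means it must implement the fit() and transform() functions that the pipeline calls. I could not find a good way to write such a class.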
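For reference, the fit()/transform() contract that a Pipeline relies on can be sketched with a much simpler transformer. This is only a minimal illustration, not the KNN imputer itself: `MeanFiller`, the step name `fill_age` and the toy `train`/`test` frames are hypothetical names invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MeanFiller(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: fills NaNs in one numeric column with the train mean."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Learn state only from the data passed to fit (the train set).
        self.mean_ = X[self.column].mean()
        return self

    def transform(self, X):
        # Apply the learned state without refitting anything.
        X = X.copy()
        X[self.column] = X[self.column].fillna(self.mean_)
        return X


# Toy data, invented for the example.
train = pd.DataFrame({"age": [20.0, 40.0, np.nan]})
test = pd.DataFrame({"age": [np.nan, 10.0]})

pipe = Pipeline([("fill_age", MeanFiller(column="age"))])
pipe.fit(train)                     # calls MeanFiller.fit(train)
test_filled = pipe.transform(test)  # calls MeanFiller.transform(test)
```

The missing value in `test` is filled with the mean computed on `train`, because everything learned lives on the fitted object rather than being recomputed in transform().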
What my code does so far:
My main problem is that steps 1.a and 1.b create temporary DataFrames that should not be "refit" on the test set.
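One common way to avoid refitting is to keep only what was learned from the train set and re-apply it to any later frame. The sketch below assumes that one of the temporary DataFrames is a one-hot-encoded frame and uses plain pandas; the toy `train`/`test` frames and the variable names are hypothetical.

```python
import pandas as pd

# Toy data, invented for the example.
train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test = pd.DataFrame({"workclass": ["Private", "Self-emp"]})

# "fit": remember the dummy columns produced by the train set.
train_ohe = pd.get_dummies(train, columns=["workclass"])
train_columns = train_ohe.columns

# "transform": encode the test set, then align it to the train columns.
# Categories unseen in train are dropped; categories absent from test become 0.
test_ohe = pd.get_dummies(test, columns=["workclass"]).reindex(
    columns=train_columns, fill_value=0
)
```

Only `train_columns` has to be stored between the two phases; the encoded frames themselves can be rebuilt on demand.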
I need your help to put my code snippets together in a proper way.
Here is my code so far:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

col = 'native-country'  # one specific column where NaNs should be replaced using KNN
n_neighbors = 3

######
# I guess this block should be in a pipeline so that we transform the test set with the same dict as the train set
######
miss = TreatMissingsWithCommons()  # this class replaces numerical NaNs by mean() and categorical NaNs by the most frequent value
miss.fit(data)
data_full = miss.transform(data)

# One-hot encode categorical variables to pass the data to KNN
ohe = DummyTransformer()
ohe.fit(data_full)

# OHE categorical features on rows where col is not null
data_ohe_full = ohe.transform(data_full[~data[col].isnull()].drop(col, axis=1))

# Fit the KNN model on rows where col is not null
if data[col].dtype in ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']:
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    knn.fit(data_ohe_full, data[col][~data[col].isnull()])
else:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(data_ohe_full, data[col][~data[col].isnull()])

# OHE on rows where col is null, and make the prediction
ohe_nulls = ohe.transform(data_full[data[col].isnull()].drop(col, axis=1))
knn.predict(ohe_nulls)
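The choice between a regressor and a classifier above depends on a hard-coded list of dtype names. A possible alternative, sketched here, is to let pandas decide whether the column is numeric; `make_knn` is a hypothetical helper name, and the two short Series are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor


def make_knn(target, n_neighbors=3):
    """Hypothetical helper: pick a KNN estimator based on the target's dtype."""
    if pd.api.types.is_numeric_dtype(target):
        return KNeighborsRegressor(n_neighbors=n_neighbors)
    return KNeighborsClassifier(n_neighbors=n_neighbors)


# A numeric column with a NaN is stored as float, so it gets a regressor.
numeric_knn = make_knn(pd.Series([39, 28, np.nan]))
# A column of strings is not numeric, so it gets a classifier.
categorical_knn = make_knn(pd.Series(["Cuba", np.nan, "United-States"]))
```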
Here is some data to help reproduce the problem:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'age': {0: 39,
            4: 28,
            10777: 53,
            14430: 21,
            19061: 19,
            19346: 39,
            24046: 39,
            25524: 43,
            30902: 18},
    'education-num': {0: 13,
                      4: 13,
                      10777: 9,
                      14430: 7,
                      19061: 8,
                      19346: 13,
                      24046: 4,
                      25524: 10,
                      30902: 5},
    'native-country': {0: 'United-States',
                       4: 'Cuba',
                       10777: np.nan,
                       14430: 'United-States',
                       19061: 'El-Salvador',
                       19346: np.nan,
                       24046: 'Dominican-Republic',
                       25524: 'United-States',
                       30902: np.nan},
    'workclass': {0: 'State-gov',
                  4: 'Private',
                  10777: 'Private',
                  14430: np.nan,
                  19061: 'Private',
                  19346: 'Private',
                  24046: 'Private',
                  25524: np.nan,
                  30902: 'Private'}})
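Edit: after a good night's sleep, I cleared up my ideas and came up with a solution. It is dirty, so I would appreciate some feedback on the good practices I am missing.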
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

class KnnImputer(TransformerMixin, BaseEstimator):
    def __init__(self, target, n_neighbors=5):
        # stored under the argument's own name so that get_params() and clone() work
        self.target = target
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        # this class replaces numerical NaNs by mean() and categorical NaNs by the most frequent value
        self.miss = TreatMissingsWithCommons()
        self.miss.fit(X)
        X_full = self.miss.transform(X)
        # One-hot encode categorical variables to pass the data to KNN
        self.ohe = DummyTransformer()
        self.ohe.fit(X_full)
        # Build a DataFrame without nulls, with categorical variables one-hot encoded,
        # restricted to the rows where the target is known
        not_null = ~X[self.target].isnull()
        X_ohe_full = self.ohe.transform(X_full[not_null].drop(self.target, axis=1))
        # Fit the KNN model on rows where the target is not null
        if X[self.target].dtype in ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']:
            self.knn = KNeighborsRegressor(n_neighbors=self.n_neighbors)
        else:
            self.knn = KNeighborsClassifier(n_neighbors=self.n_neighbors)
        self.knn.fit(X_ohe_full, X[self.target][not_null])
        return self

    def transform(self, X, y=None):
        # Fill the other columns of X itself, using the statistics learned in fit
        X_full = self.miss.transform(X)
        is_null = X[self.target].isnull()
        # OHE on rows where the target is null, and make the prediction
        ohe_nulls = self.ohe.transform(X_full[is_null].drop(self.target, axis=1))
        # Get predictions for nulls in the target
        preds = self.knn.predict(ohe_nulls)
        ## Concatenate non-null rows with null rows + target predictions
        X_nulls = X[is_null].drop(self.target, axis=1)
        X_nulls[self.target] = preds
        # note: rows are reordered (non-null rows first) and the index is reset
        X_imputed = pd.concat([X[~is_null], X_nulls], ignore_index=True)
        return X_imputed  # the DataFrame with a fully imputed target column
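One possible cleanup is sketched below, under several assumptions. TreatMissingsWithCommons and DummyTransformer are not shown in the question, so plain pandas (fillna with train statistics, pd.get_dummies aligned to the train columns) stands in for them here. The class name `KnnTargetImputer`, the helper `_features` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor


class KnnTargetImputer(TransformerMixin, BaseEstimator):
    """Sketch: impute one column with a KNN fitted on the other columns."""

    def __init__(self, target, n_neighbors=5):
        self.target = target
        self.n_neighbors = n_neighbors

    def _features(self, X):
        # Fill the other columns with train statistics, one-hot encode,
        # then align to the dummy columns seen during fit.
        feats = X.drop(columns=self.target).fillna(self.fill_values_)
        feats = pd.get_dummies(feats, dtype=float)
        return feats.reindex(columns=self.columns_, fill_value=0)

    def fit(self, X, y=None):
        others = X.drop(columns=self.target)
        numeric = others.select_dtypes(include="number").columns
        # Mean for numeric columns, most frequent value for the others.
        self.fill_values_ = {
            c: others[c].mean() if c in numeric else others[c].mode().iloc[0]
            for c in others.columns
        }
        filled = others.fillna(self.fill_values_)
        self.columns_ = pd.get_dummies(filled, dtype=float).columns
        known = X[self.target].notna()
        if pd.api.types.is_numeric_dtype(X[self.target]):
            self.knn_ = KNeighborsRegressor(n_neighbors=self.n_neighbors)
        else:
            self.knn_ = KNeighborsClassifier(n_neighbors=self.n_neighbors)
        # Train only on rows where the target is known.
        self.knn_.fit(self._features(X[known]), X.loc[known, self.target])
        return self

    def transform(self, X):
        X = X.copy()
        missing = X[self.target].isna()
        if missing.any():
            # Write predictions in place so row order and index are kept.
            X.loc[missing, self.target] = self.knn_.predict(
                self._features(X[missing])
            )
        return X


# Toy data, invented for the example.
train = pd.DataFrame({
    "age": [39, 28, 21, 19, 43, 53],
    "workclass": ["State-gov", "Private", np.nan, "Private", np.nan, "Private"],
    "native-country": ["United-States", "Cuba", "United-States",
                       "El-Salvador", "United-States", np.nan],
})
test = pd.DataFrame({
    "age": [39, 18],
    "workclass": ["Private", "Private"],
    "native-country": [np.nan, "Cuba"],
})

imputer = KnnTargetImputer(target="native-country", n_neighbors=3).fit(train)
train_imputed = imputer.transform(train)
test_imputed = imputer.transform(test)
```

Two design choices are worth noting. Everything learned in fit() is stored in attributes with a trailing underscore, and nothing from the train frame itself is kept. transform() fills the target in place, so rows are neither reordered nor re-indexed.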