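I want to create a pipeline in Scikit-Learn in which one specific step is outlier detection and removal, so that the transformed data can be passed on to other transformers and estimators.
I have searched SE but could not find an answer to this anywhere. Is this possible?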
Answer 0 (score: -1)
Yes. Subclass TransformerMixin and build a custom transformer. Here is one that extends one of the existing outlier detection methods:
import numpy as np
from sklearn.base import TransformerMixin  # TransformerMixin lives in sklearn.base
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline


class OutlierExtractor(TransformerMixin):
    def __init__(self, **kwargs):
        """
        Create a transformer to remove outliers. A threshold is set for the
        selection criterion, and further arguments are passed to the
        LocalOutlierFactor class.

        Keyword Args:
            neg_conf_val (float): The threshold for excluding samples with a
                lower negative outlier factor. Defaults to -10.0.

        Returns:
            object: to be used as a transformer step as part of Pipeline()
        """
        # Pull out the threshold; everything else goes to LocalOutlierFactor.
        self.threshold = kwargs.pop('neg_conf_val', -10.0)
        self.kwargs = kwargs

    def transform(self, X):
        """
        Uses the LocalOutlierFactor class to subselect data based on the threshold.

        Returns:
            ndarray: subsampled data

        Notes:
            X should be of shape (n_samples, n_features)
        """
        X = np.asarray(X)
        lcf = LocalOutlierFactor(**self.kwargs)
        lcf.fit(X)
        # Keep only the rows whose score is above the threshold.
        return X[lcf.negative_outlier_factor_ > self.threshold, :]

    def fit(self, *args, **kwargs):
        # Nothing to learn ahead of time; the detector is fitted in transform().
        return self
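To see what the threshold actually does, here is a minimal sketch that looks at the raw LocalOutlierFactor scores directly. The toy data (a small grid plus one distant point) and the n_neighbors=5 setting are assumptions made for illustration only; they are not part of the answer above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (assumed for illustration): a 5x5 grid plus one point far away.
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, [[50.0, 50.0]]])

lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(X)

# Scores are close to -1 for inliers and strongly negative for outliers.
scores = lof.negative_outlier_factor_

threshold = -10.0           # same default as neg_conf_val in the transformer
mask = scores > threshold   # True for the rows that would be kept
X_kept = X[mask]

print("lowest score:", scores.min(), "at row", int(np.argmin(scores)))
print("rows kept:", X_kept.shape[0], "of", X.shape[0])
```

A threshold nearer to -1 removes more points, while a more negative one removes only the most isolated samples, so it is probably worth inspecting the score distribution on your own data before settling on a value.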
Then create a pipeline like this:
pipe = Pipeline([('outliers', OutlierExtractor()), ...])
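One caveat worth keeping in mind: a scikit-learn Pipeline hands y to later steps unchanged, so a step that drops rows from X will leave X and y with different lengths once a supervised estimator follows it. A possible workaround is to filter X and y together before the pipeline is fitted. The sketch below assumes that approach; drop_outliers is a hypothetical helper, and the toy data and the choice of StandardScaler and LogisticRegression are placeholders, not anything from the original answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def drop_outliers(X, y, threshold=-10.0, **lof_kwargs):
    """Hypothetical helper: keep rows of X and y whose LOF score exceeds `threshold`."""
    X = np.asarray(X)
    y = np.asarray(y)
    lof = LocalOutlierFactor(**lof_kwargs)
    lof.fit(X)
    keep = lof.negative_outlier_factor_ > threshold
    # Apply the same mask to both arrays so they stay aligned.
    return X[keep], y[keep]


# Toy data (assumed): two 5x5 grids as two classes, plus one distant point.
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, grid + 10.0, [[100.0, 100.0]]])
y = np.array([0] * 25 + [1] * 25 + [1])

# Remove outliers up front, then fit an ordinary pipeline on the cleaned data.
X_clean, y_clean = drop_outliers(X, y, threshold=-10.0, n_neighbors=5)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_clean, y_clean)
```

Filtering only happens on the training data here; at prediction time the fitted pipeline scores every row it is given, which is usually what you want.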