I am working on a project that is looking for a lean Python AutoML pipeline implementation. As defined by the project, the data entering the pipeline comes in the form of serialized business objects, for example (artificial example):
property.json:
{
    "area": "124",
    "swimming_pool": "False",
    "rooms": [
        ... some information on individual rooms ...
    ]
}
The machine learning target (e.g., predicting whether a property has a swimming pool based on its other attributes) is stored inside the business objects rather than being supplied as a separate label vector, and the business objects may contain observations that should not be used for training.
I need a pipeline engine that supports initial (or later) pipeline steps which i) dynamically change the targets of the machine learning problem (e.g., extract them from the input data, threshold real values) and ii) resample the input data (e.g., upsample or downsample classes, filter observations).
Ideally, the pipeline should look as follows (pseudocode):
swimming_pool_pipeline = Pipeline([
    ("label_extractor", SwimmingPoolExtractor()),  # skipped in prediction mode
    ("sampler", DataSampler()),                    # skipped in prediction mode
    ("featurizer", SomeFeaturization()),
    ("my_model", FitSomeModel())
])
swimming_pool_pipeline.fit(training_data) # not passing in any labels
preds = swimming_pool_pipeline.predict(test_data)
The pipeline execution engine needs to satisfy the following:
When calling .fit():
- SwimmingPoolExtractor extracts the target labels from the input training data and passes the labels on (alongside the independent variables);
- DataSampler() uses the target labels extracted in the previous step to sample observations (e.g., it could do minority upsampling or filter out observations).

When calling .predict() (prediction mode):
- SwimmingPoolExtractor() does nothing and simply passes the input data through;
- DataSampler() does nothing and simply passes the input data through.

For example, assume the data look as follows:
property.json:
"properties" = [
{ "id_": "1",
"swimming_pool": "False",
...,
},
{ "id_": "2",
"swimming_pool": "True",
...,
},
{ "id_": "3",
# swimming_pool key missing
...,
}
]
Applying SwimmingPoolExtractor() would extract something like
"labels": [
{"id_": "1", "label": "0"},
{"id_": "2", "label": "1"},
{"id_": "3", "label": "-1"}
]
from the input data, pass it on, and set it as the machine learning pipeline's targets.
Applying DataSampler() could then, for example, also include logic that removes from the entire training dataset any training instance that did not contain a swimming_pool key at all (label = -1).
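
To make the intended behaviour concrete, here is a minimal plain-Python sketch of the kind of logic these two steps would encapsulate; the helper names extract_labels and drop_unlabelled are illustrative only and not part of any library:

def extract_labels(properties):
    # Map each business object to a label: "1"/"0" when the swimming_pool
    # key is present, "-1" when it is missing (no usable target).
    labels = []
    for prop in properties:
        if "swimming_pool" not in prop:
            labels.append({"id_": prop["id_"], "label": "-1"})
        else:
            labels.append({
                "id_": prop["id_"],
                "label": "1" if prop["swimming_pool"] == "True" else "0",
            })
    return labels

def drop_unlabelled(properties, labels):
    # Drop observations whose label is "-1" (no swimming_pool key),
    # keeping the data and the labels aligned.
    kept = [(p, l) for p, l in zip(properties, labels) if l["label"] != "-1"]
    return [p for p, _ in kept], [l for _, l in kept]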
Subsequent steps should then fit the model using the modified training data (filtered, no longer containing the observation with id_=3). As stated above, in prediction mode DataSampler and SwimmingPoolExtractor would simply pass the input data through.
As far as I can tell, neither neuraxle nor sklearn (of the latter I am certain) offers pipeline steps with the required functionality (from what I have gathered so far, neuraxle must at least have some support for slicing data, given that it implements cross-validation meta-estimators).
Am I missing something, or is there a way to implement such functionality in either of these pipeline models? If not, are there reasonably mature alternatives to the listed libraries in the Python ecosystem that support such a use case (leaving aside the issues that may arise from designing pipelines in this manner)?
Answer 0 (score: 1):
"Am I missing something, or is there a way to implement such functionality?"
Yes: the trick is to keep the y hidden inside the x that is sent into the pipeline (so that, as you want, no labels are effectively passed to fit). This works provided that the input data passed to fit is some kind of iterable (e.g., do not pass the whole json in one piece; at the very least, turn it into something that can be iterated over). At worst, pass a list of IDs and add a step that converts those IDs into something else, for instance using an object that can go and fetch the json by itself to do whatever it needs with the passed IDs.
# Note: the import paths below reflect Neuraxle around version 0.5 and may differ in other versions.
from neuraxle.base import BaseStep, NonFittableMixin
from neuraxle.pipeline import Pipeline
from neuraxle.steps.flow import TrainOnlyWrapper
from neuraxle.steps.output_handlers import InputAndOutputTransformerMixin
class SwimmingPoolExtractor(NonFittableMixin, InputAndOutputTransformerMixin, BaseStep):
    # Note: you may need to remove NonFittableMixin from the bases here if you
    # encounter problems, and define "fit" yourself rather than having it
    # provided by default through the mixin.

    def transform(self, data_inputs):
        # Here, the InputAndOutputTransformerMixin will pass
        # a tuple of (x, y) rather than just x.
        x, _ = data_inputs

        # Please note that you should pre-split your json into
        # lists before the pipeline so as to have this assert pass:
        assert hasattr(x, "__iter__"), "input data must be iterable at least."
        x, y = self._do_my_extraction(x)  # TODO: implement this as you wish!

        # Note that InputAndOutputTransformerMixin expects you
        # to return an (x, y) tuple, not only x.
        outputs = (x, y)
        return outputs
class DataSampler(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        # TODO: implement this as you wish!
        data_inputs = self._do_my_sampling(data_inputs)

        assert hasattr(data_inputs, "__iter__"), "data must stay iterable at least."
        return data_inputs
swimming_pool_pipeline = Pipeline([
    TrainOnlyWrapper(SwimmingPoolExtractor()),  # skipped in `.predict(...)` call
    TrainOnlyWrapper(DataSampler()),            # skipped in `.predict(...)` call
    SomeFeaturization(),
    FitSomeModel()
])
swimming_pool_pipeline.fit(training_data) # not passing in any labels!
preds = swimming_pool_pipeline.predict(test_data)
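
For completeness, here is a minimal sketch of how training_data could be pre-split into an iterable before calling fit, assuming the serialized objects live in a file with a top-level "properties" list as in the question's example (load_training_data and the file name are illustrative only):

import json

def load_training_data(path="property.json"):
    # Hypothetical helper: parse the serialized business objects and return a
    # plain list, so that each pipeline step receives an iterable of property
    # dicts, as the asserts above expect.
    with open(path) as f:
        payload = json.load(f)
    return payload["properties"]

training_data = load_training_data()
swimming_pool_pipeline.fit(training_data)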
It is also possible to replace the plain call to fit with an AutoML loop:

# Imports for the AutoML part (paths based on Neuraxle ~0.5; they may differ in other versions):
from neuraxle.metaopt.auto_ml import AutoML, InMemoryHyperparamsRepository, ValidationSplitter
from neuraxle.metaopt.callbacks import ScoringCallback
from sklearn.metrics import mean_squared_error

auto_ml = AutoML(
    swimming_pool_pipeline,
    # You can create your own splitter class if needed to replace this one; dig into
    # the Neuraxle source code to see how it is done and write your own replacement.
    validation_splitter=ValidationSplitter(0.20),
    refit_trial=True,
    n_trials=10,
    epochs=1,
    cache_folder_when_no_handle=str(tmpdir),
    scoring_callback=ScoringCallback(mean_squared_error, higher_score_is_better=False),  # mean_squared_error from sklearn
    hyperparams_repository=InMemoryHyperparamsRepository(cache_folder=str(tmpdir))
)
best_swimming_pool_pipeline = auto_ml.fit(training_data).get_best_model()
preds = best_swimming_pool_pipeline.predict(test_data)
If you want to use caching, then you should not define any transform methods; instead, define handle_transform methods (or the related handler methods) so that the order of the data "IDs" is preserved when you resample the data. Neuraxle is built to process iterables, which is why the asserts above check that your json has already been preprocessed into a list of something.
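
As a rough illustration of that last point, here is an untested sketch of what a handler-based sampler could look like; it follows my reading of Neuraxle's DataContainer API around version 0.5 (data_inputs, expected_outputs, current_ids and the _transform_data_container hook), so treat the exact names as assumptions that may differ in your version:

from neuraxle.base import BaseStep, ExecutionContext, NonFittableMixin
from neuraxle.data_container import DataContainer

class HandlerBasedDataSampler(NonFittableMixin, BaseStep):
    def _transform_data_container(self, data_container: DataContainer, context: ExecutionContext) -> DataContainer:
        # Filter data inputs, expected outputs and current ids together so that
        # the IDs stay aligned with the resampled observations
        # (here: drop observations whose expected output is "-1").
        kept = [
            (di, eo, ci)
            for di, eo, ci in zip(
                data_container.data_inputs,
                data_container.expected_outputs,
                data_container.current_ids,
            )
            if eo != "-1"
        ]
        data_inputs = [di for di, _, _ in kept]
        expected_outputs = [eo for _, eo, _ in kept]
        current_ids = [ci for _, _, ci in kept]
        return DataContainer(
            data_inputs=data_inputs,
            current_ids=current_ids,
            expected_outputs=expected_outputs,
        )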