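I am trying to set up a GridSearchCV in sklearn with a TimeSeriesSplit, normalizing the data with statistics computed on the training data only. What I do is create a TransformerMixin called DivisorTransform, which computes the divisor used for normalization and stores it. The DivisorTransform is instantiated before the Pipeline. In the pipeline I add the DivisorTransform (so that it gets fitted) and then a NormalizeTransformer, which takes the DivisorTransform as input and performs the division. However, passing the pipeline to GridSearchCV pickles the transformers. As a result the DivisorTransform is pickled and fitted, then the NormalizeTransformer is pickled, but since it holds a DivisorTransform of its own, that DivisorTransform is pickled again. The NormalizeTransformer therefore ends up using a DivisorTransform that has not been fitted.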
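The loss of the shared reference can be reproduced without a grid search. The sketch below is a minimal illustration, assuming the copying goes through `sklearn.base.clone`; `Divisor` and `Normalize` are hypothetical stand-ins for the two transformers described above, not the original classes.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline


class Divisor(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DivisorTransform: learns a divisor in fit."""

    def fit(self, X, y=None):
        self.divisor_ = max(X)
        return self

    def transform(self, X):
        return X


class Normalize(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for NormalizeTransformer: holds a Divisor."""

    def __init__(self, divisor_transform=None):
        # Stored under the same name as the argument so get_params() finds it.
        self.divisor_transform = divisor_transform

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


shared = Divisor()
pipe = Pipeline([("divisor", shared), ("normalize", Normalize(shared))])

# Before copying, the pipeline step and the nested attribute are one object.
same_before = (
    pipe.named_steps["divisor"] is pipe.named_steps["normalize"].divisor_transform
)

# clone() copies the step and the nested parameter independently.
cloned = clone(pipe)
same_after = (
    cloned.named_steps["divisor"] is cloned.named_steps["normalize"].divisor_transform
)

print(same_before, same_after)
```

After the copy, fitting the `divisor` step no longer touches the object held by the `normalize` step, which matches the behaviour described above.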
Here is a short example of the setup:
dt = DivisorTransform()
pipe = Pipeline([('divisor', dt), ('normalize', NormalizeTransformer(dt))])
gridS = GridSearchCV(pipe, param_grid={...}, cv=TimeSeriesSplit())
How can I manage different normalizations within GridSearchCV? What are the best practices?
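One arrangement that avoids sharing an object between two steps, sketched here as an assumption and not as an established best practice, is to let a single transformer learn the divisor in `fit` and apply it in `transform`. The search fits a fresh copy of the pipeline on each training fold, so the divisor then comes from the training portion only. `MaxDivisorScaler`, the `Ridge` model, the `alpha` grid and the target `y` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline


class MaxDivisorScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: learns column maxima in fit, divides in transform."""

    def fit(self, X, y=None):
        # Learned from whatever data fit() receives, i.e. the training fold.
        self.divisor_ = np.asarray(X, dtype=float).max(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) / self.divisor_


x = pd.DataFrame(
    [[i + j * 10 for j in range(3)] for i in range(10)], columns=["A", "B", "C"]
)
y = x["A"] * 2.0  # made-up target so the search has something to score

pipe = Pipeline([("normalization", MaxDivisorScaler()), ("model", Ridge())])
gs = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0]}, cv=TimeSeriesSplit(n_splits=3))
gs.fit(x, y)

# After the search, the best pipeline is refitted on all of x.
fitted = gs.best_estimator_.named_steps["normalization"]
print(fitted.divisor_)
```

Because the divisor lives inside the transformer that uses it, no reference has to survive the copy made by the search.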
Here is a complete Python example that reproduces the problem:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

class DivisorTransform(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        print(f'{type(self).__name__} id {id(self)} fit')
        self.divisor_ = X.max()
        return self

    def transform(self, X):
        print(f'{type(self).__name__} id {id(self)} transform')
        return X

    def getDivisor(self):
        return self.divisor_

class NormalizationTransform(BaseEstimator, TransformerMixin):
    def __init__(self, divisorTransform, fakeParam):
        self.divTrns = divisorTransform
        self.fakeParam = fakeParam
        print(f'{type(self).__name__} id {id(self)} init saving {type(self.divTrns).__name__} at {id(self.divTrns)}')

    def fit(self, X, y=None):
        print(f'{type(self).__name__} id {id(self)} fit going to fit {type(self.divTrns).__name__} {id(self.divTrns)}')
        self.divisor_ = self.divTrns.fit(X).getDivisor()
        return self

    def transform(self, X):
        print(f'{type(self).__name__} id {id(self)} transform')
        res = X.copy()
        res = res / self.divisor_
        print('_______________________________________')
        print(res)
        return res

    def anti_transform(self, X):
        res = X.copy()
        res = res * self.divisor_
        return res

    def score(self, X, y=None, sample_weight=None):
        return 1

x = pd.DataFrame([[i+j*10 for j in range(3)] for i in range(10)], columns=['A','B','C'])
dvT = DivisorTransform()
print(type(dvT).__name__)
pipe = Pipeline([('divisor', dvT), ('normalization', NormalizationTransform(dvT, 0))])
res1 = pipe.fit_transform(x)
params = {'normalization__fakeParam': [0, 1]}
gs = GridSearchCV(pipe, params, cv=TimeSeriesSplit(n_splits=3).split(x))
print('Starting Grid Search')
gs.fit(x)
This produces the following output:
Starting Grid Search
NormalizationTransform id 140321510292896 init saving NoneType at 94405154462352
NormalizationTransform id 140321722266344 init saving NoneType at 94405154462352
This shows the problem.
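A possible reading of the `NoneType` lines in that output, offered as an assumption rather than a confirmed diagnosis: `sklearn.base.clone` rebuilds an estimator from `get_params()`, which looks up attributes named after the constructor arguments. `NormalizationTransform` stores its `divisorTransform` argument as `self.divTrns`, so that lookup may come back empty. The sketch below compares both cases with hypothetical classes (`Inner`, `MatchingName`, `DifferentName`); depending on the scikit-learn version, the mismatched case either loses the nested object or raises an error.

```python
from sklearn.base import BaseEstimator, clone


class Inner(BaseEstimator):
    """Hypothetical nested estimator with no parameters."""


class MatchingName(BaseEstimator):
    def __init__(self, inner=None):
        self.inner = inner  # attribute name equals the argument name


class DifferentName(BaseEstimator):
    def __init__(self, inner=None):
        self.stored = inner  # attribute name differs, like divTrns above


# With matching names the nested estimator survives the copy.
ok = clone(MatchingName(Inner()))
print(type(ok.inner).__name__)

# With a different name the nested estimator is not carried over.
try:
    lost = clone(DifferentName(Inner()))
    outcome = type(lost.stored).__name__
except (AttributeError, RuntimeError) as exc:
    outcome = type(exc).__name__
print(outcome)
```

If this reading is right, storing the argument as `self.divisorTransform` would let the copy keep a `DivisorTransform`, although it would still be a separate, unfitted instance from the one in the `divisor` step.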