AdaBoost in scikit-learn with one-dimensional weak learners

Date: 2017-12-07 19:06:54

Tags: python scikit-learn classification adaboost ensemble-learning

I am trying to reproduce some previous work in which an AdaBoost classifier is used for a two-class classification problem. There are 75 input dimensions, but each weak learner lives in only one of those dimensions.

All input values lie in [0, 1], and each weak classifier splits this interval into five sub-intervals: [0, 0.2], [0.2, 0.4], ..., [0.8, 1.0].

Within each sub-interval, the weak learner computes the sum of the weights of all samples with label "+1" (call this a) and the sum of the weights of all samples with label "-1" (call this b).

When we see a piece of test data, we check which sub-interval it falls into, and the weak learner in each dimension outputs 0.5*log((a + epsilon)/(b + epsilon)), where epsilon << a, b.
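
In code terms, the weak learner for a single dimension boils down to something like the following rough sketch (the function and variable names here are mine, chosen for illustration, not taken from the actual implementation further down):

import numpy as np

def weak_learner_outputs(x, y, sample_weight, epsilon=1e-6):
    """x: 1-D feature values in [0, 1]; y: labels in {-1, +1};
    sample_weight: one weight per sample. Returns one output value per sub-interval."""
    # 0..4: which of the five sub-intervals each value falls into
    bins = np.digitize(x, [0.2, 0.4, 0.6, 0.8])
    outputs = np.zeros(5)
    for k in range(5):
        in_bin = (bins == k)
        a = sample_weight[in_bin & (y == 1)].sum()   # total weight of +1 samples in this sub-interval
        b = sample_weight[in_bin & (y == -1)].sum()  # total weight of -1 samples in this sub-interval
        outputs[k] = 0.5 * np.log((a + epsilon) / (b + epsilon))
    return outputs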

AdaBoost is an iterative algorithm: for each set of sample weights, it needs to build a weak classifier as described above in every dimension. For each weak classifier we compute:

Z = sum over all sub-intervals of (total weight of +1-labelled points in the sub-interval * total weight of -1-labelled points in the sub-interval)^0.5

This value is small when, within each sub-interval, there is a large difference between the total weights of the two labels, i.e. when the weak classifier we have built discriminates well between the +1 and -1 labels. The next weak classifier chosen to join the strong classifier is the one with the smallest Z for the current set of sample weights. We then update the sample weights and pick a new best weak classifier.
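
The Z criterion and the choice of the best dimension can be sketched in the same spirit (again, these helper names are made up for illustration):

import numpy as np

def z_value(x, y, sample_weight):
    """Z for one dimension: sum over the five sub-intervals of sqrt(W+ * W-)."""
    bins = np.digitize(x, [0.2, 0.4, 0.6, 0.8])
    z = 0.0
    for k in range(5):
        in_bin = (bins == k)
        w_plus = sample_weight[in_bin & (y == 1)].sum()
        w_minus = sample_weight[in_bin & (y == -1)].sum()
        z += np.sqrt(w_plus * w_minus)
    return z

def best_dimension(X, y, sample_weight):
    """Index of the column whose weak learner has the smallest Z."""
    z_values = [z_value(X[:, j], y, sample_weight) for j in range(X.shape[1])]
    return int(np.argmin(z_values))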

I thought I could implement the basis of the weak classifier using BaseEstimator:

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from scipy import stats
import matplotlib.pyplot as plt
import math

class QuantileMajorityClassifier(BaseEstimator):

    def __init__(self, demo_param='demo'):
        self.demo_param = demo_param

    def fit(self, X, y, sample_weight):
        """Assume the input X is 1D because it will be when we use it later."""

        epsilon = 0.000001
        print type(epsilon)

        X = np.reshape(X, (X.shape[0], 1))
        y = np.reshape(y, (y.shape[0], 1))

        X, y = check_X_y(X, y)  # Check X and y have the correct shape

        X = np.reshape(X, (X.shape[0], 1))
        y = np.reshape(y, (y.shape[0], 1))

        self.classes_ = unique_labels(y)  # store the labels seen during fit

        self.X_ = X
        self.y_ = y

        if max(X) > 1.0:
            print "Error - Max X > 1.0"
        if min(X) < 0.0:
            print "Error - Min X < 0.0"

        X1Indices = np.where(X < 0.2)
        X2Indices = np.where((X < 0.4) & (X >= 0.2))
        X3Indices = np.where((X < 0.6) & (X >= 0.4))
        X4Indices = np.where((X < 0.8) & (X >= 0.6))
        X5Indices = np.where((X <= 1.0) & (X >= 0.8))

        y1 = y[X1Indices[0]]
        y1PlusIndices = np.where(y1 == 1)
        y1MinusIndices = np.where(y1 == -1)
        y1PlusWeights = sum(sample_weight[X1Indices[0]][y1PlusIndices]) + epsilon
        y1MinusWeights = sum(sample_weight[X1Indices[0]][y1MinusIndices]) + epsilon
        y1Fraction = y1PlusWeights / y1MinusWeights
        y1Output = 0.5 * math.log(y1Fraction)

        y2 = y[X2Indices[0]]
        y2PlusIndices = np.where(y2 == 1)
        y2MinusIndices = np.where(y2 == -1)
        y2PlusWeights = sum(sample_weight[X2Indices[0]][y2PlusIndices]) + epsilon
        y2MinusWeights = sum(sample_weight[X2Indices[0]][y2MinusIndices]) + epsilon
        y2Fraction = y2PlusWeights / y2MinusWeights
        y2Output = 0.5 * math.log(y2Fraction)

        y3 = y[X3Indices[0]]
        y3PlusIndices = np.where(y3 == 1)
        y3MinusIndices = np.where(y3 == -1)
        y3PlusWeights = sum(sample_weight[X3Indices[0]][y3PlusIndices]) + epsilon
        y3MinusWeights = sum(sample_weight[X3Indices[0]][y3MinusIndices]) + epsilon
        y3Fraction = y3PlusWeights / y3MinusWeights
        y3Output = 0.5 * math.log(y3Fraction)

        y4 = y[X4Indices[0]]
        y4PlusIndices = np.where(y4 == 1)
        y4MinusIndices = np.where(y4 == -1)
        y4PlusWeights = sum(sample_weight[X4Indices[0]][y4PlusIndices]) + epsilon
        y4MinusWeights = sum(sample_weight[X4Indices[0]][y4MinusIndices]) + epsilon
        y4Fraction = y4PlusWeights / y4MinusWeights
        y4Output = 0.5 * math.log(y4Fraction)

        y5 = y[X5Indices[0]]
        y5PlusIndices = np.where(y5 == 1)
        y5MinusIndices = np.where(y5 == -1)
        y5PlusWeights = sum(sample_weight[X5Indices[0]][y5PlusIndices]) + epsilon
        y5MinusWeights = sum(sample_weight[X5Indices[0]][y5MinusIndices]) + epsilon
        y5Fraction = y5PlusWeights / y5MinusWeights
        y5Output = 0.5 * math.log(y5Fraction)

        self.Class1Output_ = y1Output
        self.Class2Output_ = y2Output
        self.Class3Output_ = y3Output
        self.Class4Output_ = y4Output
        self.Class5Output_ = y5Output

        return self

    def predict(self, X):

        X = np.reshape(X, (X.shape[0], 1))
        check_is_fitted(self, ['X_', 'y_'])  # check that fit has been called

        X = check_array(X)  # input validation (just copying from GitHub)

        if max(X) > 1.0:
            print "Error - Max X > 1.0"
        if min(X) < 0.0:
            print "Error - Min X < 0.0"
        NumberOfTests = X.shape[0]

        TestOutput = np.empty((NumberOfTests, 1))
        print TestOutput.T

        for i in range(0, NumberOfTests):
            if X[i] < 0.2:
                TestOutput[i] = self.Class1Output_
            elif X[i][0] < 0.4 and X[i][0] >= 0.2:
                TestOutput[i] = self.Class2Output_
            elif X[i][0] < 0.6 and X[i][0] >= 0.4:
                TestOutput[i] = self.Class3Output_
            elif X[i] < 0.8 and X[i] >= 0.6:
                TestOutput[i] = self.Class4Output_
            elif X[i] <= 1.0 and X[i] >= 0.8:
                TestOutput[i] = self.Class5Output_

        return TestOutput

If I just pass it one-dimensional data directly, it behaves as I intend, although at this stage I am slightly unsure about exactly what I want predict() to return.
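
For example, a direct 1-D call looks roughly like this (uniform sample weights, passed as a column vector so that the indexing inside fit lines up):

import numpy as np

x1d = np.random.rand(10)                     # ten 1-D values in [0, 1]
labels = np.random.choice([-1, 1], size=10)
weights = np.full((10, 1), 0.1)              # uniform sample weights as a column vector

clf = QuantileMajorityClassifier()
clf.fit(x1d, labels, sample_weight=weights)
print(clf.predict(x1d).T)                    # one log-ratio output per sample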

My problem really arises when I get to:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from QuantileMajorityClassifier import QuantileMajorityClassifier

TrainingInput = np.arange(20).reshape(10, 2) / 20.0  # Just some arbitrary data
TrainingLabel = np.random.choice([-1, 1], size=(10, 1), p=[1./3, 2./3])
Classifier = AdaBoostClassifier(QuantileMajorityClassifier(demo_param='test'), algorithm="SAMME", n_estimators=2)
Classifier.fit(TrainingInput, TrainingLabel)

I (unsurprisingly) get the following error:

Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/ProjectWork/n-lasr-project/MajorityClassifierTester.py", line 39, in <module>
    Classifier.fit(TrainingInput,TrainingLabel)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\weight_boosting.py", line 413, in fit
    return super(AdaBoostClassifier, self).fit(X, y, sample_weight)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\weight_boosting.py", line 145, in fit
    random_state)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\weight_boosting.py", line 477, in _boost
    random_state)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\weight_boosting.py", line 541, in _boost_discrete
    estimator.fit(X, y, sample_weight=sample_weight)
  File "C:\Users\Administrator\Desktop\ProjectWork\n-lasr-project\QuantileMajorityClassifier.py", line 19, in fit
    X = np.reshape(X,(X.shape[0],1))
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 232, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 20 into shape (10,1)

How do I tell AdaBoost that each weak classifier should only be applied to a single dimension?
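
One direction I can think of is to let the weak classifier accept the full (n_samples, n_features) X and choose a single column for itself inside fit, e.g. the column with the smallest Z, then slice out that same column again in predict. Something like this untested sketch (it reuses the made-up helpers sketched further up, and only deals with the reshape error, not with what predict should ultimately return):

class SingleFeatureQuantileClassifier(QuantileMajorityClassifier):

    def fit(self, X, y, sample_weight):
        X = np.asarray(X)
        w = np.ravel(sample_weight)
        # choose the column whose weak learner separates the labels best (smallest Z)
        self.feature_ = best_dimension(X, np.ravel(y), w)
        # the original fit indexes the weights with 2-D index tuples, so pass them as a column
        return super(SingleFeatureQuantileClassifier, self).fit(
            X[:, self.feature_], np.ravel(y), w.reshape(-1, 1))

    def predict(self, X):
        X = np.asarray(X)
        return super(SingleFeatureQuantileClassifier, self).predict(X[:, self.feature_])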

Many thanks!

0 Answers:

No answers yet.