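I am trying to replicate previous work in which an AdaBoost classifier is used for a two-class classification problem. There are 75 input dimensions, but each weak learner operates on only one dimension.
All input values lie in [0, 1], and each weak classifier splits this interval into five subintervals: [0, 0.2), [0.2, 0.4), ..., [0.8, 1.0].
Within each subinterval, the weak learner computes the sum of the weights of all samples with label +1, a, and the sum of the weights of all samples with label -1, b.
When we see a piece of test data, we check which subinterval it falls into, and the weak learner for that dimension outputs 0.5 * log((a + epsilon) / (b + epsilon)), where epsilon << a, b.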
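To make the per-subinterval computation concrete, here is a minimal sketch for a single dimension. The function name `bin_outputs` and the use of `np.digitize` / `np.bincount` are my own illustration, not taken from the original work, and I am assuming labels are exactly +1 and -1.

```python
import numpy as np

# Interior edges of the five subintervals [0, 0.2), [0.2, 0.4), ..., [0.8, 1.0].
EDGES = np.array([0.2, 0.4, 0.6, 0.8])

def bin_outputs(x, y, sample_weight, epsilon=1e-6):
    """Return 0.5 * log((a + eps) / (b + eps)) for each of the five subintervals.

    a and b are the total weights of the +1 and -1 samples in a subinterval.
    """
    x = np.ravel(x)
    y = np.ravel(y)
    w = np.ravel(sample_weight)
    # Subinterval index 0..4 for every sample; a value of exactly 1.0 lands in bin 4.
    bins = np.digitize(x, EDGES)
    a = np.bincount(bins, weights=w * (y == 1), minlength=5)
    b = np.bincount(bins, weights=w * (y == -1), minlength=5)
    return 0.5 * np.log((a + epsilon) / (b + epsilon))

# Tiny example: four equally weighted samples.
x = np.array([0.1, 0.1, 0.5, 0.9])
y = np.array([1, -1, 1, -1])
w = np.full(4, 0.25)
out = bin_outputs(x, y, w)
```

In this example the first subinterval has equal +1 and -1 weight, so its output is zero. The subinterval containing 0.5 gets a positive output and the one containing 0.9 a negative output.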
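AdaBoost is an iterative algorithm: for each set of sample weights, it needs to build a weak classifier as described above in every dimension. For each weak classifier we compute:
Z = sum over all subintervals of (total weight of +1-labelled points in the subinterval * total weight of -1-labelled points in the subinterval)^0.5
This value is small when, within every subinterval, there is a large difference between the total weights of the two labels, i.e. when the weak classifier we have built separates the +1 and -1 labels well. The next weak classifier chosen to become part of the strong classifier is the one with the smallest Z value for the current set of sample weights. We then update the sample weights and select a new best weak classifier.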
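As a rough sketch of the selection step, the following computes Z for every column and picks the column with the smallest value. The helper names `z_value` and `best_feature` are hypothetical, and the code assumes inputs in [0, 1] and labels in {-1, +1}.

```python
import numpy as np

# Interior edges of the five subintervals of [0, 1].
EDGES = np.array([0.2, 0.4, 0.6, 0.8])

def z_value(x, y, sample_weight):
    """Z = sum over subintervals of sqrt(W_plus * W_minus) for one dimension."""
    bins = np.digitize(np.ravel(x), EDGES)
    y = np.ravel(y)
    w = np.ravel(sample_weight)
    w_plus = np.bincount(bins, weights=w * (y == 1), minlength=5)
    w_minus = np.bincount(bins, weights=w * (y == -1), minlength=5)
    return float(np.sqrt(w_plus * w_minus).sum())

def best_feature(X, y, sample_weight):
    """Return (column index, Z) for the column of X with the smallest Z."""
    z = [z_value(X[:, j], y, sample_weight) for j in range(X.shape[1])]
    j = int(np.argmin(z))
    return j, z[j]

# Column 0 separates the labels perfectly; column 1 carries no information.
X = np.array([[0.1, 0.5], [0.1, 0.5], [0.9, 0.5], [0.9, 0.5]])
y = np.array([1, 1, -1, -1])
w = np.full(4, 0.25)
j_best, z_best = best_feature(X, y, w)
```

Here column 0 has Z = 0 because every subinterval contains only one label, while column 1 has Z = 0.5, so column 0 is selected.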
I thought I could implement the weak classifier on top of BaseEstimator:
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from scipy import stats
import matplotlib.pyplot as plt
import math


class QuantileMajorityClassifier(BaseEstimator):
    def __init__(self, demo_param='demo'):
        self.demo_param = demo_param

    def fit(self, X, y, sample_weight):
        """Assume the input X is 1D because it will be when we use it later."""
        epsilon = 0.000001
        print(type(epsilon))
        X = np.reshape(X, (X.shape[0], 1))
        y = np.reshape(y, (y.shape[0], 1))
        X, y = check_X_y(X, y)  # Check X and y have the correct shape
        X = np.reshape(X, (X.shape[0], 1))
        y = np.reshape(y, (y.shape[0], 1))
        # Column vector, so the (row, col) index tuples from np.where below apply to it
        sample_weight = np.reshape(sample_weight, (-1, 1))
        self.classes_ = unique_labels(y)  # store the labels seen during fit
        self.X_ = X
        self.y_ = y
        if X.max() > 1.0:
            print("Error - Max X > 1.0")
        if X.min() < 0.0:
            print("Error - Min X < 0.0")
        # Row indices of the samples falling into each of the five subintervals
        X1Indices = np.where(X < 0.2)
        X2Indices = np.where((X < 0.4) & (X >= 0.2))
        X3Indices = np.where((X < 0.6) & (X >= 0.4))
        X4Indices = np.where((X < 0.8) & (X >= 0.6))
        X5Indices = np.where((X <= 1.0) & (X >= 0.8))
        # For each subinterval: total +1 weight, total -1 weight, then 0.5 * log of the ratio
        y1 = y[X1Indices[0]]
        y1PlusIndices = np.where(y1 == 1)
        y1MinusIndices = np.where(y1 == -1)
        y1PlusWeights = sum(sample_weight[X1Indices[0]][y1PlusIndices]) + epsilon
        y1MinusWeights = sum(sample_weight[X1Indices[0]][y1MinusIndices]) + epsilon
        y1Fraction = y1PlusWeights / y1MinusWeights
        y1Output = 0.5 * math.log(y1Fraction)
        y2 = y[X2Indices[0]]
        y2PlusIndices = np.where(y2 == 1)
        y2MinusIndices = np.where(y2 == -1)
        y2PlusWeights = sum(sample_weight[X2Indices[0]][y2PlusIndices]) + epsilon
        y2MinusWeights = sum(sample_weight[X2Indices[0]][y2MinusIndices]) + epsilon
        y2Fraction = y2PlusWeights / y2MinusWeights
        y2Output = 0.5 * math.log(y2Fraction)
        y3 = y[X3Indices[0]]
        y3PlusIndices = np.where(y3 == 1)
        y3MinusIndices = np.where(y3 == -1)
        y3PlusWeights = sum(sample_weight[X3Indices[0]][y3PlusIndices]) + epsilon
        y3MinusWeights = sum(sample_weight[X3Indices[0]][y3MinusIndices]) + epsilon
        y3Fraction = y3PlusWeights / y3MinusWeights
        y3Output = 0.5 * math.log(y3Fraction)
        y4 = y[X4Indices[0]]
        y4PlusIndices = np.where(y4 == 1)
        y4MinusIndices = np.where(y4 == -1)
        y4PlusWeights = sum(sample_weight[X4Indices[0]][y4PlusIndices]) + epsilon
        y4MinusWeights = sum(sample_weight[X4Indices[0]][y4MinusIndices]) + epsilon
        y4Fraction = y4PlusWeights / y4MinusWeights
        y4Output = 0.5 * math.log(y4Fraction)
        y5 = y[X5Indices[0]]
        y5PlusIndices = np.where(y5 == 1)
        y5MinusIndices = np.where(y5 == -1)
        y5PlusWeights = sum(sample_weight[X5Indices[0]][y5PlusIndices]) + epsilon
        y5MinusWeights = sum(sample_weight[X5Indices[0]][y5MinusIndices]) + epsilon
        y5Fraction = y5PlusWeights / y5MinusWeights
        y5Output = 0.5 * math.log(y5Fraction)
        self.Class1Output_ = y1Output
        self.Class2Output_ = y2Output
        self.Class3Output_ = y3Output
        self.Class4Output_ = y4Output
        self.Class5Output_ = y5Output
        return self

    def predict(self, X):
        X = np.reshape(X, (X.shape[0], 1))
        check_is_fitted(self, ['X_', 'y_'])  # check that fit has been called
        X = check_array(X)  # input validation (just copying from GitHub)
        if X.max() > 1.0:
            print("Error - Max X > 1.0")
        if X.min() < 0.0:
            print("Error - Min X < 0.0")
        NumberOfTests = X.shape[0]
        TestOutput = np.empty((NumberOfTests, 1))
        print(TestOutput.T)
        # Look up the stored output of the subinterval each test point falls into
        for i in range(0, NumberOfTests):
            if X[i][0] < 0.2:
                TestOutput[i] = self.Class1Output_
            elif X[i][0] < 0.4 and X[i][0] >= 0.2:
                TestOutput[i] = self.Class2Output_
            elif X[i][0] < 0.6 and X[i][0] >= 0.4:
                TestOutput[i] = self.Class3Output_
            elif X[i][0] < 0.8 and X[i][0] >= 0.6:
                TestOutput[i] = self.Class4Output_
            elif X[i][0] <= 1.0 and X[i][0] >= 0.8:
                TestOutput[i] = self.Class5Output_
        return TestOutput
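If I just pass one-dimensional data directly, I can make this do what I want, although at this stage I am slightly unsure what I want predict() to return.
My problem actually arises when I come to do this: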
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from QuantileMajorityClassifier import QuantileMajorityClassifier

TrainingInput = np.arange(20).reshape(10, 2) / 20.0  # Just some arbitrary data
TrainingLabel = np.random.choice([-1, 1], size=(10, 1), p=[1./3, 2./3])
Classifier = AdaBoostClassifier(QuantileMajorityClassifier(demo_param='test'), algorithm="SAMME", n_estimators=2)
Classifier.fit(TrainingInput, TrainingLabel)
I (unsurprisingly) get the following error:
Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/ProjectWork/n-lasr-project/MajorityClassifierTester.py", line 39, in <module>
    Classifier.fit(TrainingInput,TrainingLabel)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\weight_boosting.py", line 413, in fit
    return super(AdaBoostClassifier, self).fit(X, y, sample_weight)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\weight_boosting.py", line 145, in fit
    random_state)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\weight_boosting.py", line 477, in _boost
    random_state)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\weight_boosting.py", line 541, in _boost_discrete
    estimator.fit(X, y, sample_weight=sample_weight)
  File "C:\Users\Administrator\Desktop\ProjectWork\n-lasr-project\QuantileMajorityClassifier.py", line 19, in fit
    X = np.reshape(X,(X.shape[0],1))
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 232, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 20 into shape (10,1)
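As far as I can tell, the failure can be reproduced with NumPy alone. The traceback shows fit() being handed an array of size 20, i.e. the whole 10x2 training matrix rather than a single column. A minimal sketch (the variable names are only for illustration):

```python
import numpy as np

# Same arbitrary 10x2 training matrix as in my test script.
X = np.arange(20).reshape(10, 2) / 20.0

# This is the reshape my fit() performs; it only works if X holds one column.
try:
    np.reshape(X, (X.shape[0], 1))
    message = None
except ValueError as exc:
    message = str(exc)

# Slicing out one column first gives the (10, 1) shape that fit() assumes.
column = X[:, 0].reshape(-1, 1)
```

So the reshape itself is behaving correctly. The real issue is that the weak learner receives every column at once.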
How do I tell AdaBoost that the weak classifier can only be applied to a single dimension?
Many thanks!