Introduction
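I am working on a binary classification task with a very imbalanced dataset (roughly 1000 instances of class 1 and roughly 1000 instances of class 0), and I am experimenting with the imbalanced-learn library to create balanced samples for the classification algorithm.
My idea is to use bagging to build several classifiers, each trained on a balanced sample drawn from the dataset. The imbalanced-learn library provides BalancedBaggingClassifier, which is similar to BaggingClassifier in scikit-learn but is supposed to create balanced samples of the input data automatically.
The 'ratio' parameter of BalancedBaggingClassifier is supposed to control how the data are sampled for the bagged classifiers: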
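ratio : str, dict, or callable, optional (default='auto')
Ratio to use for resampling the data set.
If str, has to be one of: (i) 'minority': resample the minority class; (ii) 'majority': resample the majority class; (iii) 'not minority': resample all classes apart from the minority class; (iv) 'all': resample all classes; and (v) 'auto': corresponds to 'all' for over-sampling methods and 'not minority' for under-sampling methods. The targeted classes will be over-sampled or under-sampled to achieve an equal number of samples with the majority or minority class.
If dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples.
If callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples.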
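As a reading aid, the string options quoted above can be sketched in plain Python for the under-sampling case. This is only an illustration of the docstring as I read it, not imbalanced-learn's implementation; `undersampling_targets` is a hypothetical helper name, and the rule "every targeted class is reduced to the minority-class count" is my assumption about what the docstring describes.

```python
from collections import Counter


def undersampling_targets(y, ratio="auto"):
    """Hypothetical helper: map a 'ratio' string to {class: target count}.

    Assumption: when under-sampling, every targeted class is reduced to
    the number of samples in the minority class.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    n_min = counts[minority]

    if ratio == "auto":
        # The docstring says 'auto' means 'not minority' for under-sampling.
        ratio = "not minority"
    if ratio == "minority":
        targeted = [minority]
    elif ratio == "majority":
        targeted = [majority]
    elif ratio == "not minority":
        targeted = [c for c in counts if c != minority]
    elif ratio == "all":
        targeted = list(counts)
    else:
        raise ValueError("unsupported ratio: %r" % (ratio,))
    return {c: n_min for c in targeted}


# Toy labels: 20 instances of class 0 and 10 instances of class 1.
y = [0] * 20 + [1] * 10
targets = undersampling_targets(y, "majority")
print(targets)
```

Under this reading, 'majority' would target only class 0 and leave class 1 untouched.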
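My interpretation is this: say I want to create samples containing 1000 instances of class 1 and 1000 instances of class 0. I would set ratio='majority'; class 1 would then not be resampled, while for class 0 I would get a new sample for each bagged classifier. However, that does not seem to be how it works. Instead, however I set the 'ratio' and 'max_samples' parameters, it looks to me as if all classes are always sampled and the resulting samples are not balanced with respect to the class labels.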
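To pin down the behaviour I expected, here is a minimal plain-Python sketch of it. It does not use imbalanced-learn at all; `expected_bags` is a hypothetical helper, and it assumes binary labels 0 and 1 with class 1 as the minority class.

```python
import random


def expected_bags(y, n_estimators, seed=0):
    """Hypothetical helper: one list of row indices per bagged estimator.

    Assumes binary labels where class 1 is the minority class.
    """
    rng = random.Random(seed)
    idx_0 = [i for i, label in enumerate(y) if label == 0]
    idx_1 = [i for i, label in enumerate(y) if label == 1]
    bags = []
    for _ in range(n_estimators):
        # Keep every minority row; draw a fresh majority subset of equal
        # size, without replacement, for each estimator.
        drawn_0 = rng.sample(idx_0, len(idx_1))
        bags.append(sorted(idx_1 + drawn_0))
    return bags


# Toy labels: 20 instances of class 0 and 10 instances of class 1.
y = [0] * 20 + [1] * 10
bags = expected_bags(y, n_estimators=3)
for bag in bags:
    n_ones = sum(y[i] for i in bag)
    print("bag size=%d, class 1=%d, class 0=%d" % (len(bag), n_ones, len(bag) - n_ones))
```

Each bag here contains all 10 rows of class 1 plus 10 freshly drawn rows of class 0, which is the kind of sample I expected each base estimator to be trained on.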
Minimal working example
Below is my attempt at a minimal working example:
import numpy as np
from sklearn.linear_model import LogisticRegression
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn import preprocessing
from imblearn.under_sampling import RandomUnderSampler
#dataset size
N_features = 1
N_samples_0 = 20
N_samples_1 = 10
X = np.zeros(shape=(N_samples_0+N_samples_1,N_features))
y = np.zeros(shape=(N_samples_0+N_samples_1,1))
#parameters for classifier
n_estimators = 1
max_samples = 1.0
ratio = 'majority'
replacement = False
n_jobs = 1
seed = 123
#initialise classifiers
model = BalancedBaggingClassifier(
    base_estimator=LogisticRegression(penalty='l2', tol=0.01, class_weight='balanced', verbose=1, n_jobs=1),
    n_estimators=n_estimators,
    max_samples=max_samples,
    ratio=ratio,
    replacement=replacement,
    n_jobs=n_jobs,
    verbose=1000,
    random_state=seed)
print(model)
#create imbalanced training set
mus_0 = [0+i for i in range(0,N_features)]
sigmas_0 = [0.5 for i in range(0,N_features)]
mus_1 = [max(mus_0)+1+i for i in range(0,N_features)]
sigmas_1 = [0.1+i for i in range(0,N_features)]
#class 0
for i in range(0,N_features):
    X[:N_samples_0,i] = np.random.normal(loc=mus_0[i],scale=sigmas_0[i],size=(N_samples_0))
#class 1
for i in range(0,N_features):
    X[N_samples_0:,i] = np.random.normal(loc=mus_1[i],scale=sigmas_1[i],size=(N_samples_1))
y[N_samples_0:,:] = 1
print("original data:")
for i in range(0,N_samples_0+N_samples_1):
    # print features and label on one line
    print("%s %s" % (X[i,:], y[i]))
#scaling and fitting
scaler = preprocessing.StandardScaler().fit(X)
model.fit(scaler.transform(X),y.ravel())
for array in model.estimators_samples_:
    print("sample that should be balanced:")
    X_sample = X[array]
    y_sample = y[array.ravel()]
    for i in range(0,X_sample.shape[0]):
        # print features and label on one line
        print("%s %s" % (X_sample[i,:], y_sample[i]))
    print("sample size="+str(X_sample.shape[0]))
This produces the following output:
BalancedBaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
solver='liblinear', tol=0.01, verbose=1, warm_start=False),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=1, n_jobs=1, oob_score=False,
random_state=123, ratio='majority', replacement=False,
verbose=1000, warm_start=False)
original data:
[ 0.40428533] [ 0.]
[-0.33618956] [ 0.]
[ 0.21882232] [ 0.]
[ 0.12895395] [ 0.]
[ 0.19926973] [ 0.]
[ 0.31555022] [ 0.]
[-0.6319698] [ 0.]
[-0.3728038] [ 0.]
[-0.4149535] [ 0.]
[-0.50807259] [ 0.]
[ 0.23080791] [ 0.]
[-0.39007103] [ 0.]
[-0.78354529] [ 0.]
[-0.49436092] [ 0.]
[-0.5353324] [ 0.]
[ 0.64506565] [ 0.]
[-0.15218196] [ 0.]
[ 0.30436546] [ 0.]
[-0.26640216] [ 0.]
[ 1.06123697] [ 0.]
[ 0.90361633] [ 1.]
[ 0.96685144] [ 1.]
[ 0.82633085] [ 1.]
[ 1.01901688] [ 1.]
[ 1.01441932] [ 1.]
[ 1.09587832] [ 1.]
[ 0.88554739] [ 1.]
[ 0.94325122] [ 1.]
[ 0.99280218] [ 1.]
[ 0.96425554] [ 1.]
Building estimator 1 of 1 for this parallel run (total 1)...
[LibLinear]iter 1 act 7.788e+00 pre 7.105e+00 delta 1.821e+00 f 1.525e+01 |g| 8.874e+00 CG 2
iter 2 act 3.941e-01 pre 3.682e-01 delta 1.821e+00 f 7.461e+00 |g| 1.482e+00 CG 2
iter 3 act 4.836e-03 pre 4.797e-03 delta 1.821e+00 f 7.067e+00 |g| 1.506e-01 CG 2
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
sample that should be balanced:
[ 0.40428533] [ 0.]
[-0.33618956] [ 0.]
[ 0.21882232] [ 0.]
[ 0.19926973] [ 0.]
[-0.6319698] [ 0.]
[-0.50807259] [ 0.]
[ 0.23080791] [ 0.]
[-0.39007103] [ 0.]
[ 0.64506565] [ 0.]
[-0.15218196] [ 0.]
[ 0.30436546] [ 0.]
[ 0.96685144] [ 1.]
[ 0.82633085] [ 1.]
[ 1.01441932] [ 1.]
[ 1.09587832] [ 1.]
[ 0.88554739] [ 1.]
[ 0.99280218] [ 1.]
[ 0.96425554] [ 1.]
sample size=18
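So, unless I have made some silly mistake, the classes do not appear to be balanced. On top of that, although I set ratio='majority', both classes appear to have been sampled. When I use my own dataset, my code crashes because the samples created by BalancedBaggingClassifier contain only instances of the majority class.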
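To make the imbalance concrete, the labels of the printed sample can simply be counted. The snippet below hard-codes the labels of the 18 rows listed under "sample that should be balanced:" in the output above (11 rows of class 0 followed by 7 rows of class 1), so it is only a tally of what is already shown.

```python
from collections import Counter

# Labels of the 18 rows printed under "sample that should be balanced:".
y_sample = [0.0] * 11 + [1.0] * 7

counts = Counter(y_sample)
for label in sorted(counts):
    print("class %d: %d of %d rows" % (label, counts[label], len(y_sample)))
```

That is 11 rows of class 0 against 7 rows of class 1, not the equal split I expected.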
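Specific question
How do I parameterize BalancedBaggingClassifier so that it creates balanced samples (by which I mean samples in which the number of instances of class 0 and class 1 is at least roughly the same)?
Bonus question
From the description of the 'ratio' parameter, I understood that BalancedBaggingClassifier can be used to under-sample one class and over-sample the other. From the source code, however, I can only gather that it uses RandomUnderSampler. Is it possible to do both over-sampling and under-sampling, or have I misunderstood the documentation?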
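To clarify what I mean by combining the two, here is a plain-Python sketch that under-samples the larger class and over-samples the smaller one to a common size. It is not based on BalancedBaggingClassifier or RandomUnderSampler; `resample_to` and its `n_per_class` parameter are hypothetical names chosen for this illustration.

```python
import random
from collections import Counter


def resample_to(y, n_per_class, seed=0):
    """Hypothetical helper: row indices with n_per_class rows per class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    chosen = []
    for label in sorted(by_class):
        idx = by_class[label]
        if len(idx) >= n_per_class:
            # Under-sample: draw without replacement.
            chosen.extend(rng.sample(idx, n_per_class))
        else:
            # Over-sample: draw with replacement, so rows may repeat.
            chosen.extend(rng.choice(idx) for _ in range(n_per_class))
    return chosen


# Toy labels: 20 instances of class 0 and 10 instances of class 1.
y = [0] * 20 + [1] * 10
indices = resample_to(y, n_per_class=15)
balanced = Counter(y[i] for i in indices)
print(dict(balanced))
```

Here class 0 is reduced from 20 to 15 rows and class 1 is inflated from 10 to 15 rows, which is the kind of combined resampling I am asking about.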