Question

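I am using Python's sklearn + xgboost modules to solve a classification problem. I have a highly imbalanced dataset, with 92% of the records in class 0 and 8% in class 1. The training dataset can be downloaded here: http://www.filedropper.com/kangarootrain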

I cannot use the numclaims and claimcst0 variables in this dataset. The variables in the dataset are: id, claimcst0, veh_value, exposure, veh_body, veh_age, gender, area, agecat, clm, numclaims

gender, area and agecat are categorical variables and the rest are continuous variables. id is the ID of the record.

First few records:

id,claimcst0,veh_value,exposure,veh_body,veh_age,gender,area,agecat,clm,numclaims
1,0,6.43,0.241897754,STNWG,1,M,A,3,0,0
2,0,4.46,0.856522757,STNWG,1,M,A,3,0,0
3,0,1.7,0.417516596,HBACK,1,M,A,4,0,0
4,0,0.48,0.626974524,SEDAN,4,F,A,6,0,0
5,0,1.96,0.089770031,HBACK,1,F,A,2,0,0
6,0,1.78,0.25654335,HBACK,2,M,A,3,0,0
7,0,2.7,0.688128611,UTE,2,M,A,1,0,0
8,0,0.94,0.912765859,STNWG,4,M,A,2,0,0
9,0,1.98,0.157753423,SEDAN,2,M,A,4,0,0

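I have tried several approaches to predict 'clm', which is my target variable: knn, RF, svm and nb. I even tried subsampling the data. But no matter what I do, the predictions do not get better. With trees/boosting I reach 93% accuracy, but only because the model predicts all the 0s correctly.

The model wrongly predicts every 1 as 0.

Any help would be much appreciated. Here is the basic code I tried for NB.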

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB

clfnb = GaussianNB()
clfnb.fit(x_train, y_train)
pred = clfnb.predict(x_test)
# print(set(pred))
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))

0.92816091954
[[8398    0]
[ 650    0]]

Answer 1

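This is a very common challenge when your two classes are imbalanced. To avoid a model that predicts only one class, train on a balanced set. There are several solutions; the most basic is to sample the data evenly. Since you have about 1,500 samples of class 1, you should also take 1,500 samples of class 0.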

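import pandas as pd

n = 1500
# Draw n rows from each class without replacement (.ix is no longer available in pandas)
sample_yes = data.loc[data.y == 1].sample(n=n, replace=False, random_state=0)
sample_no = data.loc[data.y == 0].sample(n=n, replace=False, random_state=0)
df = pd.concat([sample_yes, sample_no])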

Here data is the original data frame. You should do this before splitting the data into training and test sets.

Answer 2

I had this same problem, if not worse. One solution I found was to oversample the 1s, as described here:

http://www.data-mining-blog.com/tips-and-tutorials/overrepresentation-oversampling/

https://yiminwu.wordpress.com/2013/12/03/how-to-undo-oversampling-explained/
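
As a rough illustration of what oversampling the 1s means, here is a minimal sketch using only NumPy. The function name oversample_minority and the toy data are made up for this example and do not come from the linked posts; it simply duplicates minority rows at random until every class is as large as the biggest one.

```python
import numpy as np

def oversample_minority(X, y, random_state=0):
    """Randomly duplicate rows of the smaller classes until all classes are equal in size."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        if count < target:
            # Sample the missing rows with replacement from this class
            extra = rng.choice(cls_idx, size=target - count, replace=True)
            cls_idx = np.concatenate([cls_idx, extra])
        picked.append(cls_idx)
    picked = np.concatenate(picked)
    rng.shuffle(picked)
    return X[picked], y[picked]

# Hypothetical toy data with the same 92% / 8% split as in the question
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 92 + [1] * 8)

X_bal, y_bal = oversample_minority(X, y)
print(np.bincount(y_bal))  # [92 92]
```

If you try this, it is probably safer to oversample only the training portion, so that the test set keeps the original class distribution. Also keep in mind that a model trained on oversampled data tends to output probabilities that are shifted toward the oversampled class.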

Answer 3

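You can set the class_weight parameter for imbalanced datasets. For example, in this case, since label 1 makes up only 8% of the data, that label should be given a higher weight during classification.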

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

class_weight : {dict, ‘balanced’}, optional Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
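
To make the quoted formula concrete, here is a small sketch, not a definitive recipe. The helper balanced_class_weights and the toy data are assumptions made for illustration; the class counts 8398 and 650 are taken from the confusion matrix in the question.

```python
import numpy as np
from sklearn.svm import SVC

def balanced_class_weights(y):
    """n_samples / (n_classes * np.bincount(y)); assumes labels are 0, 1, ... with none missing."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Class counts from the confusion matrix in the question: 8398 zeros, 650 ones
y_question = np.array([0] * 8398 + [1] * 650)
print(balanced_class_weights(y_question))  # roughly [0.54, 6.96]

# Hypothetical toy data with a similar imbalance, small enough to fit quickly
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(460, 2)),  # class 0
    rng.normal(2.0, 1.0, size=(40, 2)),   # class 1
])
y = np.array([0] * 460 + [1] * 40)

clf = SVC(class_weight="balanced")
clf.fit(X, y)
# The per-class multipliers of C that the fitted model actually used
print(clf.class_weight_)
```

With the counts from the question, a mistake on a class 1 sample is penalized roughly 13 times more heavily than a mistake on a class 0 sample.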

Answer 4

For imbalanced datasets, I used sample weights in Xgboost, where the weights are an array assigned according to the class each data point belongs to.

import numpy as np

def CreateBalancedSampleWeights(y_train, largest_class_weight_coef):
    # y_train must be a 1-D numpy array of non-negative integer labels (0, 1, ...)
    classes = np.unique(y_train, axis=0)
    classes.sort()
    class_samples = np.bincount(y_train)
    total_samples = class_samples.sum()
    n_classes = len(class_samples)
    # "Balanced" weights: total / (n_classes * count of each class)
    weights = total_samples / (n_classes * class_samples * 1.0)
    class_weight_dict = {key: value for (key, value) in zip(classes, weights)}
    # Scale the weight of class 1 by the supplied coefficient
    class_weight_dict[classes[1]] = class_weight_dict[classes[1]] * largest_class_weight_coef
    sample_weights = [class_weight_dict[y] for y in y_train]
    return sample_weights

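Just pass in the target column and the occurrence rate of the most frequent class (for example, 0.75 if the most frequent class has 75 out of 100 samples).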

largest_class_weight_coef = max(df_copy['Category'].value_counts().values) / df.shape[0]

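from xgboost import XGBClassifier

# pass y_train as a numpy array
weight = CreateBalancedSampleWeights(y_train, largest_class_weight_coef)

# And then use it like this: XGBClassifier takes per-sample weights in fit(),
# not in the constructor (x_train is the training feature matrix, as in the question)
xg = XGBClassifier(n_estimators=1000, max_depth=20)
xg.fit(x_train, y_train, sample_weight=weight)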

That's it :) Your model will now give more weight to the data from the less frequent class.
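
For comparison, scikit-learn ships a helper that builds balanced per-sample weights directly. The sketch below is an assumption-laden illustration, not the approach from this answer: it uses compute_sample_weight without the extra coefficient, a RandomForestClassifier as a stand-in estimator, and made-up toy data. XGBClassifier.fit accepts the same sample_weight keyword.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical toy data: 460 negatives and 40 positives (92% / 8%)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(460, 3)),  # class 0
    rng.normal(1.0, 1.0, size=(40, 3)),   # class 1
])
y = np.array([0] * 460 + [1] * 40)

# One weight per row: n_samples / (n_classes * count of that row's class)
sample_weight = compute_sample_weight(class_weight="balanced", y=y)

# After weighting, both classes carry the same total weight
total_weight_0 = sample_weight[y == 0].sum()
total_weight_1 = sample_weight[y == 1].sum()
print(total_weight_0, total_weight_1)

# Stand-in estimator; the weights are passed at fit time
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
```

Passing the weights at fit time leaves the data itself untouched, which can be convenient when you do not want to duplicate or drop rows.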

Machine learning: classification on imbalanced data
