I am trying to balance my data: my target variable is multi-class, and I want to oversample it so the classes are balanced

Time: 2019-11-15 07:17:15

Tags: python machine-learning data-science

`print(x)`: here `x` holds the independent variables.

    Restaurant  Cuisines    Average_Cost    Rating  Votes   Reviews Area
    0   3.526361    0.693147    5.303305    1.504077    2.564949    1.609438    7.214504
    1   1.386294    4.127134    4.615121    1.504077    2.484907    1.609438    5.905362
    2   2.772589    1.386294    5.017280    1.526056    4.605170    3.433987    6.131226
    3   3.912023    2.833213    5.525453    1.547563    5.176150    4.564348    7.643483
    4   3.526361    2.708050    5.303305    1.435085    5.948035    5.046646    6.126869
    ... ... ... ... ... ... ... ...
    11089   3.912023    0.693147    5.525453    1.648659    5.789960    5.046646    3.135494
    11090   1.386294    6.028279    4.615121    1.526056    3.610918    2.833213    7.643483
    11091   1.386294    2.397895    4.615121    1.504077    3.828641    2.944439    5.814131
    11092   1.386294    6.028279    4.615121    1.410987    3.218876    2.302585    5.905362
    11093   1.386294    6.028279    4.615121    1.029619    0.000000    0.000000    5.564520
    11094 rows × 7 columns

`print(y.value_counts())`

Here `y` is the target variable, and it has multiple classes.

    30 minutes     7406
    45 minutes     2665
    65 minutes      923
    120 minutes      62
    20 minutes       20
    80 minutes       14
    10 minutes        4
    Name: Delivery_Time, dtype: int64

Looking at the target variable, we can see that the "30 minutes" class has far more samples than any other class.
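To put numbers on that imbalance, here is a small sketch that uses only the counts printed by `y.value_counts()`. The variable names (`counts`, `shares`, `imbalance_ratio`) are just illustrative and are not part of my actual code.

```python
# Class counts copied from the y.value_counts() output in the question.
counts = {
    "30 minutes": 7406,
    "45 minutes": 2665,
    "65 minutes": 923,
    "120 minutes": 62,
    "20 minutes": 20,
    "80 minutes": 14,
    "10 minutes": 4,
}

total = sum(counts.values())

# Fraction of all rows that belongs to each class.
shares = {label: n / total for label, n in counts.items()}

# How many times larger the biggest class is than the smallest one.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:>12}: {share:6.2%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

The smallest class has only 4 rows, which matters for any neighbour-based oversampler.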

To balance the classes, I tried SMOTETomek to oversample my data. Below are the code I used and the error I got.

    from imblearn.combine import SMOTETomek
    smk = SMOTETomek(ratio = 1)
    x_res, y_res = smk.fit_sample(x,y)

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-54-426e8b86623d> in <module>()
          1 from imblearn.combine import SMOTETomek
          2 smk = SMOTETomek(ratio = 1)
    ----> 3 x_res, y_res = smk.fit_sample(x,y)

    2 frames
    /usr/local/lib/python3.6/dist-packages/imblearn/utils/_validation.py in _sampling_strategy_float(sampling_strategy, y, sampling_type)
        311     if type_y != 'binary':
        312         raise ValueError(
    --> 313             '"sampling_strategy" can be a float only when the type '
        314             'of target is binary. For multi-class, use a dict.')
        315     target_stats = _count_class_sample(y)

    ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict.

2 Answers:

Answer 0 (score: 1)

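You can look at the actual SMOTE implementation: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/utils/_validation.py#L355

You just need to pass a dict, as the error message says. That said, the SMOTE algorithm handles the multi-class setting internally.

To do that: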

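    from imblearn.over_sampling import SMOTE  # module is over_sampling, not oversampling
    smote = SMOTE(sampling_strategy="minority")  # resamples only the smallest class
    X, Y = smote.fit_resample(x_train, y_train)  # fit_resample replaces the older fit_sample

The documentation for `sampling_strategy` describes the dict form: the keys correspond to the targeted classes, and the values correspond to the desired number of samples for each targeted class.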
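For the dict form, something like the following might work with the class counts in the question. This is a minimal sketch, not a tested recipe for that dataset: `build_sampling_strategy` and `resample_multiclass` are hypothetical helper names, and the choice to bring every smaller class up to the size of the largest one is an assumption. `k_neighbors=3` is chosen because the smallest class in the question has only 4 rows, and SMOTE needs more rows in a class than neighbours.

```python
from collections import Counter


def build_sampling_strategy(y, target_count=None):
    """Map each under-represented class to the sample count it should reach.

    Hypothetical helper: by default every smaller class is raised to the
    size of the largest class, which is left out of the dict (untouched).
    """
    counts = Counter(y)
    if target_count is None:
        target_count = max(counts.values())
    return {label: target_count for label, n in counts.items() if n < target_count}


def resample_multiclass(X, y, k_neighbors=3, random_state=0):
    """Oversample with SMOTE using a dict strategy, then clean with Tomek links."""
    # Imported here so the helper above stays usable without imbalanced-learn.
    from imblearn.combine import SMOTETomek
    from imblearn.over_sampling import SMOTE

    strategy = build_sampling_strategy(y)
    # k_neighbors must be smaller than the size of the smallest resampled class.
    smote = SMOTE(
        sampling_strategy=strategy,
        k_neighbors=k_neighbors,
        random_state=random_state,
    )
    sampler = SMOTETomek(smote=smote, random_state=random_state)
    return sampler.fit_resample(X, y)
```

Usage would be `x_res, y_res = resample_multiclass(x, y)`. Because the Tomek-link step removes some samples after oversampling, the resulting class counts will be close to equal rather than exactly equal.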

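Answer 1 (score: 0)

I think you should keep the target variable at its original proportions. SMOTE can give you better results on the test dataset, but the model may then fail on new data entered by users (real-time data).

Whether to apply SMOTE is up to you. You can use the following code: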

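    from imblearn.over_sampling import SMOTE  # module is over_sampling, not oversampling
    smote = SMOTE(sampling_strategy="minority")  # resamples only the smallest class
    X, Y = smote.fit_resample(x_train_data, y_train_data)  # fit_resample replaces the older fit_sample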
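One common way to address the real-data concern is to split first and oversample only the training portion, so the held-out data keeps its original class proportions. The sketch below assumes scikit-learn and imbalanced-learn are installed. `safe_k_neighbors` and `split_then_oversample` are hypothetical names, and the 80/20 stratified split is an assumption, not something from the question.

```python
from collections import Counter


def safe_k_neighbors(y, requested=5):
    """Largest usable k_neighbors: SMOTE needs more samples in a class than neighbours."""
    smallest = min(Counter(y).values())
    if smallest < 2:
        raise ValueError("SMOTE needs at least 2 samples in every resampled class")
    return min(requested, smallest - 1)


def split_then_oversample(X, y, test_size=0.2, random_state=0):
    """Return an oversampled training set and an untouched test set."""
    # Imported here so safe_k_neighbors works without these packages.
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Stratify so every class appears in both parts in its original proportion.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    # Default sampling_strategy resamples every class except the majority.
    smote = SMOTE(k_neighbors=safe_k_neighbors(y_train), random_state=random_state)
    X_res, y_res = smote.fit_resample(X_train, y_train)
    return X_res, y_res, X_test, y_test
```

The model is then fitted on `X_res, y_res` and evaluated on `X_test, y_test`, which still reflect the imbalance it will see on new data.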