Question

My data is imbalanced: M makes up 80% and F makes up 20%. Here is a sample of the data:

NAME     COUNTRY  HEIGHT  HANDPHONE TYPE    GENDER
NOVI     USA      160     samsung SM-G610F  F
JOHN     JAPAN    181     vivo 1718         M
RICHARD  UK       175     samsung SM-G532G  M
ANTHONY  UK       179     OPPO F1fw         M
SAMUEL   UK       185     Iphone 8 plus     M
BUNGA    KOREA    170     Iphone 6s         F

So I want to use SMOTENC to balance the data to an M:F ratio of 50%:50%. I have tried the following script, but I keep getting an error (could not convert string to float):

import numpy as np
import pandas as pd
import scipy.stats as stats
import sklearn
import keras
import imblearn
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

df=pd.read_excel('Data for oversampling.xlsx')
Data = df
Data.GENDER.replace({'M':0,'F':1},inplace=True)
sns.countplot(x='GENDER', data=Data)
y = Data.GENDER
x = Data.drop('GENDER', axis=1)

from imblearn.over_sampling import SMOTENC
smote_nc = SMOTENC(categorical_features=[0,3], random_state=0)
x_resampled, y_resampled = smote_nc.fit_resample(x, y)

Can anyone help?

Answer 1

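In your dataset, every feature is categorical except feature 2 (HEIGHT), which is the only non-categorical one. SMOTENC treats any column not listed in categorical_features as continuous and tries to convert it to float, which fails on string values. You need to update the categorical_features list so that it covers every categorical column.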
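
As a rough sketch of that fix, the snippet below derives the categorical column indices from the sample rows instead of hard-coding them. It assumes HANDPHONE TYPE is a single column, so the features after dropping GENDER are NAME, COUNTRY, HEIGHT and HANDPHONE TYPE. The helper names categorical_indices and oversample are made up for illustration, and k_neighbors=1 is only chosen because the six-row sample has just two F rows; on the real data the default may be fine.

```python
def categorical_indices(rows):
    """Return the indices of columns that contain any non-numeric value."""
    n_cols = len(rows[0])
    return [i for i in range(n_cols)
            if any(not isinstance(row[i], (int, float)) for row in rows)]


# Feature rows from the question's sample, with GENDER already dropped.
# Assumption: HANDPHONE TYPE is one column.
features = [
    ["NOVI", "USA", 160, "samsung SM-G610F"],
    ["JOHN", "JAPAN", 181, "vivo 1718"],
    ["RICHARD", "UK", 175, "samsung SM-G532G"],
    ["ANTHONY", "UK", 179, "OPPO F1fw"],
    ["SAMUEL", "UK", 185, "Iphone 8 plus"],
    ["BUNGA", "KOREA", 170, "Iphone 6s"],
]
labels = [1, 0, 0, 0, 0, 1]  # same encoding as the question: M -> 0, F -> 1

cat_idx = categorical_indices(features)  # NAME, COUNTRY, HANDPHONE TYPE


def oversample(features, labels, cat_idx, k_neighbors=1, random_state=0):
    """Hypothetical wrapper: run SMOTENC with every categorical column listed."""
    # Imported lazily so the helper above can be used without these packages.
    import pandas as pd
    from imblearn.over_sampling import SMOTENC

    x = pd.DataFrame(features)
    smote_nc = SMOTENC(categorical_features=cat_idx,
                       k_neighbors=k_neighbors,
                       random_state=random_state)
    return smote_nc.fit_resample(x, labels)
```

Calling oversample(features, labels, cat_idx) should then run without the string-to-float conversion error, since only HEIGHT is left for SMOTENC to treat as continuous.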

SMOTENC: could not convert string to float
