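Oversampling for text classification in Python?

Time: 2020-05-30 07:03:57

Tags: python machine-learning scikit-learn oversampling

I have a DataFrame of text that I want to classify, but I need to oversample it first. Sample data is below: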

import pandas as pd

df = pd.DataFrame({
    'Features': ['I am going to class today'] * 10 + ['I am not going to class today'] * 4,
    'Class': ['Positive'] * 10 + ['Negative'] * 4,
})
df
    Features                        Class
0   I am going to class today       Positive
1   I am going to class today       Positive
2   I am going to class today       Positive
3   I am going to class today       Positive
4   I am going to class today       Positive
5   I am going to class today       Positive
6   I am going to class today       Positive
7   I am going to class today       Positive
8   I am going to class today       Positive
9   I am going to class today       Positive
10  I am not going to class today   Negative
11  I am not going to class today   Negative
12  I am not going to class today   Negative
13  I am not going to class today   Negative

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_over, y_over = oversample.fit_resample(df['Features'], df['Class'])
# summarize class distribution
print(Counter(y_over))

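But this does not work and raises ValueError: Expected 2D array, got 1D array instead. How can I oversample this data?

1 answer:

Answer 0: (score: 0)

I found the problem: I needed to reshape the data. fit_resample expects X to be 2D with shape (n_samples, n_features), but df['Features'] is a 1D Series, so reshape(-1, 1) turns it into a single-column 2D array.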

X_over, y_over = oversample.fit_resample(df['Features'].values.reshape(-1,1), df['Class'])
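
To see what the reshape changes, here is a minimal sketch using NumPy only. The `texts` array is an assumed stand-in for `df['Features'].values`, built from the same 14 sentences as the question; it is not part of the original post.

```python
import numpy as np

# Assumed stand-in for df['Features'].values: a 1D array of 14 strings.
texts = np.array(["I am going to class today"] * 10
                 + ["I am not going to class today"] * 4)

# 1D: a single axis, which is the shape the ValueError complains about.
print(texts.shape)  # (14,)

# reshape(-1, 1) keeps every row and adds one feature column;
# -1 asks NumPy to infer the number of rows.
X = texts.reshape(-1, 1)
print(X.shape)  # (14, 1)
```

Each row of `X` is now a one-element array holding a sentence, which matches the (n_samples, n_features) layout.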

It works now.

Counter({'Positive': 10, 'Negative': 10})
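
For intuition about where the extra 'Negative' rows come from, here is a rough pure-Python sketch of random oversampling of the minority class. The helper `oversample_minority` is a hypothetical name made up for this illustration; it is not imbalanced-learn's implementation, only an approximation of the idea of duplicating randomly chosen minority rows until the classes are balanced.

```python
import random
from collections import Counter

def oversample_minority(features, labels, seed=0):
    """Hypothetical helper: duplicate random minority-class rows
    until the minority class is as large as the biggest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)   # label with the fewest rows
    target = max(counts.values())            # size of the largest class
    minority_rows = [f for f, y in zip(features, labels) if y == minority]
    # Draw with replacement from the minority rows to fill the gap.
    extra = [rng.choice(minority_rows) for _ in range(target - counts[minority])]
    return list(features) + extra, list(labels) + [minority] * len(extra)

features = ["I am going to class today"] * 10 + ["I am not going to class today"] * 4
labels = ["Positive"] * 10 + ["Negative"] * 4

X_over, y_over = oversample_minority(features, labels)
print(Counter(y_over))  # Counter({'Positive': 10, 'Negative': 10})
```

Since the new rows are exact copies of existing ones, it is usually safer to oversample only the training split, so that duplicates do not end up in the data used for evaluation.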