Question

I am doing text classification and my data is highly imbalanced, for example:

Category | Total Records
Cate1    | 950
Cate2    |  40
Cate3    |  10

Now I want to oversample Cate2 and Cate3 so that each has at least 400-500 records. I would prefer SMOTE over random sampling. Code:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(fewRecords['text'],
                                   fewRecords['category'])

sm = SMOTE(random_state=12, ratio = 1.0)
x_train_res, y_train_res = sm.fit_sample(X_train, y_train)

This does not work, because SMOTE cannot generate synthetic samples from raw text. When I convert the text into vectors like this:

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(fewRecords['category'])

# transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.transform(X_train)
ytrain_train =  count_vect.transform(y_train)

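I am not sure whether this is the right approach when I want to predict the real category after classification, or how to convert the vectors back into real text.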

Answer 1

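I know this question was asked more than 2 years ago, and I hope you found a solution. In case you are still interested, this can be done easily with an imblearn pipeline.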

I will assume you are using a sklearn-compatible estimator for the classification, say Multinomial Naive Bayes.

Note that I import Pipeline from imblearn, not from sklearn. The sklearn Pipeline does not accept samplers such as SMOTE as steps.

from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

Import SMOTE as you did in your code

from imblearn.over_sampling import SMOTE

Do the train-test split as you did in your code, this time stratified by category

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    fewRecords['text'],
    fewRecords['category'],
    stratify=fewRecords['category'],
    random_state=0,
)

Create a pipeline with SMOTE as one of its components

textclassifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('smote', SMOTE(random_state=12)),
    ('mnb', MultinomialNB(alpha=0.1)),
])

Train the classifier on the training data

textclassifier.fit(X_train, y_train)

You can then use this classifier for any task, including evaluating the classifier itself, predicting new observations, and so on.

For example, predicting a new sample

textclassifier.predict(['sample text'])

will return the predicted category.

To get a more accurate model, try word vectors as features or, more conveniently, perform hyperparameter optimization on the pipeline.

Answer 2

You first need to convert the text documents into fixed-length numeric vectors, and then do whatever you want with them. Try LDA or Doc2Vec.

SMOTE, oversampling for text classification in Python
