Question

My dataset is imbalanced. It contains the following messages: 34.34% negative, 63.95% neutral, and 1.71% positive.

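The solution I chose is oversampling with SMOTE. SMOTE does not work on a bare list of messages (like the corpus in the copied code): it needs vectors. So, before applying SMOTE, I applied TfidfVectorizer.fit_transform to the corpus. I now have X and y with shapes (5850, 41369) and (5850,), respectively.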

Applying SMOTE then produces X_smote and y_smote. My class distribution is now balanced: Counter({-1: 3741, 0: 3741, 1: 3741})

On line 17 of the copied code,

X = tvect.fit_transform(corpus)

In my case, what should the corpus variable be? After loading or creating the model, which variables should the fit function be applied to? (classifier.fit(X_smote, y_smote))?

And what should I save so that the next time I open a new notebook, I can call my_model.predict() on a new message?
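One possible approach, offered as a minimal sketch rather than a definitive answer: persist both the fitted vectorizer and the fitted classifier, because a new message has to be transformed with the same vocabulary the classifier was trained on. The sketch assumes joblib for persistence; the file names, toy messages, and labels are hypothetical. If SMOTE is used, the classifier would be fitted on X_smote and y_smote instead of X and the raw labels, while the vectorizer is still fitted on the original messages.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data standing in for the real messages and labels.
train_messages = [
    "terrible service and long delay",
    "awful support, very bad",
    "order shipped on monday",
    "invoice attached to this email",
    "great product, love it",
    "excellent and fast support",
]
train_labels = [-1, -1, 0, 0, 1, 1]

# Training notebook: fit the vectorizer on the raw messages,
# then fit the classifier on the resulting vectors.
tvect = TfidfVectorizer()
X = tvect.fit_transform(train_messages)
classifier = LinearSVC()
classifier.fit(X, train_labels)  # with SMOTE: classifier.fit(X_smote, y_smote)

# Save BOTH fitted objects (hypothetical file names, temp directory for the demo).
out_dir = tempfile.mkdtemp()
vect_path = os.path.join(out_dir, "tfidf_vectorizer.joblib")
clf_path = os.path.join(out_dir, "classifier.joblib")
joblib.dump(tvect, vect_path)
joblib.dump(classifier, clf_path)

# New notebook: load both objects, then transform (not fit_transform) and predict.
loaded_tvect = joblib.load(vect_path)
loaded_classifier = joblib.load(clf_path)
new_message = ["the support was excellent"]
prediction = loaded_classifier.predict(loaded_tvect.transform(new_message))
print(prediction)
```

Calling fit_transform again in the new notebook would build a different vocabulary, so only transform is used on new messages.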

Copied code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

corpus = [
    "I am super good with Java and JEE",
    "I am super good with .NET and C#",
    "I am really good with Python and R",
    "I am really good with C++ and pointers"
    ]

classes = ["java developer", ".net developer", "data scientist", "C++ developer"]

test = ["I think I'm a good developer with really good understanding of .NET"]

tvect = TfidfVectorizer(min_df=1, max_df=1)

X = tvect.fit_transform(corpus)

classifier = LinearSVC()
classifier.fit(X, classes)

X_test=tvect.transform(test)
print(classifier.predict(X_test)) # prints ['.net developer']

How to save an NLP classification model for later use

0 answers: