Question

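I am using an SVM classifier to categorize articles, but I am confused about why the model's accuracy drops as the number of samples increases. Below are the accuracy figures I recorded, each at the best value of the hyperparameter C for that sample size.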

Can anyone help me improve the accuracy further as the number of samples grows?

Task being performed:

Classify news articles into categories such as sports, politics, and entertainment.

Sample code:

import random

import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# dummy data (in the real problem this is a big list of [id, label, text] rows)
glbl_list = [['samples_id','sample_prediction','sample_article'],['samples_id','sample_prediction','sample_article']]

# shuffle the whole data set in place
random.shuffle(glbl_list)

# create a DataFrame from the list above
Corpus = pd.DataFrame(glbl_list, columns=['id', 'label', 'text'])

# perform various data cleaning tasks (stop-word removal, punctuation removal, etc.)

# convert the article text to TF-IDF features
tfidf = TfidfVectorizer()
x = tfidf.fit_transform(Corpus['text'].values)

# encode the category labels as integers
encoder = LabelEncoder()
y = encoder.fit_transform(Corpus['label'])

# hold out 10% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

# SVM with a linear kernel
clf = svm.SVC(C=1, kernel='linear')
clf.fit(x_train, y_train)

print("Train accuracy: {}".format(clf.score(x_train, y_train)))
print("Test accuracy: {}".format(clf.score(x_test, y_test)))
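The code above fits the TfidfVectorizer on the whole corpus before splitting, so the test articles influence the vocabulary and IDF weights, and the split is not stratified by category. A minimal sketch of a variant that avoids both, assuming scikit-learn is available: the toy corpus and the `evaluate` helper are hypothetical, and `LinearSVC` is swapped in for `svm.SVC(kernel='linear')` (it uses a different solver and loss, so scores may differ slightly).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the real data would come from Corpus['text'] and Corpus['label'].
texts = [
    "the team won the football match", "the striker scored two goals",
    "the coach praised the players", "the club signed a new goalkeeper",
    "parliament passed the new budget", "the minister resigned after the vote",
    "the senate debated the election law", "the president signed the treaty",
    "the actor starred in a new film", "the singer released a new album",
    "the director won an award for the movie", "the band announced a concert tour",
]
labels = ["sports"] * 4 + ["politics"] * 4 + ["entertainment"] * 4


def evaluate(texts, labels, C=1.0, test_size=0.25, seed=0):
    """Return (train accuracy, test accuracy) for a TF-IDF + linear SVM pipeline."""
    # stratify keeps the category proportions the same in both splits
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed)
    # The pipeline fits TF-IDF on the training texts only, then trains the SVM
    model = make_pipeline(TfidfVectorizer(), LinearSVC(C=C))
    model.fit(x_train, y_train)
    return model.score(x_train, y_train), model.score(x_test, y_test)


train_acc, test_acc = evaluate(texts, labels)
print("Train accuracy: {:.3f}".format(train_acc))
print("Test accuracy: {:.3f}".format(test_acc))
```

Fixing `random_state` makes runs at different sample sizes comparable, since each one otherwise draws a different random 10% test set.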

Recorded accuracy (at the best C value for each sample size):
Samples: 500
C = 1
Train_Accuracy: 0.9933333333333333
Test_Accuracy: 0.98

Samples: 1000
C = 2.5
Train_Accuracy: 0.9833333333333333
Test_Accuracy: 0.9

Samples: 5000
C = 1
Train_Accuracy: 0.8533333333333334
Test_Accuracy: 0.81

Samples: 8000
C = 1
Train_Accuracy: 0.89833333333333333
Test_Accuracy: 0.8475

Samples: 10000
C = 1
Train_Accuracy: 0.7488888888888889
Test_Accuracy: 0.69

Samples: 12000
C = 0.6
Train_Accuracy: 0.6548148148148148
Test_Accuracy: 0.6333333333333333
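Because the training accuracy falls together with the test accuracy, one thing worth ruling out is that the larger samples contain more categories or a different class balance than the smaller ones. A minimal sketch of that check, using only the standard library: the `label_summary` helper and the toy label lists are hypothetical, not taken from the real data.

```python
from collections import Counter


def label_summary(labels):
    """Summarize the label distribution of one sample set."""
    counts = Counter(labels)
    total = len(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return {
        "n_samples": total,
        "n_classes": len(counts),
        "majority_label": majority_label,
        # accuracy of always predicting the most frequent category
        "majority_baseline": majority_count / total,
    }


# Hypothetical label lists standing in for Corpus['label'] at two sample sizes
small = ["sports"] * 6 + ["politics"] * 4
large = ["sports"] * 6 + ["politics"] * 4 + ["entertainment"] * 5 + ["business"] * 5

for name, subset in [("small", small), ("large", large)]:
    print(name, label_summary(subset))
```

If the number of classes grows or the majority baseline shrinks with sample size, the accuracy figures at different sizes are not measuring the same problem and are not directly comparable.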

How to improve model accuracy when using a classification algorithm for text classification

0 answers: