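I am using an SVM classifier to categorize articles, but I am confused about why the model's accuracy drops as the number of samples increases. Below are the accuracy figures I captured with different values of the hyperparameter C.
Can anyone help me improve the accuracy further as the number of samples grows?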
Task being performed:
Classifying news articles into categories such as sports, politics, and entertainment.
Sample code:
import random
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# dummy data (in the real problem this is a big list)
glbl_list = [['samples_id', 'sample_prediction', 'sample_article'], ['samples_id', 'sample_prediction', 'sample_article']]
# shuffle the whole data set in place
random.shuffle(glbl_list)
# create a DataFrame from the list above
Corpus = pd.DataFrame(glbl_list, columns=['id', 'label', 'text'])
# perform data cleaning here (stop-word removal, punctuation removal, etc.)
# turn the article text into TF-IDF features
tfidf = TfidfVectorizer()
x = tfidf.fit_transform(Corpus['text'].values)
# encode the category labels as integers
encoder = LabelEncoder()
y = encoder.fit_transform(Corpus['label'])
# hold out 10% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
clf = svm.SVC(C=1, kernel='linear')
clf.fit(x_train, y_train)
print("Train accuracy: {}".format(clf.score(x_train, y_train)))
print("Test accuracy: {}".format(clf.score(x_test, y_test)))
Captured accuracy (at the best C value for each sample count):
Samples: 500
C = 1
Train_Accuracy: 0.9933333333333333
Test_Accuracy: 0.98
Samples: 1000
C = 2.5
Train_Accuracy: 0.9833333333333333
Test_Accuracy: 0.9
Samples: 5000
C = 1
Train_Accuracy: 0.8533333333333334
Test_Accuracy: 0.81
Samples: 8000
C = 1
Train_Accuracy: 0.89833333333333333
Test_Accuracy: 0.8475
Samples: 10000
C = 1
Train_Accuracy: 0.7488888888888889
Test_Accuracy: 0.69
Samples: 12000
C = 0.6
Train_Accuracy: 0.6548148148148148
Test_Accuracy: 0.6333333333333333
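For reference, figures like the ones listed here could be collected with a loop along the following lines. This is only a minimal sketch, not the exact script used for the numbers in this post: the function name `accuracy_by_sample_count`, the toy two-category corpus, the sample sizes, and the fixed seed are all assumptions made for illustration. It also differs from the sample code in two ways that are my own choices: it fits the TF-IDF vocabulary on the training split only and uses a stratified split.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def accuracy_by_sample_count(texts, labels, sizes, c_values, seed=0):
    """Return {size: (best_c, train_acc, test_acc)} for each requested size."""
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    rng.shuffle(indices)  # one fixed shuffle, so smaller sizes are subsets of larger ones
    results = {}
    for size in sizes:
        subset = indices[:size]
        sub_texts = [texts[i] for i in subset]
        sub_labels = [labels[i] for i in subset]
        # Stratified 90/10 split keeps the category proportions equal in both parts.
        train_txt, test_txt, y_train, y_test = train_test_split(
            sub_texts, sub_labels, test_size=0.1,
            stratify=sub_labels, random_state=seed)
        # Fit TF-IDF on the training split only, then reuse it for the test split.
        tfidf = TfidfVectorizer()
        x_train = tfidf.fit_transform(train_txt)
        x_test = tfidf.transform(test_txt)
        best = None
        for c in c_values:
            clf = SVC(C=c, kernel='linear').fit(x_train, y_train)
            row = (c, clf.score(x_train, y_train), clf.score(x_test, y_test))
            # "Best C" here means highest test accuracy, which is an optimistic choice.
            if best is None or row[2] > best[2]:
                best = row
        results[size] = best
    return results


# Toy corpus (hypothetical): 40 "sports" and 40 "politics" articles.
texts = (["football match goal team player number%d" % i for i in range(40)]
         + ["election vote parliament minister number%d" % i for i in range(40)])
labels = ["sports"] * 40 + ["politics"] * 40

results = accuracy_by_sample_count(texts, labels, sizes=[40, 80],
                                   c_values=[0.6, 1, 2.5])
for size, (c, train_acc, test_acc) in results.items():
    print("Samples: {}  C = {}  Train_Accuracy: {}  Test_Accuracy: {}".format(
        size, c, train_acc, test_acc))
```

Because the shuffle and the split are seeded, rerunning the loop gives the same split for each size, which makes it easier to tell whether a change in accuracy comes from the extra samples or from a different random split.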