Question

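I am using Google search snippets as my training and test data.

The training data contains 10,060 snippets. Each snippet is on its own line and consists of a list of words/terms followed by a class label at the end.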

There are 8 class labels:

Business,Computers,Culture-Arts-Entertainment,Education-Science,Engineering,Health,Politics-Society,Sports

Here are a few rows from the dataset:

manufacture manufacturer directory directory china taiwan products manufacturers directory- taiwan china products manufacturer direcory exporter directory supplier directory suppliers business

empmag electronics manufacturing procurement homepage electronics manufacturing procurement magazine procrement power products production essentials data management business

dfma truecost paper true cost overseas manufacture product design costs manufacturing products china manufacturing redesigned product china save business

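As you can see, the rows vary in length, but every row must have the same number of dimensions before an SVM can be used.

My idea was to use 1 if a word appears in a given row and 0 otherwise, so that each row becomes a 0/1 vector. However, that would produce far too many dimensions.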
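To make the 0/1 idea concrete, here is a minimal sketch in plain Python. It assumes the label is always the last token of a row, and the helper names (`parse_row`, `build_vocabulary`, `to_sparse_binary`) are made up for illustration. The number of dimensions equals the vocabulary size, but each row only touches a handful of them, so storing just the indices of the 1-entries keeps the representation small.

```python
def parse_row(line):
    """Split one row into (terms, label); the label is assumed to be the last token."""
    tokens = line.split()
    return tokens[:-1], tokens[-1]


def build_vocabulary(rows):
    """Map each distinct term to a column index, in order of first appearance."""
    vocab = {}
    for terms, _ in rows:
        for term in terms:
            vocab.setdefault(term, len(vocab))
    return vocab


def to_sparse_binary(terms, vocab):
    """Return the sorted column indices whose value is 1; every other column is 0."""
    return sorted({vocab[t] for t in terms if t in vocab})


# Shortened versions of two rows from the question.
lines = [
    "manufacture manufacturer directory directory china taiwan products business",
    "empmag electronics manufacturing procurement homepage business",
]
rows = [parse_row(line) for line in lines]
vocab = build_vocabulary(rows)
vectors = [to_sparse_binary(terms, vocab) for terms, _ in rows]

print(len(vocab))   # number of dimensions
print(vectors[0])   # indices of the non-zero entries in the first row
```

Many SVM implementations accept this kind of sparse input directly (LIBSVM's `index:value` file format is one example), so the large dimension count by itself is not necessarily a problem.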

My question: is there another way to preprocess the data so that an SVM can be applied effectively?

Answer 1

Before performing text classification with an SVM, you should look into term weighting and feature selection.

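The standard approach is:

1. Look into tfc term weighting. It is based on the so-called inverse document frequency multiplied by the term frequency (within the current document), followed by cosine normalization.
2. Look into feature selection based on Information Gain.
3. Transform your documents based on 1. and 2.
4. Perform text classification with the SVM.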
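Steps 1 to 3 can be sketched in plain Python as follows. This is only an illustration on toy data, not a reference implementation: the function names are invented here, the idf uses the natural log of N / document frequency, and feature selection is applied before weighting, which is one possible ordering.

```python
import math
from collections import Counter


def tfc_weights(docs):
    """Weight each term by tf * idf, then cosine-normalise each document vector."""
    n_docs = len(docs)
    doc_freq = Counter(term for terms in docs for term in set(terms))
    weighted = []
    for terms in docs:
        tf = Counter(terms)
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c) if total else 0.0


def information_gain(docs, labels, term):
    """H(class) minus the expected class entropy after splitting on term presence."""
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_term.values())
    h_class = entropy(list(Counter(labels).values()))
    h_cond = (n_with / n) * entropy(list(with_term.values())) \
        + ((n - n_with) / n) * entropy(list(without_term.values()))
    return h_class - h_cond


def select_terms(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return set(ranked[:k])


# Toy data (made up for illustration): term lists plus one label per document.
docs = [
    ["china", "manufacturer", "products"],
    ["electronics", "manufacturing", "products"],
    ["football", "league", "scores"],
    ["tennis", "scores", "match"],
]
labels = ["business", "business", "sports", "sports"]

selected = select_terms(docs, labels, k=2)
reduced = [[t for t in d if t in selected] for d in docs]
vectors = tfc_weights(reduced)  # sparse {term: weight} vectors for the SVM
print(selected, vectors[0])
```

The resulting sparse vectors have at most `k` dimensions and can then be handed to whichever SVM implementation you use for step 4.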

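I recommend the following publications for further reading. They describe the typical approaches used in the research community for SVM-based text classification: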

Joachims T. (1998) Text categorization with Support Vector Machines: Learning with many relevant features. In: Nédellec C., Rouveirol C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol 1398. Springer, Berlin, Heidelberg
Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning (ICML), 1997.
G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
