Question

I am using libSVM (the C++ version) and trying to figure out what I am doing wrong.

Take two lines from my raw text file as an example:

positive    Wonderful film that mixes documentary and fiction in a way that makes the spectator question: what is the extent of truth in documentary films or is there such a thing as an objective documentary.

negative    An unfunny, unworthy picture which is an undeserving end to Peter Sellers' career. It is a pity this movie was ever made.

From this I convert each line into what I believe is the correct libSVM format: label feature:value feature:value ...

My labels are 0 (positive) and 1 (negative):

0 169:1 192:1 265:1 453:1 510:1 632:1 889:1 1411:3 2092:1 3118:1 3393:1 19363:1 19976:1 26895:1

1 51:1 241:1 864:1 1273:1 1768:1 14912:1 18807:1 25011:1 30846:1
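A quick way to sanity-check lines like the two above is to parse them back. The following is a minimal sketch in Python (chosen only for brevity, it is not part of my libSVM setup); `parse_libsvm_line` is a hypothetical helper name. It relies on the libSVM convention that feature indices appear in ascending order within a line.

```python
def parse_libsvm_line(line):
    """Split one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])
    features = {}
    previous_index = 0
    for token in parts[1:]:
        index_text, value_text = token.split(":")
        index, value = int(index_text), float(value_text)
        # Reject indices that are below 1 or not strictly ascending.
        if index <= previous_index:
            raise ValueError(f"index {index} is not in ascending order")
        features[index] = value
        previous_index = index
    return label, features


# The positive example line from the question.
positive = ("0 169:1 192:1 265:1 453:1 510:1 632:1 889:1 1411:3 "
            "2092:1 3118:1 3393:1 19363:1 19976:1 26895:1")
label, features = parse_libsvm_line(positive)
print(label, len(features), features[1411])
```

Running the same check over the whole training file would at least rule out malformed lines as the cause.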

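Going from the raw text to the processed text, I removed common stop words (a, the, ...).

As I understand it, each word is one feature.

So if this were the first sentence: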

Great movie

I would convert it to:

0 1:1 2:1

And if this were the 100th sentence:

Great actor

0 1:1 234:1

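(I am using 234 as the index of "actor" just as an example.)

As I understand it, the value is the number of times the word occurs in the sentence.

So, if I have: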

great great great

I would get:

0 1:3

That is how I do the conversion.
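To make the conversion concrete, here is a minimal sketch of it in Python. It is a simplified stand-in for my real preprocessing, not the actual code: `STOP_WORDS` is a tiny illustrative list, `to_libsvm_line` is a hypothetical helper, lowercasing is an assumption, and punctuation is not handled. Indices are handed out in order of first appearance, so "actor" would not come out as 234 here.

```python
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the one actually used.
STOP_WORDS = {"a", "the"}


def to_libsvm_line(label, sentence, vocabulary):
    """Convert one sentence to 'label index:count ...', growing the vocabulary."""
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    counts = Counter()
    for word in words:
        # A word seen for the first time gets the next free 1-based index.
        index = vocabulary.setdefault(word, len(vocabulary) + 1)
        counts[index] += 1
    # Emit index:count pairs sorted by index, as libSVM expects.
    pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {pairs}"


vocabulary = {}
first = to_libsvm_line(0, "Great movie", vocabulary)
repeated = to_libsvm_line(0, "great great great", vocabulary)
print(first)     # 0 1:1 2:1
print(repeated)  # 0 1:3
```

The same `vocabulary` dictionary has to be reused for the test file, otherwise the same word would map to different indices in training and testing.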

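I then train LibSVM on this file with the linear kernel, with C going from 1 to 10 (I tried 10 values because I expected the results to change).

In practice, I always get exactly the same confusion matrix and 51% accuracy.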
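One thing I can check from the confusion matrix is whether the model predicts a single class for every example, which on a roughly balanced set would give an accuracy near 50% regardless of C. The sketch below uses made-up labels purely for illustration; they are not my actual predictions, and `confusion_matrix` and `accuracy` are hypothetical helper names.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Return a 2x2 matrix m[true][predicted] for labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[truth][guess] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal of the matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for illustration only: every example is predicted as 0.
true_labels = [0, 0, 1, 1, 0, 1]
predicted_labels = [0, 0, 0, 0, 0, 0]

matrix = confusion_matrix(true_labels, predicted_labels)
# A column of zeros means one class is never predicted.
never_predicts_1 = matrix[0][1] + matrix[1][1] == 0
never_predicts_0 = matrix[0][0] + matrix[1][0] == 0
print(matrix, accuracy(matrix), never_predicts_0 or never_predicts_1)
```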

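Do I need to add more steps? Should I change the way I convert the data?

I ask because I end up with a lot of features (words): more than 35k in my training file.

Am I missing something?

I also tested with a C# implementation of LibSVM and got the same results.

The dataset is the IMDB one (http://ai.stanford.edu/~amaas/data/sentiment/).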

LibSVM format for sentiment analysis

0 answers: