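I'm trying to do text classification in Weka, but I'm running into a lot of problems getting the test set to work. Here is my training set (it's very short, since I've only just started learning Weka!):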
@relation sentiment
@attribute phrase string
@attribute value {pos, neg}
@data
'That was really unlucky', neg
'The car crashed horribly', neg
'The culpirit got away',neg
'Fortunally everyone made it out', pos
'She was glad noone was hurt',pos
'And the sun was at least shining',pos
I then apply StringToWordVector to the set, followed by NumericToBinary. This is the resulting training set:
@relation 'sentiment-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"-weka.filters.unsupervised.attribute.NumericToBinary'
@attribute value {pos,neg}
@attribute And_binarized {0,1}
@attribute Fortunally_binarized {0,1}
@attribute She_binarized {0,1}
@attribute at_binarized {0,1}
@attribute everyone_binarized {0,1}
@attribute glad_binarized {0,1}
@attribute hurt_binarized {0,1}
@attribute it_binarized {0,1}
@attribute least_binarized {0,1}
@attribute made_binarized {0,1}
@attribute noone_binarized {0,1}
@attribute out_binarized {0,1}
@attribute shining_binarized {0,1}
@attribute sun_binarized {0,1}
@attribute the_binarized {0,1}
@attribute was_binarized {0,1}
@attribute That_binarized {0,1}
@attribute The_binarized {0,1}
@attribute away_binarized {0,1}
@attribute car_binarized {0,1}
@attribute crashed_binarized {0,1}
@attribute culpirit_binarized {0,1}
@attribute got_binarized {0,1}
@attribute horribly_binarized {0,1}
@attribute really_binarized {0,1}
@attribute unlucky numeric
@data
{0 neg,16 1,17 1,25 1,26 1}
{0 neg,18 1,20 1,21 1,24 1}
{0 neg,18 1,19 1,22 1,23 1}
{2 1,5 1,8 1,10 1,12 1}
{3 1,6 1,7 1,11 1,16 1}
{1 1,4 1,9 1,13 1,14 1,15 1,16 1}
Now I'm moving on to the test set, which is:
@relation sentiment
@attribute phrase string
@data
'That was really unlucky'
'The car crashed horribly'
'The culpirit got away'
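I expect Weka to classify these phrases as negative (neg). To make the two sets compatible, I apply the same filters as for the training set (StringToWordVector and NumericToBinary). This is the resulting test set: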
@relation 'sentiment-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"-weka.filters.unsupervised.attribute.NumericToBinary'
@attribute That_binarized {0,1}
@attribute The_binarized {0,1}
@attribute away_binarized {0,1}
@attribute car_binarized {0,1}
@attribute crashed_binarized {0,1}
@attribute culpirit_binarized {0,1}
@attribute got_binarized {0,1}
@attribute horribly_binarized {0,1}
@attribute really_binarized {0,1}
@attribute unlucky_binarized {0,1}
@attribute was numeric
@data
{0 1,8 1,9 1,10 1}
{1 1,3 1,4 1,7 1}
{1 1,2 1,5 1,6 1}
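However, Weka gives me an error saying that the training set and the test set are not compatible, and I can't figure out why. Intuitively, this looks like something Weka should be able to understand.
Thanks for your help!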
Answer 0 (score: 1)
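Your training and test sets must have the same header. Right now they differ: the filtered training set has 27 attributes, including the class attribute value, while the filtered test set has only 11 and no class attribute, because StringToWordVector built a separate word list from each file.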
See the following link for an example of text classification. Here is another link that shows other ways to solve this problem.
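As a minimal sketch of what "the same header" could look like for the raw test file: it declares the same two attributes as the training file, and the unknown class is written as `?`, the ARFF marker for a missing value. This is an illustration built from the data in the question, not output produced by Weka.

```
@relation sentiment
@attribute phrase string
@attribute value {pos, neg}
@data
'That was really unlucky', ?
'The car crashed horribly', ?
'The culpirit got away', ?
```

Matching the raw headers is only half of the fix. If StringToWordVector is then run on each file separately, it will still build a different word list for each one. One way to avoid that, assuming a standard Weka installation, is to wrap the filters and the classifier in `weka.classifiers.meta.FilteredClassifier`, which builds the filter from the training data and applies it unchanged to the test data. Another is the batch mode of the filters on the command line (`-b`, with `-i`/`-o` for the training file and `-r`/`-s` for the test file).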