Question

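So we are running a multinomial naive Bayes classification algorithm on a set of 15k tweets. We first break each tweet into a feature vector of words using Weka's StringToWordVector function. We then save the results to a new arff file to use as our training set. We repeat this process with another set of 5k tweets and evaluate that test set using the same model derived from our training set.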

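What we would like to do is output each sentence that Weka classified in the test set along with its classification. We can see the general information about the performance and accuracy of the algorithm (precision, recall, F-score), but we cannot see the individual sentences that Weka classified with our classifier. Is there any way to do this?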

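Another problem is that our professor will eventually give us 20k more tweets and expect us to classify this new document. We are not sure how to do this, however, because: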

All of the data we have been working with has been classified manually, both the training and test sets...
however the data we will be getting from the professor will be UNclassified... How can we 
reevaluate our model on the unclassified data if Weka requires that the attribute information must
be the same as the set used to form the model and the test set we are evaluating against?

Thanks for the help!

Answer 1

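The easiest way to accomplish these tasks is to use a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter to the classifier you prefer (J48, NaiveBayes, whatever). You always keep the original training set (unprocessed text), and the classifier is applied to new (unprocessed) tweets using the vocabulary derived by the StringToWordVector filter.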

You can see how to do this on the command line in "Command Line Functions for Text Mining in WEKA", and in a program in "A Simple Text Classifier in Java with WEKA".
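To make the idea behind this answer concrete, here is a minimal plain-Java sketch of the mechanism: the vocabulary is fixed from the training text, and raw, unlabeled tweets are then mapped onto that same vocabulary before classification, so each tweet can be printed next to its predicted class. It does not call Weka at all. The class name `FilteredNaiveBayesSketch`, its methods, the tokenizer, and the toy tweets are all assumptions made for illustration, and the scoring is a simple multinomial naive Bayes with add-one smoothing rather than Weka's own implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only (hypothetical class, not part of Weka).
final class FilteredNaiveBayesSketch {
    // word -> column index; built once from the training text and never changed afterwards
    private final Map<String, Integer> vocabulary = new HashMap<>();
    private final List<String> labels = new ArrayList<>();
    private final List<int[]> wordCounts = new ArrayList<>();   // per label: count of each vocabulary word
    private final List<Integer> docCounts = new ArrayList<>();  // per label: number of training tweets
    private int totalDocs = 0;

    // Assumed tokenizer: lower-case, split on anything that is not a letter, digit or apostrophe.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9']+");
    }

    public void train(List<String> texts, List<String> textLabels) {
        // Pass 1: derive the vocabulary from the training text only.
        for (String text : texts) {
            for (String w : tokenize(text)) {
                if (!w.isEmpty()) vocabulary.putIfAbsent(w, vocabulary.size());
            }
        }
        // Pass 2: count vocabulary words per class.
        for (int i = 0; i < texts.size(); i++) {
            int c = labels.indexOf(textLabels.get(i));
            if (c < 0) {
                labels.add(textLabels.get(i));
                wordCounts.add(new int[vocabulary.size()]);
                docCounts.add(0);
                c = labels.size() - 1;
            }
            docCounts.set(c, docCounts.get(c) + 1);
            for (String w : tokenize(texts.get(i))) {
                if (!w.isEmpty()) wordCounts.get(c)[vocabulary.get(w)]++;
            }
        }
        totalDocs = texts.size();
    }

    // Classifies raw text; no class label is needed on the input.
    public String classify(String rawText) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < labels.size(); c++) {
            int total = 0;
            for (int n : wordCounts.get(c)) total += n;
            double score = Math.log((double) docCounts.get(c) / totalDocs);
            for (String w : tokenize(rawText)) {
                Integer idx = vocabulary.get(w);
                if (idx == null) continue; // words outside the training vocabulary are ignored
                score += Math.log((wordCounts.get(c)[idx] + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return labels.get(best);
    }

    public static void main(String[] args) {
        FilteredNaiveBayesSketch model = new FilteredNaiveBayesSketch();
        model.train(
            Arrays.asList("great phone love it", "awful service never again",
                          "love this great day", "awful awful traffic"),
            Arrays.asList("pos", "neg", "pos", "neg"));
        // Print every unlabeled tweet together with its predicted class.
        for (String tweet : Arrays.asList("what a great day", "never seen such awful traffic")) {
            System.out.println(model.classify(tweet) + "\t" + tweet);
        }
    }
}
```

Because the word-to-index mapping is created once during training and only looked up afterwards, new tweets never change the attribute set, which is the property the question is worried about.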

How to output resultant documents from Weka text classification
