Question

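Suppose I am building a classifier in Java that classifies comments as spam or not spam. The dataset is simple: it has two attributes, a string comment and a nominal class.

Now I need to filter my training dataset with the StringToWordVector filter. My first problem is the test dataset: if it is filtered on its own, its attributes will differ from those of the training set. I did some research and found that I can use batch filtering, like this: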

    StringToWordVector filter = new StringToWordVector();
    // Here I will set the options; I would be using tf-idf and some others
    filter.setInputFormat(TrainingData);

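Is this approach correct? With this filter the two datasets should be compatible, but are they filtered in the right way? I am worried that the tf-idf values of the test set will be affected in a way that reduces accuracy.

Now my main question: how do I use my classifier in practice? In practice I will get a single comment as a string. I suppose I would turn it into an instance, but how do I filter it to get the tf-idf values needed to classify it? I thought maybe I could add the comment to the original training dataset and recompute tf-idf every time, but is that how it is done in practice?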

Answer 1

I will answer the question using a text classification task other than spam classification.

Say I have the following training data:

"The US government had imposed extra taxes on crude oil", petroleum
"The German manufacturers are observing different genes of Canola oil", non-petroleum

and the following test data:

"Canada is famous for producing quality corn oil", ?

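Now, suppose you are going to use Naive Bayes together with the StringToWordVector filter. If you apply the filter to the training data and the test data separately, you will get two very different word vectors. Every term in the training data and in the test data becomes a feature, so you will get an error like "Training and test data are not compatible". The solution is therefore to use a FilteredClassifier, which takes both a classifier (Naive Bayes in our case) and a filter (StringToWordVector in our case). You will need something like the following: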

    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // Fields of the enclosing class. tokenizer, stopwords, trainingData and
    // testData are assumed to be defined elsewhere.
    private NaiveBayes nb;
    private FilteredClassifier fc;
    private StringToWordVector filter;
    private double[] clsLabel;

    // Configure the filter --->
    filter = new StringToWordVector();
    filter.setTokenizer(tokenizer);
    filter.setWordsToKeep(1000000);
    filter.setDoNotOperateOnPerClassBasis(true);
    filter.setLowerCaseTokens(true);
    filter.setTFTransform(true);
    filter.setIDFTransform(true);
    filter.setStopwords(stopwords);

    filter.setInputFormat(trainingData);
    // <--- filter configured

    // Configure the classifier --->
    fc = new FilteredClassifier();
    nb = new NaiveBayes();
    fc.setFilter(filter);
    fc.setClassifier(nb);
    // <--- classifier configured

    // Train the filter and the classifier together on the training data
    fc.buildClassifier(trainingData);

    // Classification --->
    clsLabel = new double[testData.numInstances()]; // holds the class label of each test document
    for (int i = 0; i < testData.numInstances(); i++) {
        try {
            clsLabel[i] = fc.classifyInstance(testData.instance(i));
            // Store the label only when classification succeeded
            testData.instance(i).setClassValue(clsLabel[i]);
        } catch (Exception e) {
            System.out.println("Error from Classification.classify(). Cannot classify instance " + i);
        }
    }
    // <--- classification ends

N.B. The TF-IDF computation is done separately for the training data and the test data.
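
To picture how a single new comment can get tf-idf values without being added to the training set, here is a minimal plain-Java sketch. It is not Weka's implementation: the class `TfIdfSketch` and its methods `fit`, `transform` and `column` are hypothetical names made up for this illustration, and it assumes raw term counts weighted by idf = log(N / df). The idea it shows is that the vocabulary and the document frequencies are learned once from the training documents, and any later document is mapped onto those same columns.

```java
import java.util.*;

/** Hypothetical helper, not part of Weka: vocabulary and IDF come from training text only. */
class TfIdfSketch {
    private final Map<String, Integer> index = new LinkedHashMap<>(); // word -> column
    private double[] idf = new double[0];

    /** Lower-cases the text and splits it on anything that is not a letter a-z. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Builds the vocabulary and the document frequencies from the training documents. */
    void fit(List<String> trainingDocs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : trainingDocs) {
            // Each distinct word counts once per document
            for (String w : new LinkedHashSet<>(tokenize(doc))) {
                index.putIfAbsent(w, index.size());
                docFreq.merge(w, 1, Integer::sum);
            }
        }
        idf = new double[index.size()];
        for (Map.Entry<String, Integer> e : index.entrySet()) {
            idf[e.getValue()] = Math.log((double) trainingDocs.size() / docFreq.get(e.getKey()));
        }
    }

    /** Maps any document, including an unseen one, onto the training columns. */
    double[] transform(String doc) {
        double[] vec = new double[index.size()];
        for (String w : tokenize(doc)) {
            Integer col = index.get(w);
            if (col != null) vec[col] += 1.0; // words never seen in training are dropped
        }
        for (int i = 0; i < vec.length; i++) vec[i] *= idf[i];
        return vec;
    }

    /** Column of a word in the training vocabulary, or -1 if it is unknown. */
    int column(String word) {
        return index.getOrDefault(word, -1);
    }

    public static void main(String[] args) {
        TfIdfSketch model = new TfIdfSketch();
        model.fit(Arrays.asList(
                "The US government had imposed extra taxes on crude oil",
                "The German manufacturers are observing different genes of Canola oil"));
        double[] vec = model.transform("Canada is famous for producing quality corn oil");
        System.out.println("Vector length for the new document: " + vec.length);
    }
}
```

With the two training sentences above, a new document always yields a vector with one entry per training word, so it stays compatible with whatever the classifier was trained on. As far as I understand, a FilteredClassifier plays this role in Weka: the filter is fitted when `buildClassifier` is called and is then reused for every instance passed to `classifyInstance`.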
