Question

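I'm very new to Mahout, and I'm working on classifying unstructured text documents.

I'm following this tutorial, which uses a Naive Bayes model. I have gotten as far as training my classifier, but I'm not sure how to convert a new document into a tfidf vector for classification.

My data is stored as a TSV file containing a label and the text that corresponds to it. I used seq2sparse to create the tfidf vectors needed to train the model.

I then trained the model on those tfidf vectors, which gave me a Naive Bayes model.

Now I have a new, unlabeled text document that I want to classify with the trained model, but I don't know how to convert it into a tfidf vector. If I run seq2sparse again, it will create a new set of dictionary files and so on, which I don't think will correspond to the dictionary created for the training set.

I have seen a manual implementation at https://github.com/fredang/mahout-naive-bayes-example/blob/master/src/main/java/com/chimpler/example/bayes/Classifier.java that builds the tfidf vector from the already created dictionary files and the label index, but I'd like to know whether Mahout already provides a way to do this, the same way it provides seq2sparse. I would rather use a supported method than do it by hand.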

Answer 1

Some sample code that may help:

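// Size the vector to the training dictionary; the no-arg constructor has cardinality 0
org.apache.mahout.math.Vector vector = new RandomAccessSparseVector(dictionary.size());
TFIDF tfidf = new TFIDF(); // org.apache.mahout.vectorizer.TFIDF

// Repeat for each word of the new document that is in the training dictionary
Integer wordId = dictionary.get(word); // index assigned to the word in the training dictionary
Long freq = documentFrequency.get(wordId); // assumed map: wordId => number of training documents containing it

double tfIdfValue = tfidf.calculate(count, freq.intValue(),
        wordCount, documentCount); // calculate tf*idf

vector.set(wordId, tfIdfValue);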

// Model is a matrix (wordId, labelId) => probability score
NaiveBayesModel model = NaiveBayesModel.materialize(
        new Path(modelPath), configuration);
StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(
        model);

// With the classifier, we get one score for each label. The label with
// the highest score is the one the document is most likely to be
// associated with
Vector resultVector = classifier.classifyFull(vector);
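
The per-word step above can be sketched without any Mahout classes, to show why the training dictionary and document frequencies have to be reused for a new document. This is only an illustration under stated assumptions: the class name, the method names and the toy data are hypothetical, and the weighting formula (sqrt(tf) * (ln(N / (df + 1)) + 1)) is an assumed Lucene-style formula that may not match what your Mahout version computes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class FixedDictionaryTfIdf {

    // Assumed weighting: sqrt(tf) * (ln(numDocs / (df + 1)) + 1).
    static double tfIdf(int termCount, int docFreq, int numDocs) {
        return Math.sqrt(termCount) * (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Builds wordId -> weight for a new document, using only the ids and
    // document frequencies produced at training time. Words that are not in
    // the training dictionary are skipped, since the model has no column for them.
    static Map<Integer, Double> vectorize(String[] tokens,
                                          Map<String, Integer> dictionary,
                                          Map<Integer, Integer> documentFrequency,
                                          int documentCount) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            Integer wordId = dictionary.get(token);
            if (wordId != null) {
                counts.merge(wordId, 1, Integer::sum);
            }
        }
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            int df = documentFrequency.getOrDefault(e.getKey(), 0);
            vector.put(e.getKey(), tfIdf(e.getValue(), df, documentCount));
        }
        return vector;
    }

    public static void main(String[] args) {
        // Toy stand-ins for the dictionary and df-count files written during training.
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("mahout", 0);
        dictionary.put("bayes", 1);
        Map<Integer, Integer> documentFrequency = new HashMap<>();
        documentFrequency.put(0, 3);
        documentFrequency.put(1, 1);

        String[] tokens = {"mahout", "bayes", "mahout", "unseen"};
        System.out.println(vectorize(tokens, dictionary, documentFrequency, 4));
    }
}
```

Because the word ids come from the training dictionary, each weight lands in the same column the model was trained on, which is what running the vectorizer from scratch on the new document would not guarantee.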

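Once the scores are available, the remaining step is to pick the highest one and map its index back to a label name through the label index. A minimal sketch, assuming the scores have been copied into a plain array and the label index into a map (the class name, `argMax`, and the example labels and scores are all hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class BestLabel {

    // Returns the index of the highest score; ties keep the earliest index.
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical label index (labelId -> label name) from training.
        Map<Integer, String> labels = new HashMap<>();
        labels.put(0, "sports");
        labels.put(1, "tech");
        labels.put(2, "politics");

        // Hypothetical scores, one per label, in label-id order.
        double[] scores = {-12.5, -3.2, -7.9};

        int bestId = argMax(scores);
        System.out.println("Predicted label: " + labels.get(bestId));
    }
}
```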
Mahout 0.9: classifying documents with Naive Bayes
