Question

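I am testing the OpenNLP library to automate content categorization, but I've run into trouble. With the code below, it always returns the first category in my training data, even when I pass in full articles from any news site.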

    public void trainModel() {
        try {
            InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory( new File("C:\\Users\\emehm\\Desktop\\data\\training_data.txt") );
            ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, TrainingParameters.defaultParams(), new DoccatFactory());
            DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
            double[] outcomes = myCategorizer.categorize(  new String[]{ this.getFileContent() });
            String category = myCategorizer.getBestCategory(outcomes);
            Map<String, Double> map = myCategorizer.scoreMap(new String[]{ this.getFileContent() });
            System.out.println(category);
        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        }
    }

    public String getFileContent() throws IOException {
        InputStream is = new FileInputStream("C:\\Users\\emehm\\Desktop\\data\\statija.txt");
        BufferedReader buf = new BufferedReader(new InputStreamReader(is));
        String line = buf.readLine();
        StringBuilder sb = new StringBuilder();
        while (line != null) {
            sb.append(line).append("\n");
            line = buf.readLine();
        }
        buf.close();
        return sb.toString();
    }

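Training data: http://pastebin.com/ZhxswkvJ

The article I'm using: http://pastebin.com/xtABGcbh

It always returns the first category in the list, and I'd like to know what I'm missing. When I debug it, every category gets a score of 0.25, and for some reason it picks the first of them. When I test with a single word it works fine, but it does not work for articles.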

Answer 1

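The input needs to be split into individual words, i.e. split on spaces.

Change this: double[] outcomes = myCategorizer.categorize( new String[]{ this.getFileContent() });

to this: double[] outcomes = myCategorizer.categorize( this.getFileContent().split(" ") );

After that, you should get better results. Note that accuracy still depends on the quality of the model.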
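One caveat on the split itself: getFileContent() in the question joins lines with "\n", so split(" ") will leave the last word of one line glued to the first word of the next, and double spaces produce empty tokens. The uniform 0.25 scores are also what you would expect if the whole article arrives as a single "word" that never appeared in training, although that is an inference from the numbers in the question rather than something verified against the model. Here is a minimal sketch of a whitespace-based split, assuming plain whitespace tokenization is good enough for your data; the class and method names (TokenizeForDoccat, tokenize) are made up for illustration and are not part of OpenNLP.

```java
import java.util.Arrays;

public class TokenizeForDoccat {

    // Hypothetical helper: split raw article text on any run of whitespace
    // (spaces, tabs, newlines) so that each array element is one word.
    public static String[] tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // Avoid returning a single empty token for blank input
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Made-up sample text with a line break and a double space
        String article = "Stocks fell sharply\non Monday as  markets reacted.\n";

        // split(" ") keeps "sharply\non" together and yields an empty token
        System.out.println(Arrays.toString(article.split(" ")));

        // Splitting on whitespace runs gives one clean token per word
        System.out.println(Arrays.toString(tokenize(article)));
    }
}
```

The resulting array can be passed to categorize(...) in place of this.getFileContent().split(" "). OpenNLP also ships opennlp.tools.tokenize.WhitespaceTokenizer, and WhitespaceTokenizer.INSTANCE.tokenize(text) should do much the same job if you would rather stay inside the library.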
