CoreNLP Sentiment training data in wrong format

Time: 2017-06-09 12:54:56

Tags: stanford-nlp

I'm trying to train my own sentiment analysis model for CoreNLP. I want to do this in Java code (not from the command line), so I copied pieces from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/BuildBinarizedDataset.java to prepare the data, and pieces from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/SentimentTraining.java to do the actual training. To understand what's going on, I condensed lines 171-226 of the former file into the following in my own code:

    String text = IOUtils.slurpFileNoExceptions(inputPath);
    String[] chunks = text.split("\\n\\s*\\n+"); // need blank line to
    for (String chunk : chunks) {
        if (chunk.trim().isEmpty()) {
            continue;
        }
        String[] lines = chunk.trim().split("\\n");
        String sentence = lines[0];
        StringReader sin = new StringReader(sentence);
        DocumentPreprocessor document = new DocumentPreprocessor(sin);
        document.setSentenceFinalPuncWords(new String[] { "\n" });
        List<HasWord> tokens = document.iterator().next();
        Integer mainLabel = new Integer(tokens.get(0).word());
        tokens = tokens.subList(1, tokens.size());
        Map<Pair<Integer, Integer>, String> spanToLabels = Generics.newHashMap();
        for (int i = 1; i < lines.length; ++i) {
            extractLabels(spanToLabels, tokens, lines[i]);
        }
        Tree tree = parser.apply(tokens);
        Tree binarized = binarizer.transformTree(tree);
        Tree collapsedUnary = transformer.transformTree(binarized);
        if (sentimentModel != null) {
            Trees.convertToCoreLabels(collapsedUnary);
            SentimentCostAndGradient scorer = new SentimentCostAndGradient(sentimentModel, null);
            scorer.forwardPropagateTree(collapsedUnary);
            setPredictedLabels(collapsedUnary);
        } else {
            setUnknownLabels(collapsedUnary, mainLabel);
        }
        Trees.convertToCoreLabels(collapsedUnary);
        collapsedUnary.indexSpans();
        for (Map.Entry<Pair<Integer, Integer>, String> pairStringEntry : spanToLabels.entrySet()) {
            setSpanLabel(collapsedUnary, pairStringEntry.getKey(), pairStringEntry.getValue());
        }

        //trainingTrees.add(collapsedUnary);
        System.out.println("Debugging collaped Unary:" + collapsedUnary);
    }

The println gives me something like:

> Debugging collaped Unary:(ROOT (NP (DT The) (NNS performances)) (@S (VP (VBP are) (ADJP (RB uniformly) (JJ good))) (. .)))

Whereas, from what I understand, it is supposed to look like this (sorry for using a different sentence to show the format):

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2

This format is explained in https://mailman.stanford.edu/pipermail/java-nlp-user/2013-November/004308.html and in the questions "stanford corenlp sentiment training set" and "How to train the Stanford NLP Sentiment Analysis tool", among others.
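To make the difference between the two formats concrete, here is a minimal sketch in plain Java (no CoreNLP dependency) that lists every node label in a bracketed tree string that is not a sentiment class from 0 to 4. The class `SentimentLabelCheck`, the method `invalidLabels`, and the sentiment labels in the second example string are made up for illustration; they are not part of CoreNLP or of any real training set.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of CoreNLP: finds node labels that are not sentiment classes 0-4. */
final class SentimentLabelCheck {

    /** Returns every node label in a bracketed tree string that is not a single digit from 0 to 4. */
    static List<String> invalidLabels(String bracketedTree) {
        List<String> invalid = new ArrayList<>();
        for (int i = 0; i < bracketedTree.length(); i++) {
            if (bracketedTree.charAt(i) != '(') {
                continue;
            }
            // The node label runs from just after '(' to the next whitespace or bracket.
            int end = i + 1;
            while (end < bracketedTree.length()
                    && !Character.isWhitespace(bracketedTree.charAt(end))
                    && bracketedTree.charAt(end) != '('
                    && bracketedTree.charAt(end) != ')') {
                end++;
            }
            String label = bracketedTree.substring(i + 1, end);
            if (!label.matches("[0-4]")) {
                invalid.add(label);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        // The tree printed by the debugging statement in the question.
        String parserOutput = "(ROOT (NP (DT The) (NNS performances)) "
                + "(@S (VP (VBP are) (ADJP (RB uniformly) (JJ good))) (. .)))";
        // Same shape with sentiment classes as labels (values invented for illustration).
        String sentimentFormat = "(3 (2 (2 The) (2 performances)) "
                + "(3 (3 (2 are) (3 (2 uniformly) (3 good))) (2 .)))";
        System.out.println(invalidLabels(parserOutput));
        System.out.println(invalidLabels(sentimentFormat));
    }
}
```

Every label of the first tree is reported, while the second tree yields an empty list, which is the property the training code appears to rely on.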

Nothing else happens to the tree after these lines in BuildBinarizedDataset. Can someone tell me how to get it into the right format? (Hacking something together myself feels wrong here; there must be something I'm missing.)

The error I get later on, in SentimentTraining, is:

Exception in thread "main" java.lang.NumberFormatException: For input string: "DT"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.valueOf(Integer.java:766)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:37)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:33)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:33)
at edu.stanford.nlp.sentiment.SentimentUtils.readTreesWithLabels(SentimentUtils.java:69)
at edu.stanford.nlp.sentiment.SentimentUtils.readTreesWithGoldLabels(SentimentUtils.java:50)
at de.dkt.eservices.esentimentanalysis.modules.CoreNLPSentimentAnalyzer.trainModel(CoreNLPSentimentAnalyzer.java:251)
at de.dkt.eservices.esentimentanalysis.modules.CoreNLPSentimentAnalyzer.main(CoreNLPSentimentAnalyzer.java:306)

which makes sense, given that it expects a number, but gets the label of the node in the tree...
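The failure can be reproduced outside CoreNLP. The sketch below only mirrors the `Integer.valueOf` call that appears in the stack trace; the class `LabelParseDemo` and the method `parseGoldClass` are hypothetical names, and this is not the actual code of `SentimentUtils.attachLabels`.

```java
/** Hypothetical demo, not CoreNLP code: shows why a POS tag cannot serve as a sentiment class. */
final class LabelParseDemo {

    /** Parses a node label as a sentiment class; returns null if the label is not a number. */
    static Integer parseGoldClass(String nodeLabel) {
        try {
            return Integer.valueOf(nodeLabel);
        } catch (NumberFormatException e) {
            // For "DT" this is the same exception type as in the stack trace above.
            System.out.println("Cannot use \"" + nodeLabel + "\" as a sentiment class: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseGoldClass("2"));  // a numeric label parses fine
        System.out.println(parseGoldClass("DT")); // a POS tag does not
    }
}
```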

Would be grateful for any pointers here!

1 answer:

Answer 0 (score: 0)

I haven't found a real solution, but in case anyone else runs into this problem, the following does the trick:

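public static Tree traverseTreeAndChangePosTagsToNumbers(Tree tree) {
    // Note: only the descendants of 'tree' are relabeled; the label of 'tree' itself is left as-is.
    for (Tree subtree : tree.getChildrenAsList()) {
        // Purely non-numeric labels (POS tags, phrase categories) become neutral ("2").
        if (subtree.label().toString().matches("\\D+")) {
            subtree.label().setValue("2");
        }
        // Numeric labels outside the 0-4 sentiment range also become neutral.
        // A label mixing digits and other characters would still throw a NumberFormatException here.
        int value = Integer.parseInt(subtree.label().toString());
        if (value < 0 || value > 4) {
            subtree.label().setValue("2");
        }
        // Stop at preterminals so the words themselves stay untouched.
        if (!subtree.isPreTerminal()) {
            traverseTreeAndChangePosTagsToNumbers(subtree);
        }
    }

    return tree;
}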

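It's not a nice solution, because it ignores the option of providing sentiment for spans (i.e. annotating sub-phrases in the tree, since the label for sub-phrases is always 2 (neutral)), so the sentiment is always based on the value for the whole sentence/tree. But at least it gets rid of the format error.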
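One way to reduce that limitation, sketched here under assumptions, is to replace only the labels that are not already valid sentiment classes, so that any span labels that did make it into the tree survive. The sketch works on the printed bracketed string rather than on a CoreNLP `Tree`; the class `SentimentLabelRewriter` and the method `relabel` are hypothetical names, and whether the result is acceptable to the training code has not been verified here.

```java
/** Hypothetical helper, not part of CoreNLP: rewrites node labels in a bracketed tree string. */
final class SentimentLabelRewriter {

    /** Replaces every node label that is not a digit 0-4 with defaultLabel; valid labels are kept. */
    static String relabel(String bracketedTree, int defaultLabel) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < bracketedTree.length()) {
            char c = bracketedTree.charAt(i);
            out.append(c);
            i++;
            if (c != '(') {
                continue;
            }
            // Read the node label that directly follows the opening bracket.
            int end = i;
            while (end < bracketedTree.length()
                    && !Character.isWhitespace(bracketedTree.charAt(end))
                    && bracketedTree.charAt(end) != '('
                    && bracketedTree.charAt(end) != ')') {
                end++;
            }
            String label = bracketedTree.substring(i, end);
            // Keep labels that are already sentiment classes, replace everything else.
            out.append(label.matches("[0-4]") ? label : String.valueOf(defaultLabel));
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example input invented for illustration: one span already carries the label 4.
        String mixed = "(ROOT (NP (DT The) (NNS performances)) (4 (VBP are) (JJ good)))";
        System.out.println(relabel(mixed, 2));
    }
}
```

Unlike the method above, this also rewrites the root label, and labels such as -1 fall back to the default in the same way as the out-of-range check does.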