Creating another train.txt to train a sentiment model for a different domain

Time: 2016-11-15 07:59:50

Tags: java nlp stanford-nlp sentiment-analysis

I found that the data used to train the sentiment model in train.txt is in PTB format, as shown below.

(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))

The actual sentence should be

Yet the act is still charming here.
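One way to check that a train.txt line really corresponds to the sentence is to read the leaf tokens back out of it. The following is a minimal, dependency-free sketch (it does not use CoreNLP; the class name `TrainLineReader` and its methods are made up for illustration), assuming every leaf has the form `(label token)` as in the line quoted above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrainLineReader {

    // A leaf looks like "(2 Yet)": a sentiment label followed by exactly one token.
    private static final Pattern LEAF = Pattern.compile("\\((\\d+)\\s+([^\\s()]+)\\)");

    // The root label is the first number after the opening parenthesis of the line.
    private static final Pattern ROOT = Pattern.compile("^\\s*\\((\\d+)");

    /** Returns the tokens of a labeled tree line, in sentence order. */
    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = LEAF.matcher(line);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    /** Returns the sentiment label attached to the whole sentence. */
    public static int rootLabel(String line) {
        Matcher m = ROOT.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a labeled tree: " + line);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        String line = "(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))";
        System.out.println(String.join(" ", tokens(line)));
        System.out.println("root label = " + rootLabel(line));
    }
}
```

On the example line this prints `Yet the act is still charming here .` followed by `root label = 3`, so the period is a separate token in the training data.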

But after parsing it I get a different structure

(ROOT (S (CC Yet) (NP (DT the) (NN act)) (VP (VBZ is) (ADJP (RB still) (JJ charming)) (ADVP (RB here))) (. .)))

Here is my code:

public static void main(String args[]){
    // creates a StanfordCoreNLP pipeline with tokenization, sentence splitting, and constituency parsing
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit,parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // read some text in the text variable
    String text = "Yet the act is still charming here .";// Add your text here!

    // create an empty Annotation just with the given text
    Annotation annotation = new Annotation(text);

    // run all Annotators on this text

    pipeline.annotate(annotation);

    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);

    // int sentiment = 0;
    for(CoreMap sentence: sentences) {
        // traversing the words in the current sentence
        Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
        System.out.println(tree);
        // System.out.println(tree.yield());
        tree.pennPrint(System.out);
        // Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
        // sentiment = RNNCoreAnnotations.getPredictedClass(tree);
    }

    // System.out.print(sentiment);
}

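Two problems come up when I try to create train.txt from my own sentences.

1. My tree is different from the trees in train.txt. I know the numbers in the latter are sentiment polarities, but the tree structure itself also seems different. I want a binarized parse tree that looks like this: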

((Yet) (((the) (act)) ((((is) ((still) (charming))) (here)) (.))))

Once I have the sentiment numbers, I can fill them in to get my own train.txt.
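The fill-in step could be sketched as follows, without CoreNLP. It assumes the unlabeled tree uses the bracket form shown above and that there is one label per phrase, keyed by the space-joined tokens. `LabelFiller`, `fill`, and `exampleLabels` are hypothetical names, and the labels in `exampleLabels` are simply copied from the train.txt line quoted earlier:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LabelFiller {

    private final String src;
    private final Map<String, Integer> labels;
    private int pos = 0;

    private LabelFiller(String src, Map<String, Integer> labels) {
        this.src = src;
        this.labels = labels;
    }

    /** Rewrites an unlabeled bracketed tree into the labeled train.txt form. */
    public static String fill(String unlabeled, Map<String, Integer> labels) {
        LabelFiller filler = new LabelFiller(unlabeled, labels);
        StringBuilder out = new StringBuilder();
        filler.node(out);
        return out.toString();
    }

    // Parses one "( ... )" node at pos, appends its labeled form to out,
    // and returns the phrase (space-joined tokens) that the node covers.
    private String node(StringBuilder out) {
        expect('(');
        StringBuilder body = new StringBuilder();
        List<String> parts = new ArrayList<>();
        skipSpaces();
        if (peek() == '(') {
            // Internal node: parse child nodes until the closing parenthesis.
            while (peek() == '(') {
                body.append(' ');
                parts.add(node(body));
                skipSpaces();
            }
        } else {
            // Leaf: a single token that runs up to the closing parenthesis.
            int start = pos;
            while (peek() != ')') {
                pos++;
            }
            String token = src.substring(start, pos).trim();
            body.append(' ').append(token);
            parts.add(token);
        }
        expect(')');
        String phrase = String.join(" ", parts);
        Integer label = labels.get(phrase);
        if (label == null) {
            // Fail loudly rather than invent a label for an unannotated phrase.
            throw new IllegalStateException("No label for phrase: " + phrase);
        }
        out.append('(').append(label).append(body).append(')');
        return phrase;
    }

    private char peek() {
        if (pos >= src.length()) {
            throw new IllegalArgumentException("Unbalanced parentheses in: " + src);
        }
        return src.charAt(pos);
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }

    private void expect(char c) {
        skipSpaces();
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    /** Labels for the example sentence, taken from the train.txt line in the question. */
    public static Map<String, Integer> exampleLabels() {
        Map<String, Integer> m = new HashMap<>();
        for (String w : new String[] {"Yet", "the", "act", "the act", "is", "still", "here", "."}) {
            m.put(w, 2);
        }
        m.put("charming", 4);
        m.put("still charming", 3);
        m.put("is still charming", 3);
        m.put("is still charming here", 4);
        m.put("is still charming here .", 3);
        m.put("the act is still charming here .", 3);
        m.put("Yet the act is still charming here .", 3);
        return m;
    }

    public static void main(String[] args) {
        String unlabeled = "((Yet) (((the) (act)) ((((is) ((still) (charming))) (here)) (.))))";
        System.out.println(fill(unlabeled, exampleLabels()));
    }
}
```

With these labels the output is the same line as the train.txt example. Keying labels by phrase text means a phrase that occurs twice gets the same label both times, which may or may not suit a given annotation scheme.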

2. How do I get the phrase at every node of the binarized parse tree? For this example, I should get:

Yet
the 
act
the act
is
still 
charming 
still charming 
is still charming
here
is still charming here
.
is still charming here .
the act is still charming here .
Yet the act is still charming here .

Once I have them, I can pay human annotators to label them.
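For illustration, the phrase list above can be produced directly from the unlabeled bracket string with a small stack-based pass. This is only a sketch under assumptions: it does not use CoreNLP, it expects the `((Yet) (...))` form from point 1 (a tree that still carries sentiment or category labels would have those labels treated as tokens), and `PhraseLister` is a made-up name:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PhraseLister {

    /** Lists the phrase under every node of a bracketed tree, children before parents. */
    public static List<String> phrases(String tree) {
        List<String> result = new ArrayList<>();
        // One token list per node that has been opened but not yet closed.
        Deque<List<String>> open = new ArrayDeque<>();
        StringBuilder token = new StringBuilder();
        for (char c : tree.toCharArray()) {
            if (c != '(' && c != ')' && !Character.isWhitespace(c)) {
                token.append(c);
                continue;
            }
            // Any delimiter ends the token currently being read, if there is one.
            if (token.length() > 0) {
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("Token outside of any node: " + token);
                }
                open.peek().add(token.toString());
                token.setLength(0);
            }
            if (c == '(') {
                open.push(new ArrayList<>());
            } else if (c == ')') {
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("Unbalanced parentheses in: " + tree);
                }
                // Closing a node: emit its phrase and hand its tokens to the parent.
                List<String> done = open.pop();
                result.add(String.join(" ", done));
                if (!open.isEmpty()) {
                    open.peek().addAll(done);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String tree = "((Yet) (((the) (act)) ((((is) ((still) (charming))) (here)) (.))))";
        for (String phrase : phrases(tree)) {
            System.out.println(phrase);
        }
    }
}
```

For the example tree this yields 15 phrases in the same order as the list above, ending with the full tokenized sentence. Before sending phrases to annotators it may be worth de-duplicating them across sentences, for example by collecting them into a `LinkedHashSet`.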

I have searched Google a lot but could not solve these problems, so I am posting here. Any helpful answer would be appreciated!

1 Answer:

Answer 0: (score: 2)

Add this to the properties to get binary trees:

props.setProperty("parse.binaryTrees", "true");

You can then access the sentence's binarized tree like this:

Tree tree = sentence.get(TreeCoreAnnotations.BinarizedTreeAnnotation.class);

Here is some sample code I wrote:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.*;

import java.util.ArrayList;
import java.util.Properties;

public class SubTreesExample {

    public static void printSubTrees(Tree inputTree, String spacing) {
        if (inputTree.isLeaf()) {
            return;
        }
        ArrayList<Word> words = new ArrayList<Word>();
        for (Tree leaf : inputTree.getLeaves()) {
            words.addAll(leaf.yieldWords());
        }
        System.out.print(spacing+inputTree.label()+"\t");
        for (Word w : words) {
            System.out.print(w.word()+ " ");
        }
        System.out.println();
        for (Tree subTree : inputTree.children()) {
            printSubTrees(subTree, spacing + " ");
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
        props.setProperty("parse.binaryTrees", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "Yet the act is still charming here.";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        Tree sentenceTree = annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0).get(
                TreeCoreAnnotations.BinarizedTreeAnnotation.class);
        System.out.println("Penn tree:");
        sentenceTree.pennPrint(System.out);
        System.out.println();
        System.out.println("Phrases:");
        printSubTrees(sentenceTree, "");

    }
}