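I am new to natural language processing (NLP). I want to do part-of-speech (POS) tagging and then find a specific structure in the text. I can get the POS tags with Stanford-NLP, but I don't know how to extract this structure: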
NN/NNS + IN + DT + NN/NNS/NNP/NNPS
public static void main(String args[]) throws Exception {
    // input file
    String contentFilePath = "";
    // output file: the input path without its 4-character extension, plus "_postagg.txt"
    String triplesFilePath = contentFilePath.substring(0, contentFilePath.length() - 4) + "_postagg.txt";
    // document to POS-tag
    String content = getFileContent(contentFilePath);
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // Annotate the document.
    Annotation doc = new Annotation(content);
    pipeline.annotate(doc);
    // Iterate over the sentences and print each token with its POS tag.
    List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            // this is the POS tag of the token
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
            System.out.println(word + "/" + pos);
        }
    }
}
}
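Answer 0 (score: 1)
You can simply iterate over your sentences and check the POS tags of each run of four consecutive tokens. If they match your pattern, you can extract the structure. The code could look like this: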
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
    // Stop three tokens before the end so that index i + 3 is always valid.
    for (int i = 0; i < tokens.size() - 3; i++) {
        String pos = tokens.get(i).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
        if (pos.equals("NN") || pos.equals("NNS")) {
            pos = tokens.get(i + 1).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
            if (pos.equals("IN")) {
                pos = tokens.get(i + 2).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                if (pos.equals("DT")) {
                    pos = tokens.get(i + 3).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // Covers NN, NNS, NNP and NNPS.
                    if (pos.startsWith("NN")) {
                        // We have a match starting at index i and ending at index i + 3
                        String word1 = tokens.get(i).getString(CoreAnnotations.TextAnnotation.class);
                        String word2 = tokens.get(i + 1).getString(CoreAnnotations.TextAnnotation.class);
                        String word3 = tokens.get(i + 2).getString(CoreAnnotations.TextAnnotation.class);
                        String word4 = tokens.get(i + 3).getString(CoreAnnotations.TextAnnotation.class);
                        System.out.println(word1 + " " + word2 + " " + word3 + " " + word4);
                    }
                }
            }
        }
    }
}
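If the pattern is likely to change or grow, the nested if statements can be replaced by a small table of allowed tags per position. The following is only a sketch of that idea, not something CoreNLP provides: PosPatternMatcher, PATTERN and findMatches are hypothetical names, and the sample sentence is tagged by hand so the example runs without the library. With CoreNLP, the two lists would presumably be filled from the TextAnnotation and PartOfSpeechAnnotation values of each token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of CoreNLP): slides a window over the tag
// sequence and checks each position against a set of allowed POS tags.
final class PosPatternMatcher {

    // One set of allowed tags per position: NN/NNS + IN + DT + NN/NNS/NNP/NNPS.
    static final List<Set<String>> PATTERN = Arrays.asList(
            tagSet("NN", "NNS"),
            tagSet("IN"),
            tagSet("DT"),
            tagSet("NN", "NNS", "NNP", "NNPS"));

    private static Set<String> tagSet(String... tags) {
        return new HashSet<>(Arrays.asList(tags));
    }

    // words and tags are parallel lists for one sentence.
    // Returns the text of every window whose tags match PATTERN.
    static List<String> findMatches(List<String> words, List<String> tags) {
        List<String> matches = new ArrayList<>();
        int length = PATTERN.size();
        for (int start = 0; start + length <= tags.size(); start++) {
            boolean ok = true;
            for (int k = 0; k < length && ok; k++) {
                ok = PATTERN.get(k).contains(tags.get(start + k));
            }
            if (ok) {
                matches.add(String.join(" ", words.subList(start, start + length)));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-tagged sample sentence (an assumption made for illustration).
        List<String> words = Arrays.asList("The", "color", "of", "the", "car", "is", "red");
        List<String> tags = Arrays.asList("DT", "NN", "IN", "DT", "NN", "VBZ", "JJ");
        System.out.println(findMatches(words, tags)); // prints [color of the car]
    }
}
```

Changing the structure then only means editing PATTERN; the loop stays the same, and a sentence shorter than the pattern simply yields no matches.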