Question

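I am looking for sentences such as

Bachelor's degree in early childhood teaching, psychology

I annotate the text with the Stanford Parser.
I then iterate over each sentence and use NER (named entity recognition) to identify "Bachelor's degree".
By processing the triples, I can see that the object follows "BE IN" and is probably a college major,
so I send the object phrase on for further analysis. My trouble is that I don't know how to separate

early childhood teaching

from

psychology

The code for this step loops over the tokens of the triple's object and keeps the phrase only if every token meets certain POS requirements.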

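// Requires edu.stanford.nlp.ling.CoreLabel, edu.stanford.nlp.ling.CoreAnnotations and java.util.List.
private void processTripleObject(List<CoreLabel> objectPhrase)
{
    try
    {
        StringBuilder sb = new StringBuilder();
        for (CoreLabel token : objectPhrase)
        {
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);

            TALog.getLogger().debug("pos: " + pos + "  word " + token.word());
            // Abandon the whole phrase as soon as one token fails the POS check.
            if (!matchDegreeNameByPos(pos))
            {
                return;
            }

            sb.append(token.word());
            sb.append(SPACE);
        }

        // Every token passed: wrap the phrase in an IdentifiedToken tagged as a SKILL.
        IdentifiedToken itoken = new IdentifiedToken(IdentifiedToken.SKILL, sb.toString());

    }
    catch (Exception e)
    {
        TALog.getLogger().error(e.getMessage(), e);
    }
}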

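Since the comma between "teaching" and "psychology" is not among the tokens, I don't know how to detect where the split is.

Can anyone offer advice?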

Answer 1

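Note that for punctuation tokens, token.get(CoreAnnotations.PartOfSpeechAnnotation.class) returns a punctuation tag rather than a word-class tag; a comma, for instance, is tagged ",". I tested this with CoreNLP 3.7.0 and the "tokenize ssplit pos" annotators. You can then check whether pos occurs in a string of the punctuation marks you are interested in. Be aware that in the Penn Treebank tag set the tag is not always identical to the token (sentence-final "!" and "?" are tagged "."), so compare token.word() instead if you need the exact character. For example, some code I just tested: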

// `document` is an Annotation that has already been run through the pipeline.
String punctuations = ".,;!?";
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        // pos could be "NN" but could also be ","
        String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
        // Guard against a missing tag before testing for punctuation.
        if (pos != null && punctuations.contains(pos)) {
            // do something with it
        }
    }
}
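
Once the punctuation tokens are visible, splitting the phrase into separate majors could look roughly like the sketch below. It is only an illustration under a few assumptions: plain strings stand in for the values of token.word(), the class name PhraseSplitter and the method splitOnPunctuation are hypothetical and not part of CoreNLP, and the comma must actually be present in the token list you pass in (for example the sentence's own tokens rather than the triple's object phrase).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
final class PhraseSplitter {

    // Splits a token sequence into segments at every token that is a single
    // punctuation character listed in `punctuations`. Delimiters are dropped
    // and empty segments are skipped.
    static List<List<String>> splitOnPunctuation(List<String> words, String punctuations) {
        List<List<String>> segments = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String word : words) {
            boolean isDelimiter = word.length() == 1
                    && punctuations.indexOf(word.charAt(0)) >= 0;
            if (isDelimiter) {
                // Close the running segment, if it has any words in it.
                if (!current.isEmpty()) {
                    segments.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(word);
            }
        }
        // Keep whatever follows the last delimiter.
        if (!current.isEmpty()) {
            segments.add(current);
        }
        return segments;
    }

    public static void main(String[] args) {
        // Stand-in for the words of the tokens after "Bachelor's degree in".
        List<String> words = Arrays.asList(
                "early", "childhood", "teaching", ",", "psychology");
        System.out.println(splitOnPunctuation(words, ".,;!?"));
    }
}
```

With real CoreLabel tokens you would collect token.word() for each token first, or adapt the loop to test the POS tag as in the snippet above; each resulting segment could then be passed to something like processTripleObject separately.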
