Question

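How can the IOB (Inside, Outside, Beginning) annotation format, as in "John/B-PERSON Doe/I-PERSON ...", be converted into some other format that can be consumed from Java?

I could not find this in the documentation of the relevant Stanford NLP classes: IOBUtils and CoNLLDocumentReaderAndWriter.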

Answer 1

Here is a method I wrote to show how CoNLLDocumentReaderAndWriter is used:

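import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter;
import edu.stanford.nlp.sequences.SeqClassifierFlags;

public static Iterator<List<CoreLabel>> loadCoNLLDocuments(String filePath) throws IOException {
    SeqClassifierFlags inputFlags = new SeqClassifierFlags();
    // Target tag style; "noprefix" strips the B-/I- prefixes from the labels.
    inputFlags.entitySubclassification = "noprefix";
    inputFlags.retainEntitySubclassification = true;
    CoNLLDocumentReaderAndWriter rw = new CoNLLDocumentReaderAndWriter();
    rw.init(inputFlags);
    return rw.getIterator(IOUtils.readerFromString(filePath));
}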

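The key is the value you give entitySubclassification. In this example I convert the tags to the no-prefix style (e.g. ORG, PER, MISC).

So, for example, if your input is in IOB and you set entitySubclassification to "noprefix", the CoreLabels will have the prefixes removed.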
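To make the effect of "noprefix" concrete, here is a minimal plain-Java sketch of the same idea. It does not use Stanford NLP, and IobPrefixStripper is a hypothetical helper name of my own, not the library's implementation; it assumes every prefixed tag has the form of one letter, a hyphen, then the entity type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Stanford NLP): mimics what a
// "noprefix" style conversion does to a sequence of tags.
final class IobPrefixStripper {

    // Drops a one-letter position prefix such as "B-" or "I-".
    // "O" and tags without a prefix are returned unchanged.
    static String stripPrefix(String tag) {
        if (tag.length() > 2 && tag.charAt(1) == '-') {
            return tag.substring(2);
        }
        return tag;
    }

    // Applies stripPrefix to every tag, preserving order.
    static List<String> stripAll(List<String> tags) {
        List<String> out = new ArrayList<>();
        for (String tag : tags) {
            out.add(stripPrefix(tag));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [PERSON, PERSON, O, ORG]
        System.out.println(stripAll(Arrays.asList("B-PERSON", "I-PERSON", "O", "B-ORG")));
    }
}
```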

All the options are listed in IOBUtils, in the method entitySubclassify:

iob1, iob2, bio, ioe1, ioe2, io, sbieo, iobes, noprefix, bilou
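As an illustration of what moving between two of these styles involves, here is a rough sketch of an IOB1 to IOB2 conversion in plain Java. It rests on the usual CoNLL conventions as I understand them: in IOB1 a B- tag appears only when an entity directly follows another entity of the same type, while in IOB2 every entity starts with B-. Iob1ToIob2 is a hypothetical class written for this answer, not the code inside IOBUtils.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the IOBUtils implementation): rewrites
// IOB1 tags so that every entity starts with a B- tag (IOB2).
final class Iob1ToIob2 {

    // Returns the entity type of a prefixed tag, e.g. "PER" for "I-PER".
    static String entityType(String tag) {
        return (tag.length() > 2 && tag.charAt(1) == '-') ? tag.substring(2) : "";
    }

    static List<String> convert(List<String> tags) {
        List<String> out = new ArrayList<>();
        String prev = "O"; // tag of the previous token in the input
        for (String tag : tags) {
            if (tag.startsWith("I-")) {
                // An I- tag continues an entity only if the previous
                // token belonged to an entity of the same type.
                boolean continues = !prev.equals("O")
                        && entityType(prev).equals(entityType(tag));
                out.add(continues ? tag : "B-" + entityType(tag));
            } else {
                out.add(tag); // "O" and existing B- tags are kept as-is
            }
            prev = tag;
        }
        return out;
    }
}
```

For example, the IOB1 sequence I-PER, I-PER, O, I-ORG becomes B-PER, I-PER, O, B-ORG.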

To test this, take an input file in one style and try converting it to another. You can then look at the tag on each token:

// The documents are List<CoreLabel>, token is a CoreLabel

String tokenTag = token.get(CoreAnnotations.AnswerAnnotation.class);

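When I printed this out with the code above, I saw the correct conversion!

Note that the input file is expected to have one "token\tner_tag" pair per line, e.g. "John\tI-PERSON", although the reader can also handle various CoNLL-style input formats, including the original "token\tpos\tchunk\tner_tag" per-line format.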

Answer 2

    String sent = "When/O the/O last/O time/O you/O ran/O into/O Rick/B-person Ross/I-person and/O Drake/B-person twice/O in/O the/O same/O day/O at/O 2/O diff/O video/O shoot/O locations/O ./O Today/O I/O did/O !/O";

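    // Split on both '/' and ' ' so that tokens and tags alternate:
    // token, tag, token, tag, ... (assumes no token contains '/').
    String[] result = sent.split("/| ");
    for (int x = 0; x + 1 < result.length; x++) {
        if (result[x].length() != 0) {
            // Print one "token<TAB>tag" pair per line.
            System.out.print(result[x]);
            x++;
            System.out.println("\t" + result[x]);
        }
    }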
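If the goal is to get the entities themselves rather than a tab-separated dump, the same slash-tagged input can be grouped into spans. The following is only a sketch under a few assumptions: tokens are separated by whitespace, the tag follows the last '/' of each item, and SlashTaggedParser is a hypothetical name of my own, not a Stanford NLP class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: groups "token/TAG" items into entity spans,
// returned as strings of the form "type: entity text".
final class SlashTaggedParser {

    static List<String> extractEntities(String sentence) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = null; // text of the entity being built
        String currentType = null;    // its type, e.g. "person"
        for (String item : sentence.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                continue; // item has no tag; skipped in this sketch
            }
            String token = item.substring(0, slash);
            String tag = item.substring(slash + 1);
            // An I- tag of the same type extends the open entity.
            if (tag.startsWith("I-") && current != null
                    && tag.substring(2).equals(currentType)) {
                current.append(' ').append(token);
                continue;
            }
            // Any other tag closes the open entity, if there is one.
            if (current != null) {
                entities.add(currentType + ": " + current);
                current = null;
                currentType = null;
            }
            // A B- tag (or a stray I- tag) opens a new entity.
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                currentType = tag.substring(2);
                current = new StringBuilder(token);
            }
        }
        if (current != null) {
            entities.add(currentType + ": " + current);
        }
        return entities;
    }
}
```

Called on the sentence above, this should yield two spans, "person: Rick Ross" and "person: Drake". Splitting on the last '/' rather than on every '/' keeps tokens that contain a slash themselves intact.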

Named Entity Recognition IOB annotation conversion
