如何使用HeidelTime作为UIMA AnalysisEngine和DKPro

时间:2017-04-26 14:54:40

标签: java maven nlp uima dkpro-core

我目前正在开展一个项目,从文字来源中提取传记信息。一步是对源的注释,以查看其中的实际内容。要做到这一点,我想使用HeidelTime,因为它documentation表示它非常适合UIMA管道。因为我还是NLP的初学者,所以我已经涉足DKPro Core Framework,到目前为止已经提供了对我想要的所有组件的方便访问,包括如下所示在管道中包装:

public static void main(String[] args) throws UIMAException, IOException {
    Path inputDir = Paths.get(args[0]);
    String language = args[1];
    String fileForm = String.format("[+]*%s", args[2]);
    Path outputFile = Paths.get(args[3]);

    CollectionReader reader = createReader(TextReader.class,
            TextReader.PARAM_SOURCE_LOCATION, inputDir.toString(),
            TextReader.PARAM_LANGUAGE, language,
            TextReader.PARAM_PATTERNS, new String[]{fileForm});
    AnalysisEngineDescription segmenter = createEngineDescription(StanfordSegmenter.class,
            StanfordSegmenter.PARAM_LANGUAGE, language,
            StanfordSegmenter.PARAM_WRITE_SENTENCE, true,
            StanfordSegmenter.PARAM_WRITE_TOKEN, true
    );
    AnalysisEngineDescription ner = createEngineDescription(StanfordNamedEntityRecognizer.class);
    AnalysisEngineDescription writer = createEngineDescription(TokenizedTextWriter.class,
            TokenizedTextWriter.PARAM_TARGET_LOCATION, outputFile.toString(),
            TokenizedTextWriter.PARAM_OVERWRITE, true,
            TokenizedTextWriter.PARAM_EXTENSION, ".txt"
    );
    runPipeline(reader, segmenter, ner, writer);
}

文档说明HeidelTime的主要分析类实现了必要的接口,所以我添加了它,包括建议的预处理和后处理AnalysisEngines:

public static void main(String[] args) throws UIMAException, IOException {
    Path inputDir = Paths.get(args[0]);
    String language = args[1];
    String fileForm = String.format("[+]*%s", args[2]);
    Path outputFile = Paths.get(args[3]);

    CollectionReader reader = createReader(TextReader.class,
            TextReader.PARAM_SOURCE_LOCATION, inputDir.toString(),
            TextReader.PARAM_LANGUAGE, language,
            TextReader.PARAM_PATTERNS, new String[]{fileForm});
    AnalysisEngineDescription segmenter = createEngineDescription(StanfordSegmenter.class,
            StanfordSegmenter.PARAM_LANGUAGE, language,
            StanfordSegmenter.PARAM_WRITE_SENTENCE, true,
            StanfordSegmenter.PARAM_WRITE_TOKEN, true
    );
    AnalysisEngineDescription ner = createEngineDescription(StanfordNamedEntityRecognizer.class);
    // ======= HeidelTime ======
    AnalysisEngineDescription treeTagger = createEngineDescription(TreeTaggerWrapper.class);
    AnalysisEngineDescription heidelTime = createEngineDescription(HeidelTime.class);
    AnalysisEngineDescription intervalTagger = createEngineDescription(IntervalTagger.class);
    // ======= HeidelTime ======
    AnalysisEngineDescription writer = createEngineDescription(TokenizedTextWriter.class,
            TokenizedTextWriter.PARAM_TARGET_LOCATION, outputFile.toString(),
            TokenizedTextWriter.PARAM_OVERWRITE, true,
            TokenizedTextWriter.PARAM_EXTENSION, ".txt"
    );
    runPipeline(reader, segmenter, ner, treeTagger, heidelTime, intervalTagger, writer);
}

然而,当我运行它时,我遇到以下错误:

1016 [main] WARN  org.apache.uima.resource.metadata.TypeSystemDescription  - [jar:file:/C:/Users/User/.m2/repository/com/github/heideltime/heideltime/2.2.1/heideltime-2.2.1.jar!/desc/type/HeidelTime_TypeSystemStyleMap.xml] is not a type file. Ignoring.
org.apache.uima.util.InvalidXMLException: Invalid descriptor at jar:file:/C:/Users/User/.m2/repository/com/github/heideltime/heideltime/2.2.1/heideltime-2.2.1.jar!/desc/type/HeidelTime_TypeSystemStyleMap.xml.
    at org.apache.uima.util.impl.XMLParser_impl.parse(XMLParser_impl.java:218)
    at org.apache.uima.util.impl.XMLParser_impl.parseTypeSystemDescription(XMLParser_impl.java:729)
    at org.apache.uima.util.impl.XMLParser_impl.parseTypeSystemDescription(XMLParser_impl.java:718)
    at org.apache.uima.fit.factory.TypeSystemDescriptionFactory.createTypeSystemDescription(TypeSystemDescriptionFactory.java:107)
    at org.apache.uima.fit.factory.CollectionReaderFactory.createReader(CollectionReaderFactory.java:213)
    at de.uniba.minf.msc.stemper.corpus.pantheon.Pipeline.main(Pipeline.java:37)
Caused by: org.apache.uima.util.InvalidXMLException: The XML parser encountered an unknown element type: styleMap.
    at org.apache.uima.util.impl.XMLParser_impl.buildObject(XMLParser_impl.java:301)
    at org.apache.uima.util.impl.SaxDeserializer_impl.getObject(SaxDeserializer_impl.java:142)
    at org.apache.uima.util.impl.XMLParser_impl.parse(XMLParser_impl.java:209)
    ... 5 more

HeidelTime组件似乎无法与其他分析引擎正确翻译。文档说它应该,但是存储库中缺少负责类,也可能来自我提取的Maven工件。我不知道从哪里开始寻找解决方案,到目前为止,我没有发现任何暗示在线方向,除了一些老问题如何使用独立here和{{ 3}}

0 个答案:

没有答案