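I am currently working on a project that extracts biographical information from textual sources. One step is to annotate the sources to see what they actually contain. For this I want to use HeidelTime, because its documentation says it fits well into UIMA pipelines. Since I am still a beginner in NLP, I have been working with the DKPro Core framework, which so far has given me convenient access to all the components I need, including wrapping them in a pipeline as shown below: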

    public static void main(String[] args) throws UIMAException, IOException {
        Path inputDir = Paths.get(args[0]);
        String language = args[1];
        String fileForm = String.format("[+]*%s", args[2]);
        Path outputFile = Paths.get(args[3]);

        CollectionReader reader = createReader(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, inputDir.toString(),
                TextReader.PARAM_LANGUAGE, language,
                TextReader.PARAM_PATTERNS, new String[]{fileForm});

        AnalysisEngineDescription segmenter = createEngineDescription(StanfordSegmenter.class,
                StanfordSegmenter.PARAM_LANGUAGE, language,
                StanfordSegmenter.PARAM_WRITE_SENTENCE, true,
                StanfordSegmenter.PARAM_WRITE_TOKEN, true
        );

        AnalysisEngineDescription ner = createEngineDescription(StanfordNamedEntityRecognizer.class);

        AnalysisEngineDescription writer = createEngineDescription(TokenizedTextWriter.class,
                TokenizedTextWriter.PARAM_TARGET_LOCATION, outputFile.toString(),
                TokenizedTextWriter.PARAM_OVERWRITE, true,
                TokenizedTextWriter.PARAM_EXTENSION, ".txt"
        );

        runPipeline(reader, segmenter, ner, writer);
    }

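The documentation states that HeidelTime's main analysis class implements the necessary interfaces, so I added it, along with the suggested pre- and post-processing AnalysisEngines: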

    public static void main(String[] args) throws UIMAException, IOException {
        Path inputDir = Paths.get(args[0]);
        String language = args[1];
        String fileForm = String.format("[+]*%s", args[2]);
        Path outputFile = Paths.get(args[3]);

        CollectionReader reader = createReader(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, inputDir.toString(),
                TextReader.PARAM_LANGUAGE, language,
                TextReader.PARAM_PATTERNS, new String[]{fileForm});

        AnalysisEngineDescription segmenter = createEngineDescription(StanfordSegmenter.class,
                StanfordSegmenter.PARAM_LANGUAGE, language,
                StanfordSegmenter.PARAM_WRITE_SENTENCE, true,
                StanfordSegmenter.PARAM_WRITE_TOKEN, true
        );

        AnalysisEngineDescription ner = createEngineDescription(StanfordNamedEntityRecognizer.class);

        // ======= HeidelTime ======
        AnalysisEngineDescription treeTagger = createEngineDescription(TreeTaggerWrapper.class);
        AnalysisEngineDescription heidelTime = createEngineDescription(HeidelTime.class);
        AnalysisEngineDescription intervalTagger = createEngineDescription(IntervalTagger.class);
        // ======= HeidelTime ======

        AnalysisEngineDescription writer = createEngineDescription(TokenizedTextWriter.class,
                TokenizedTextWriter.PARAM_TARGET_LOCATION, outputFile.toString(),
                TokenizedTextWriter.PARAM_OVERWRITE, true,
                TokenizedTextWriter.PARAM_EXTENSION, ".txt"
        );

        runPipeline(reader, segmenter, ner, treeTagger, heidelTime, intervalTagger, writer);
    }

However, when I run it, I get the following error:

    1016 [main] WARN org.apache.uima.resource.metadata.TypeSystemDescription - [jar:file:/C:/Users/User/.m2/repository/com/github/heideltime/heideltime/2.2.1/heideltime-2.2.1.jar!/desc/type/HeidelTime_TypeSystemStyleMap.xml] is not a type file. Ignoring.
    org.apache.uima.util.InvalidXMLException: Invalid descriptor at jar:file:/C:/Users/User/.m2/repository/com/github/heideltime/heideltime/2.2.1/heideltime-2.2.1.jar!/desc/type/HeidelTime_TypeSystemStyleMap.xml.
        at org.apache.uima.util.impl.XMLParser_impl.parse(XMLParser_impl.java:218)
        at org.apache.uima.util.impl.XMLParser_impl.parseTypeSystemDescription(XMLParser_impl.java:729)
        at org.apache.uima.util.impl.XMLParser_impl.parseTypeSystemDescription(XMLParser_impl.java:718)
        at org.apache.uima.fit.factory.TypeSystemDescriptionFactory.createTypeSystemDescription(TypeSystemDescriptionFactory.java:107)
        at org.apache.uima.fit.factory.CollectionReaderFactory.createReader(CollectionReaderFactory.java:213)
        at de.uniba.minf.msc.stemper.corpus.pantheon.Pipeline.main(Pipeline.java:37)
    Caused by: org.apache.uima.util.InvalidXMLException: The XML parser encountered an unknown element type: styleMap.
        at org.apache.uima.util.impl.XMLParser_impl.buildObject(XMLParser_impl.java:301)
        at org.apache.uima.util.impl.SaxDeserializer_impl.getObject(SaxDeserializer_impl.java:142)
        at org.apache.uima.util.impl.XMLParser_impl.parse(XMLParser_impl.java:209)
        ... 5 more

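It seems that the HeidelTime component does not integrate properly with the other analysis engines. The documentation says it should, but the responsible class appears to be missing from the repository, and possibly also from the Maven artifact I pulled. I don't know where to start looking for a solution, and so far I have not found anything online that points in the right direction, apart from a few old questions about how to use the standalone version.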
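
To make the `Caused by` line in the trace concrete: the parser was asked to read `HeidelTime_TypeSystemStyleMap.xml` as a type system descriptor and stopped at its root element, `styleMap`. A UIMA type system descriptor has `typeSystemDescription` as its root element. The following is only a minimal sketch of that check using the JDK's XML parser; it is not what UIMA does internally, the class and method names are hypothetical, and the two XML strings are made-up stand-ins rather than the real file contents.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class RootElementCheck {
    // Root element name of a UIMA type system descriptor.
    static final String TYPE_SYSTEM_ROOT = "typeSystemDescription";

    // Returns the local name of the XML document's root element.
    static String rootElementName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // True if the document's root element matches that of a type system descriptor.
    static boolean looksLikeTypeSystem(String xml) {
        return TYPE_SYSTEM_ROOT.equals(rootElementName(xml));
    }

    public static void main(String[] args) {
        // Made-up stand-ins, not the actual files from the HeidelTime jar.
        String typeSystem = "<typeSystemDescription xmlns=\"http://uima.apache.org/resourceSpecifier\">"
                + "<name>demo</name></typeSystemDescription>";
        String styleMap = "<styleMap><rule/></styleMap>";

        System.out.println(rootElementName(typeSystem) + " -> " + looksLikeTypeSystem(typeSystem));
        System.out.println(rootElementName(styleMap) + " -> " + looksLikeTypeSystem(styleMap));
    }
}
```

Read this way, the message says that a file in the jar's `desc/type` folder is not a type system descriptor at all. It does not by itself say that a HeidelTime class is missing.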