Question

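We need to extract a tree structure from a given text document using Java. The file type should be common and open (rtf, odt, ...). Currently we use Apache Tika to parse plain text out of multiple documents.

Which file type and API should we use to parse the correct structure most reliably? If Tika can do this, I would be happy to see a demonstration.

For example, we should get this kind of data from the given document: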

Main Heading
  Heading 1
    Heading 1.1
  Heading 2
    Heading 2.2

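The main heading is the title of the paper. The paper has two main headings, Heading 1 and Heading 2, and each of them has one subheading. We should also get the content under each heading (the paragraph text).

Any help is appreciated.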

Answer 1

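OpenDocument (.odt) is in practice a zip package containing multiple XML files. content.xml holds the actual textual content of the document. We are interested in the headings, and they can be found inside text:h tags. For details, see the OpenDocument (ODT) format documentation.

I found an implementation that extracts headings from .odt files using QueryPath.

Since the original question was about Java, here is a Java version. First we need to get access to content.xml using ZipFile. Then we use SAX to parse the XML content of content.xml. The sample code simply prints the file name, the entry name and size, and then every heading with its outline level: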

    Test3.odt
    content.xml
    3764
    1 My New Great Paper
    2 Abstract
    2 Introduction
    2 Content
    3 More content
    3 Even more
    2 Conclusions



Sample code:

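    import java.io.File;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class OdtDocumentStructureExtractor {

        public void printHeadingsOfOdtFIle(File odtFile) {
            // An .odt file is a zip archive; try-with-resources closes it when done.
            try (ZipFile zFile = new ZipFile(odtFile)) {
                System.out.println(zFile.getName());

                // content.xml holds the document body, including the headings.
                ZipEntry contentFile = zFile.getEntry("content.xml");
                System.out.println(contentFile.getName());
                System.out.println(contentFile.getSize());

                // Namespace-aware SAX reader (XMLReaderFactory is deprecated since Java 9).
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                XMLReader xr = factory.newSAXParser().getXMLReader();
                // Custom SAX handler that prints each heading it encounters.
                OdtDocumentContentHandler handler = new OdtDocumentContentHandler();
                xr.setContentHandler(handler);

                try (InputStream in = zFile.getInputStream(contentFile)) {
                    xr.parse(new InputSource(in));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            new OdtDocumentStructureExtractor().printHeadingsOfOdtFIle(new File("Test3.odt"));
        }
    }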
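
The printed lines are flat (outline level, then title, as in `2 Abstract`), while the question asks for a tree. One way to bridge that gap is to keep a stack of the currently open headings. The following is only a sketch under that assumption: `HeadingTree`, `Node`, `build`, and `render` are hypothetical names, and the input is assumed to be lines in the printed format with levels starting at 1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class HeadingTree {

    /** One heading plus its sub-headings. */
    static final class Node {
        final int level;
        final String title;
        final List<Node> children = new ArrayList<>();

        Node(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    /** Builds a tree from lines such as "2 Abstract" (level, space, title; level >= 1). */
    static Node build(List<String> lines) {
        Node root = new Node(0, "(document)"); // synthetic root above level 1
        Deque<Node> path = new ArrayDeque<>();
        path.push(root);
        for (String line : lines) {
            int space = line.indexOf(' ');
            int level = Integer.parseInt(line.substring(0, space));
            Node node = new Node(level, line.substring(space + 1));
            // Pop until the top of the stack is a shallower heading: that is the parent.
            while (path.peek().level >= node.level) {
                path.pop();
            }
            path.peek().children.add(node);
            path.push(node);
        }
        return root;
    }

    /** Appends the children of node to out, two spaces of indentation per depth. */
    static void render(Node node, int depth, StringBuilder out) {
        for (Node child : node.children) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(child.title).append('\n');
            render(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1 My New Great Paper", "2 Abstract", "2 Introduction",
                "2 Content", "3 More content", "3 Even more", "2 Conclusions");
        StringBuilder out = new StringBuilder();
        render(build(lines), 0, out);
        System.out.print(out);
    }
}
```

To also keep the paragraph text under each heading, as the question asks, `Node` could carry an extra text field that a handler fills from the text:p elements following each heading; that part is left out here.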


The relevant parts of the ContentHandler used look like this:



Sample code:
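
The handler listing itself is not reproduced here, so what follows is only a minimal sketch of what such a handler might look like, not the original author's code. It assumes a namespace-aware parser, matches text:h elements by namespace URI and local name, and reads the text:outline-level attribute for the level. The `getHeadings()` accessor and the inline XML fragment in `main` are illustrative additions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: records and prints "<outline-level> <heading text>" per text:h element.
class OdtDocumentContentHandler extends DefaultHandler {

    // Namespace bound to the "text" prefix in ODF content.xml.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    private final List<String> headings = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private String level;      // value of text:outline-level, may be null
    private boolean inHeading;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (TEXT_NS.equals(uri) && "h".equals(localName)) {
            inHeading = true;
            current.setLength(0);
            level = atts.getValue(TEXT_NS, "outline-level");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver the text of one element in several chunks, so accumulate.
        if (inHeading) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inHeading && TEXT_NS.equals(uri) && "h".equals(localName)) {
            inHeading = false;
            String line = (level == null ? "?" : level) + " " + current.toString().trim();
            headings.add(line);
            System.out.println(line);
        }
    }

    List<String> getHeadings() {
        return headings;
    }

    // Small demo on an inline fragment instead of a real .odt file.
    public static void main(String[] args) throws Exception {
        String xml = "<office:text"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"" + TEXT_NS + "\">"
                + "<text:h text:outline-level=\"1\">My Paper</text:h>"
                + "<text:p>Some paragraph text.</text:p>"
                + "<text:h text:outline-level=\"2\">Abstract</text:h>"
                + "</office:text>";
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        OdtDocumentContentHandler handler = new OdtDocumentContentHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
    }
}
```

Matching on the namespace URI rather than on the literal string "text:h" keeps the handler independent of whichever prefix a particular document happens to declare. Paragraph text (text:p) is ignored in this sketch.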
