Question

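Suppose I want to import a doc file, together with its metadata, into an HTML document and display it in a div accordingly. That means all existing content in the doc file: text in its various formats (bold, italic, different sizes, letter spacing, line height, overline, strikethrough ...), images (their position and size), graphics, charts (JSP will generate whatever is needed to render a similar graphic or chart; it only needs the data), lists, and so on.

Is there a way to do this? Is there a standardized Word API that can provide this data? Or any JSP library that can do it? If not, what do I need to know in order to do it myself?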

Answer 1

Take a look at the Apache POI project: http://poi.apache.org/text-extraction.html and at Apache Tika: http://tika.apache.org/

Answer 2

Five years later, here is the answer:

Note: this code only works for old Word 'doc' files (not docx). Apache POI can handle docx as well, but you have to use a different API.
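Because the doc and docx code paths differ, it can help to check which container format a file uses before choosing an API. The following is a minimal sketch using only the JDK, not part of the original answer: the class name `WordFormatSniffer` and its methods are made up for illustration. It relies on the fact that legacy doc files are OLE2 compound files (signature `D0 CF 11 E0 A1 B1 1A E1`), while docx files are ZIP archives (signature `PK\x03\x04`). The check identifies only the container, so an xls file or an arbitrary ZIP would match as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of Apache POI): guesses the container format
// of a Word file from its first bytes.
final class WordFormatSniffer {

    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (legacy doc, but also xls and ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive (docx, but also any other ZIP).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a buffer holding the first bytes of a file.
    static Format detectHeader(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.ZIP;
        }
        return Format.UNKNOWN;
    }

    // Reads up to 8 bytes from the stream and classifies them.
    // The stream is consumed, so reopen the file before parsing it.
    static Format detect(InputStream in) throws IOException {
        byte[] buffer = new byte[8];
        int read = 0;
        while (read < buffer.length) {
            int n = in.read(buffer, read, buffer.length - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        byte[] header = new byte[read];
        System.arraycopy(buffer, 0, header, 0, read);
        return detectHeader(header);
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(detect(in));
        }
    }
}
```

A result of `OLE2` suggests the doc code in this answer applies; `ZIP` suggests the file needs the docx API instead.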

Using Apache POI, the Maven dependency is:

<!-- https://mvnrepository.com/artifact/org.apache.poi/poi -->
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi</artifactId>
  <version>3.17</version>
</dependency>

Here is the code:

  ...
  import java.io.FileInputStream;
  import java.io.FileNotFoundException;
  import java.io.IOException;

  import org.apache.poi.hpsf.MarkUnsupportedException;
  import org.apache.poi.hpsf.NoPropertySetStreamException;
  import org.apache.poi.hpsf.PropertySet;
  import org.apache.poi.hpsf.SummaryInformation;
  import org.apache.poi.hpsf.UnexpectedPropertySetTypeException;
  import org.apache.poi.poifs.filesystem.DirectoryEntry;
  import org.apache.poi.poifs.filesystem.DocumentEntry;
  import org.apache.poi.poifs.filesystem.DocumentInputStream;
  import org.apache.poi.poifs.filesystem.POIFSFileSystem;

  public static void main(final String[] args) throws FileNotFoundException, IOException, NoPropertySetStreamException,
                  MarkUnsupportedException, UnexpectedPropertySetTypeException {
    try (final FileInputStream fs = new FileInputStream("src/test/word_template.doc");
        final POIFSFileSystem poifs = new POIFSFileSystem(fs)) {
      final DirectoryEntry dir = poifs.getRoot();
      // The summary information stream holds the document's core metadata.
      final DocumentEntry siEntry = (DocumentEntry) dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
      try (final DocumentInputStream dis = new DocumentInputStream(siEntry)) {
        final PropertySet ps = new PropertySet(dis);
        final SummaryInformation si = new SummaryInformation(ps);
        // Read word doc (not docx) metadata.
        System.out.println(si.getLastAuthor());
        System.out.println(si.getAuthor());
        System.out.println(si.getKeywords());
        System.out.println(si.getSubject());
        // ...
      }
    }
  }

To read the text content, you need an additional dependency:

<dependency>
  <!-- Required for HWPFDocument -->
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-scratchpad</artifactId>
  <version>3.17</version>
</dependency>

Code:

try (final HWPFDocument doc = new HWPFDocument(fs)) {
  return doc.getText().toString();
}
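Once the metadata and text have been extracted, they still have to be placed in a div, as the question asks. The following is a minimal sketch of that last step using only the JDK. The class `DocHtmlRenderer`, its methods, and the CSS class names are made up for illustration and are not part of Apache POI. It assumes the metadata has already been collected into a map and that paragraphs in the extracted text are separated by line breaks. Only plain text is rendered, so formatting, images, and charts are lost.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Apache POI): renders already-extracted
// metadata and plain text as an HTML fragment wrapped in a div.
final class DocHtmlRenderer {

    // Escapes the characters that are significant in HTML text and attributes.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Builds a div with a definition list for the metadata and one <p> per
    // non-blank line of text. Metadata entries with null values are skipped.
    static String toDiv(Map<String, String> metadata, String text) {
        StringBuilder html = new StringBuilder("<div class=\"doc\">\n");
        html.append("  <dl class=\"doc-meta\">\n");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (entry.getValue() == null) {
                continue;
            }
            html.append("    <dt>").append(escape(entry.getKey())).append("</dt>")
                .append("<dd>").append(escape(entry.getValue())).append("</dd>\n");
        }
        html.append("  </dl>\n");
        // Assumption: paragraphs are separated by \r, \n, or \r\n.
        for (String paragraph : text.split("\\r\\n|[\\r\\n]")) {
            if (paragraph.trim().isEmpty()) {
                continue;
            }
            html.append("  <p>").append(escape(paragraph)).append("</p>\n");
        }
        return html.append("</div>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Author", "Jane Doe");
        metadata.put("Subject", "Quarterly report");
        System.out.print(toDiv(metadata, "First paragraph\rSecond paragraph"));
    }
}
```

In a JSP page, the map could be filled from the `SummaryInformation` getters shown earlier and the text from `doc.getText()`. Escaping matters here because document content is untrusted input once it reaches the browser.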

How to read metadata from an old Word doc file
