Question

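I am trying to read a Microsoft Word 2003 document (.doc) using poi-scratchpad-3.8 (HWPF). I need to read the file either word by word or character by character; either way works for what I need. Once I have read a character or word, I need to get the name of the style applied to it. So the question is: how do I get the style name used for a word or character when reading a .doc file?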

Edit

I am adding the code I used to attempt this. If anyone wants to give it a try, good luck.

private void processDoc(String path) throws Exception {
    System.out.println(path);
    POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(path));
    HWPFDocument wdDoc = new HWPFDocument(fis);

    // list all style names and indexes in stylesheet
    for (int j = 0; j < wdDoc.getStyleSheet().numStyles(); j++) {
        if (wdDoc.getStyleSheet().getStyleDescription(j) != null) {
            System.out.println(j + ": " + wdDoc.getStyleSheet().getStyleDescription(j).getName());
        } else {
            // getStyleDescription returned null
            System.out.println(j + ": " + null);
        }
    }

    // set range for entire document
    Range range = wdDoc.getRange();

    // loop through all paragraphs in range
    for (int i = 0; i < range.numParagraphs(); i++) {
        Paragraph p = range.getParagraph(i);

        // only look up the style if its index falls within the stylesheet
        if (wdDoc.getStyleSheet().numStyles() > p.getStyleIndex()) {
            System.out.println(wdDoc.getStyleSheet().numStyles() + " -> " + p.getStyleIndex());
            StyleDescription style = wdDoc.getStyleSheet().getStyleDescription(p.getStyleIndex());
            String styleName = style.getName();
            // write style name and associated text
            System.out.println(styleName + " -> " + p.text());
        } else {
            System.out.println("\n" + wdDoc.getStyleSheet().numStyles() + " ----> " + p.getStyleIndex());
        }
    }
}

Answer 1

I would suggest looking at the source code of WordExtractor from Apache Tika, as it is a good example of getting text and styling from a Word document using Apache POI.

Based on what you did and didn't say in your question, I suspect you are looking for something like this:

    Range r = document.getRange();
    for(int i=0; i<r.numParagraphs(); i++) {
       Paragraph p = r.getParagraph(i);
       String text = p.text();
       if( ! text.contains("What I'm Looking For")) {
          // Try the next paragraph
          continue;
       }

       if (document.getStyleSheet().numStyles()>p.getStyleIndex()) {
          StyleDescription style =
               document.getStyleSheet().getStyleDescription(p.getStyleIndex());
          String styleName = style.getName();
          System.out.println(styleName + " -> " + text);
       }
       else {
          // Text has an unknown or invalid style
       }
    }
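
One thing to watch for: the listing loop in the question already anticipates getStyleDescription returning null for some indexes, so the bounds check alone may not be enough before calling getName(). Below is a minimal sketch of a guarded lookup. To keep it runnable without POI, a plain String[] stands in for the style sheet (a null entry plays the role of a null StyleDescription). StyleNames, nameOf and UNKNOWN are hypothetical names for illustration, not part of POI.

```java
public class StyleNames {
    /** Placeholder returned when no usable style exists (a hypothetical convention). */
    public static final String UNKNOWN = "(unknown style)";

    /**
     * names stands in for the style sheet: names[i] is the name of style i,
     * or null where the sheet has no description for that index.
     */
    public static String nameOf(String[] names, int styleIndex) {
        if (styleIndex < 0 || styleIndex >= names.length) {
            return UNKNOWN; // index points outside the sheet
        }
        String name = names[styleIndex];
        return name != null ? name : UNKNOWN; // slot exists but has no description
    }

    public static void main(String[] args) {
        String[] names = {"Normal", "Heading 1", null};
        for (int i = 0; i <= 3; i++) {
            System.out.println(i + ": " + nameOf(names, i));
        }
    }
}
```

With real HWPF objects the same two guards would go around getStyleDescription(p.getStyleIndex()) and the getName() call.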

For more advanced things, take a look at the WordExtractor source code to see what else can be done with this sort of thing!
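
For the word-by-word part of the question, one possible approach is to walk the character runs inside each paragraph (Range has numCharacterRuns() and getCharacterRun(int)) and split each run's text into words, tagging every word with that run's style name. How you obtain a run's style name may depend on your POI version, so the sketch below only shows the splitting step: the two arrays are stand-ins for the run texts and the style names you have already looked up, and StyledWords and wordsWithStyles are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class StyledWords {
    /**
     * Splits each run's text on whitespace and pairs every word with the
     * style name of the run it came from. Each result is {word, styleName}.
     */
    public static List<String[]> wordsWithStyles(String[] runTexts, String[] runStyles) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < runTexts.length; i++) {
            // trim() also drops the trailing paragraph mark (\r) that .doc text carries
            for (String word : runTexts[i].trim().split("\\s+")) {
                if (!word.isEmpty()) { // an empty run yields one empty token
                    out.add(new String[] {word, runStyles[i]});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] texts = {"Chapter One\r", "some  bold", ""};
        String[] styles = {"Heading 1", "Normal", "Normal"};
        for (String[] pair : wordsWithStyles(texts, styles)) {
            System.out.println(pair[1] + " -> " + pair[0]);
        }
    }
}
```

Note that a word whose formatting changes partway through will arrive as two runs and therefore come out as two separate entries here; merging those is left out of this sketch.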

Java Apache POI: reading a Word (.doc) file and getting the named styles used
