Question

I am trying to get the text contained in a Word document as a String. The code I tried, using the Apache POI API, is:

    FileInputStream fis = new FileInputStream(file.getAbsolutePath());
    HWPFDocument document = new HWPFDocument(fis);
    WordExtractor extractor = new WordExtractor(document);
    String fileData = extractor.getText();

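fileData should contain the data from the Word file.

However, I get some invalid characters that I want to eliminate. For example, the following text in the Word document: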

Project Name    Customer 360

Position        Software Engineer

appears as follows when printed to the Java console:

Project Name [?]Customer 360[?][?]Position \t [?]Software Engineer

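[?] stands for a question mark inside a small box. The symbol does not show up when I paste it here, so I use [?] to represent it.

I would like the output to be this instead: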

Project Name \t Customer 360 \n Position \t Software Engineer

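That would give me the tab and newline information, which I really need in order to process this text.

I know the tab and newline information exists, because I do get \t and \n in some places, but in other places they are missing.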

Answer 1

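It looks like you have some special fields applied to that text. Most likely it has links, special rules, form fields, etc. applied to it.

If you don't want all of that, you need to remove them with the stripFields(java.lang.String) method on WordExtractor, which leaves only the displayed text.

From the JavaDoc on that method:

Removes any fields (e.g. macros, page markers, etc.) from the string.

Your code would then be: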

    FileInputStream fis = new FileInputStream(file.getAbsolutePath());
    HWPFDocument document = new HWPFDocument(fis);
    WordExtractor extractor = new WordExtractor(document);
    // Raw text, still containing field codes and control characters
    String rawText = extractor.getText();
    // stripFields is a static method, so call it on the class
    String displayText = WordExtractor.stripFields(rawText);
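
If boxed characters remain after stripping fields, one possible explanation is that they are control characters rather than fields. The binary .doc format marks the end of table cells and rows with U+0007, and the sample output (one box between the cells, two at the end of the row) would fit that pattern, but this is an assumption that the question does not confirm. The sketch below is plain string processing with no POI calls; the class name ControlCharCleaner and the method normalize are hypothetical names chosen for illustration.

```java
public class ControlCharCleaner {

    // Assumption: U+0007 is the table cell/row mark found in the extracted text.
    private static final char CELL_MARK = '\u0007';

    /**
     * Turns a single cell mark into a tab, a doubled cell mark (assumed
     * end of a table row) into a newline, normalizes carriage returns,
     * and drops any other control characters.
     */
    public static String normalize(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean hasNext = i + 1 < raw.length();
            if (c == CELL_MARK) {
                if (hasNext && raw.charAt(i + 1) == CELL_MARK) {
                    out.append('\n'); // two marks in a row: end of table row
                    i += 2;
                } else {
                    out.append('\t'); // one mark: boundary between cells
                    i++;
                }
            } else if (c == '\r') {
                // Keep one newline for both a lone CR and a CR LF pair.
                if (!(hasNext && raw.charAt(i + 1) == '\n')) {
                    out.append('\n');
                }
                i++;
            } else if (c < 0x20 && c != '\t' && c != '\n') {
                i++; // drop other control characters
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Project Name\u0007Customer 360\u0007\u0007"
                   + "Position\u0007Software Engineer\u0007\u0007";
        System.out.print(normalize(raw));
    }
}
```

In the question's code this would be applied as normalize(displayText). To check whether the assumption holds for a given file, printing the numeric value of each unexpected character (for example with Integer.toHexString) shows which control characters are actually present.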
