Question

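I am trying to extract English text from a PDF and print it to the console. The extraction is done with the PdfTextExtractor class of the itextpdf API. The text I get is unintelligible. It may be some language issue that I am facing. My goal is to find specific text in the PDF and replace it with another string. I started by parsing the file to find the string. The following code snippet is my string extractor: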

Document document = new Document();

PdfWriter writer = PdfWriter.getInstance(document,
    new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(input);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {

    String str=PdfTextExtractor.getTextFromPage(reader, i); 
    System.out.println(str);  

}
document.close();

But even though the text in the PDF is English, the output on my console is unintelligible.

Output:

t cotenn dna o mntoafinir yales r ni et h layhcsip Amgteu end y Retila m eysts whh ethrs s wlli e erefcern emsyst o f et h se. ru I n tioi, dnda etseh orpvedi eddda e ulav o t taw h s i oelbssip hwti se vdcie ollaw na s tiouquibu cacess o t latoutenxc e rpap dna t ilagid ottennc olae n ewnh ey th krwo tofoi. nmirna ni soitaoli n mor f chea e. roth s iTh s i a cel ra csea ewerh "eth lweoh is ermo nath eth ms u u sti sti

rtasp".

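Can anyone help me with a possible solution to get the text in English, exactly as it appears in the source PDF? Any kind of help will be highly appreciated.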

Answer 1

If you want the text to be ordered according to its position on the page, you need to pass in a specific strategy, for example LocationTextExtractionStrategy:

for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Order the extracted text by its position on the page
    String str = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
    System.out.println(str);
}

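LocationTextExtractionStrategy sometimes produces odd sentences, more specifically when the letters on the page 'dance' (the baselines of the glyphs differ for text on the same line). In that case you can try SimpleTextExtractionStrategy, which returns the text in the order in which it appears in the content stream of the PDF.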
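
To make the difference between the two orderings concrete, here is a small plain-Java sketch. It does not use iText and is not how iText implements its strategies; the `ChunkOrderSketch` class, the `Chunk` type and the sample coordinates are all made up for illustration. It only shows the idea: joining text chunks in content-stream order can scramble the text, sorting them by position restores reading order, and a naive position sort breaks as soon as the baselines on one line differ slightly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ChunkOrderSketch {

    /** A piece of text plus the start of its baseline (hypothetical type, not an iText class). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the chunks in the order they were drawn, i.e. content-stream order. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /**
     * Sorts top to bottom (PDF y coordinates grow upward), then left to right,
     * and starts a new line whenever the baseline changes.
     * Assumption: baselines are compared exactly, which is deliberately naive.
     */
    static String byLocation(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null && prev.y != c.y) {
                sb.append('\n');
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Chunks listed in the (scrambled) order a producer might have drawn them.
        List<Chunk> page = Arrays.asList(
                new Chunk("world", 40, 700),
                new Chunk("Hello ", 10, 700),
                new Chunk("line", 45, 680),
                new Chunk("Second ", 10, 680));
        System.out.println(inStreamOrder(page));
        System.out.println(byLocation(page));

        // "Dancing" letters: same visual line, slightly different baselines.
        List<Chunk> dancing = Arrays.asList(
                new Chunk("Hello ", 10, 700.0),
                new Chunk("world", 40, 700.4));
        System.out.println(byLocation(dancing));
    }
}
```

For the first sample page, the stream order gives "worldHello lineSecond " while the position sort gives "Hello world" and "Second line" on separate lines. For the second sample, the 0.4 unit baseline offset makes the naive sort put "world" on its own line before "Hello ", which is the kind of odd result where falling back to content-stream order can be the better choice.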
