Question

I am parsing PDF files with Apache Tika to extract their text.

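import java.io.File;
import java.io.FileInputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

// Create a body content handler
BodyContentHandler handler = new BodyContentHandler();

// Metadata
Metadata metadata = new Metadata();

// Input file path
FileInputStream inputstream = new FileInputStream(new File(faInputFileName));

// Parse context, passed to the parser along with the InputStream
ParseContext pcontext = new ParseContext();

try
{
    // Parse the document using the PDF parser from Tika
    PDFParser pdfparser = new PDFParser();

    // Do the parsing by calling the parse function of pdfparser
    pdfparser.parse(inputstream, handler, metadata, pcontext);
}
catch (Exception e)
{
    // Report what went wrong instead of silently swallowing the exception
    System.out.println("Exception caught: " + e.getMessage());
}
String extractedText = handler.toString();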

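The code above works, and the text in the PDF is extracted.

The PDF file contains some special characters (such as @, &, £, or the trademark symbol). How can I remove these special characters during or after extraction?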

Answer 1

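PDF uses Unicode code points, so you may well have strings that contain surrogate pairs, combining forms (e.g. for diacritics), and so on, and you may wish to preserve these as their closest ASCII equivalent, e.g. normalise é to e. If so, you can do something like this:

import java.text.Normalizer;

String normalisedText = Normalizer.normalize(handler.toString(), Normalizer.Form.NFD);

If you are simply after ASCII text, then once the string is normalised you can filter what you get from Tika with a regular expression, as per this answer:

extractedText = normalisedText.replaceAll("[^\\p{ASCII}]", "");

However, since regular expressions can be slow (particularly on large strings), you may want to avoid the regex and do a simple substitution instead (as per this answer).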
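The regex-free substitution mentioned above could look roughly like the sketch below. It is only an illustration, not the code from the linked answer: the class name `AsciiFilter` and the method `toAscii` are hypothetical, and it assumes that "special character" means anything outside the 7-bit ASCII range once the text has been decomposed with NFD.

```java
import java.text.Normalizer;

public class AsciiFilter {

    // Hypothetical helper: decompose accented characters (NFD), then keep
    // only characters in the 7-bit ASCII range, without using a regex.
    static String toAscii(String input) {
        String normalised = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(normalised.length());
        for (int i = 0; i < normalised.length(); i++) {
            char c = normalised.charAt(i);
            if (c < 128) {
                // Plain ASCII: keep it. Combining marks, surrogates and
                // symbols such as £ or the trademark sign are skipped.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input: "café £100 Tika™" written with Unicode escapes
        System.out.println(toAscii("caf\u00e9 \u00a3100 Tika\u2122"));
    }
}
```

Note that @ and & are themselves ASCII, so neither this loop nor the `\p{ASCII}` regex removes them. If those need to go as well, the condition would have to be tightened, for example to letters, digits and whitespace only.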
