Question

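I have the byte array of a PDF file and want to extract the text from it. My code works, but I have to create an actual file first. Do you know a better way, so that I don't have to create this file first?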

try {
  File temp = File.createTempFile("temp-pdf", ".tmp");
  OutputStream out = new FileOutputStream(temp);
  out.write(Base64.decodeBase64(testObject.getPdfAsDoc().getContent()));
  out.close();
  PDDocument document = PDDocument.load(temp);
  PDFTextStripper pdfStripper = new PDFTextStripper();
  String text = pdfStripper.getText(document);
  log.info(text);
} catch(IOException e){

}

Answer 1

The answer depends on the PDFBox version you are using.

PDFBox 2.0.x

As soon as you have the byte[] (which you appear to get from Base64.decodeBase64), you can load it directly:

byte[] documentBytes = Base64.decodeBase64(testObject.getPdfAsDoc().getContent());
PDDocument document = PDDocument.load(documentBytes);

PDFBox 1.8.x

As soon as you have the byte[], you can load it via a ByteArrayInputStream:

byte[] documentBytes = Base64.decodeBase64(testObject.getPdfAsDoc().getContent());
InputStream documentStream = new ByteArrayInputStream(documentBytes);
PDDocument document = PDDocument.load(documentStream);
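
To make the in-memory path concrete without depending on PDFBox itself, here is a minimal sketch of the byte[] to InputStream step. It is only an illustration under stated assumptions: the class and method names are hypothetical, and it uses the JDK's java.util.Base64 rather than the Base64.decodeBase64 helper from the question. The header check merely confirms that the decoded bytes look like a PDF before they are handed to a loader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper class; not part of PDFBox.
public class InMemoryPdfBytes {

    /** Decodes Base64 text into raw bytes entirely in memory (JDK decoder, an assumption). */
    public static byte[] decode(String base64Content) {
        return Base64.getDecoder().decode(base64Content);
    }

    /** Wraps the bytes in a stream; no temporary file is involved. */
    public static InputStream asStream(byte[] documentBytes) {
        return new ByteArrayInputStream(documentBytes);
    }

    /** Returns true if the stream begins with the "%PDF-" signature. */
    public static boolean startsWithPdfHeader(InputStream in) throws IOException {
        byte[] expected = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] actual = new byte[expected.length];
        int read = in.read(actual, 0, actual.length);
        return read == expected.length && Arrays.equals(expected, actual);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in content: only a PDF header line, not a complete document.
        String encoded = Base64.getEncoder()
                .encodeToString("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));
        byte[] documentBytes = decode(encoded);
        System.out.println(startsWithPdfHeader(asStream(documentBytes)));
    }
}
```

The stream returned by asStream is the kind of object that a stream-based load call, such as the one shown above, would accept.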

loadNonSeq

As an aside: when using PDFBox 1.8.x, you should use the loadNonSeq overloads instead of load, because load does not load the PDF as specified and can therefore be fooled into reading it with the wrong contents. If the PDF file is broken, though, you can still try load as a fallback.
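
The "preferred loader first, other loader as fallback" idea can be sketched generically. This is only an illustration, not PDFBox code: FallbackLoader and StreamLoader are hypothetical names, and the two loaders are passed in as parameters so the sketch stays independent of any PDFBox version. With PDFBox 1.8.x one would presumably supply lambdas that call PDDocument.loadNonSeq and PDDocument.load; check the Javadoc of your version for the exact overloads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper; the loaders are supplied by the caller.
public class FallbackLoader {

    /** Stand-in for any "parse this stream" call that may throw IOException. */
    public interface StreamLoader<T> {
        T load(InputStream in) throws IOException;
    }

    /**
     * Tries the preferred loader first. If it fails with an IOException,
     * retries with the fallback loader on a fresh stream over the same bytes.
     */
    public static <T> T loadWithFallback(byte[] documentBytes,
                                         StreamLoader<T> preferred,
                                         StreamLoader<T> fallback) throws IOException {
        try {
            return preferred.load(new ByteArrayInputStream(documentBytes));
        } catch (IOException preferredFailure) {
            try {
                // A new stream is needed because the first one may be partly consumed.
                return fallback.load(new ByteArrayInputStream(documentBytes));
            } catch (IOException fallbackFailure) {
                // Keep the first failure visible for diagnosis.
                fallbackFailure.addSuppressed(preferredFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

Because the byte[] stays in memory, creating a second stream for the retry is cheap and still needs no temporary file.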
