I am trying to use pdf.js to extract plain text from a PDF document, but for some reason I cannot get past an "Invalid PDF structure" error.
My code is as follows:
const pdfjslib = require('pdfjs-dist');

// Remote PDF that triggers the error
const pdfPath = 'https://www.corenet.gov.sg/media/2268607/dc19-07.pdf';

const loadingTask = pdfjslib.getDocument(pdfPath);
loadingTask.promise
  .then((doc) => {
    // For now, just dump the loaded document proxy
    console.log(doc);
    return null;
  })
  .catch((err) => {
    console.log(err);
  });
I tried other PDF documents from the same domain, and they all raise the same error:
...
Warning: Ignoring invalid character "34" in hex string
Warning: Ignoring invalid character "104" in hex string
Warning: Indexing all PDF objects
{ Error
at InvalidPDFExceptionClosure (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:658:35)
at Object.<anonymous> (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:661:2)
at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
at Object.defineProperty.value (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:129:23)
at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
at pdfjsVersion (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:116:18)
at .../pdf_test/node_modules/pdfjs-dist/build/pdf.js:119:10
at webpackUniversalModuleDefinition (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:25:20)
at Object.<anonymous> (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:32:3)
at Module._compile (internal/modules/cjs/loader.js:776:30)
name: 'InvalidPDFException',
message: 'Invalid PDF structure' }
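PDFs from other domains seem to work. Note that downloading the PDF from the domain above works fine, and the file can be viewed in Chrome. I suspect the PDF documents are corrupted. I have not implemented any front-end code, because the code above is meant to be hosted in the cloud.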
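
One way to test the suspicion that the bytes are not a valid PDF is to look at what Node actually receives from the server, independently of pdf.js. The character codes in the warnings (34 is a double quote, 104 is "h") could be consistent with markup rather than PDF data being parsed, but that is only a guess. The sketch below is a rough diagnostic, not a fix: looksLikePdf, fetchBytes and diagnose are hypothetical helper names that do not come from pdf.js, and the 1024-byte search window is an assumption based on how lenient PDF readers commonly locate the "%PDF-" header.

```javascript
const https = require('https');

// Hypothetical helper (not part of pdf.js): returns true if the "%PDF-"
// marker appears within the first 1024 bytes of the data.
// The 1024-byte window is an assumption, not a guarantee of validity.
function looksLikePdf(bytes) {
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  return head.includes('%PDF-');
}

// Hypothetical helper: downloads a URL with Node's built-in https module
// and resolves with the status code, content type, and raw body.
function fetchBytes(url) {
  return new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        const chunks = [];
        res.on('data', (chunk) => chunks.push(chunk));
        res.on('end', () =>
          resolve({
            status: res.statusCode,
            contentType: res.headers['content-type'],
            body: Buffer.concat(chunks),
          })
        );
        res.on('error', reject);
      })
      .on('error', reject);
  });
}

// Hypothetical helper: prints what the server returned to a Node client.
async function diagnose(url) {
  const { status, contentType, body } = await fetchBytes(url);
  console.log('status:', status, 'content-type:', contentType);
  console.log('looks like a PDF:', looksLikePdf(body));
  console.log('first bytes:', body.subarray(0, 64).toString('latin1'));
}

// Usage (performs a network request, so it is left commented out):
// diagnose('https://www.corenet.gov.sg/media/2268607/dc19-07.pdf').catch(console.error);
```

If the first bytes turn out to be HTML or a redirect page rather than "%PDF-", the files themselves may be fine and the server may simply be responding differently to a non-browser client. If the header is present, the problem more likely lies in the document structure or in how pdf.js parses it.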