I am using PDFQuery to extract data from PDFs. It works for most PDFs.
Recently, for a few PDFs, I have been getting the following errors on some pages:
'ascii' codec can't encode character u'\u2019' in position 91: ordinal not in range(128)
'ascii' codec can't encode character u'\u2013' in position 29: ordinal not in range(128)
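For reference, the same class of error can be reproduced outside PDFQuery. This is only a minimal sketch, assuming the failure comes from text being encoded with the ASCII codec somewhere during page loading. U+2019 is a right single quotation mark and U+2013 is an en dash. The sample strings and the `try_ascii` helper are hypothetical and are not part of PDFQuery. The `u'...'` form in my messages comes from Python 2, while the sketch below is written for Python 3, so the wording of the message differs slightly.

```python
# Hypothetical sample strings; the real text comes from the PDF pages.
samples = ["it\u2019s", "2010\u20132012"]


def try_ascii(text):
    """Return None if text is pure ASCII, else the UnicodeEncodeError raised."""
    try:
        text.encode("ascii")
        return None
    except UnicodeEncodeError as exc:
        return exc


errors = [try_ascii(s) for s in samples]
for sample, err in zip(samples, errors):
    print(repr(sample), "->", err)
```

UnicodeEncodeError is a subclass of ValueError, which would explain why an `except ValueError` clause ends up catching these errors.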
My code looks like this:
import pdfquery

pdf = pdfquery.PDFQuery(pdf_file)
pages_in_pdf = pdf.doc.catalog['Pages'].resolve()['Count']
for i in range(0, pages_in_pdf):
    try:
        pdf.load(i)
        # logic
    except ValueError as e:
        print('Error on page number {0}. Error message is {1}'.format(i, e))