I am using pdfminer to convert a PDF to text, but some digits, some words, and characters such as '[' are missing from the output
# Note: these imports follow the older pdfminer3k-style API, as in my original code.
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams

path = r'C:\Users\User\Desktop\2002\1999-66.pdf'
out_path = r'C:/Users/User/Desktop/2002/3.txt'


def parse():
    # Open the PDF in binary mode and link the parser and document objects.
    with open(path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize()

        # Stop if the PDF does not permit text extraction.
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed

        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Append the text of every horizontal text box on every page to the file.
        with open(out_path, 'a') as f:
            for page in doc.get_pages():
                interpreter.process_page(page)
                layout = device.get_result()
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):
                        results = x.get_text()
                        print(results)
                        f.write(results)


if __name__ == '__main__':
    parse()
I expect the output to be:
[BP] Sergey Brin and Larry Page. Google search engine. http://google.stanford.edu.
[CGMP98] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient crawling through
URL ordering. In To Appear: Proceedings of the Seventh International Web Conference
(WWW 98), 1998.
[Gar95] Eugene Garfield. New international professional society signals the maturing of scientometrics and informetrics. The Scientist, 9(16), Aug 1995. http://www.the-scientist.
library.upenn.edu/yr1995/august/issi_950821.ht%ml.
[Gof71] William Goffman. A mathematical method for analyzing the growth of a scientific
discipline. Journal of the ACM, 18(2):173{185, April 1971.
But the actual output is:
BP
Sergey Brin and Larry Page. Google search engine. http:google.stanford.edu.
CGMP Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. E cient crawling through
url ordering. In To Appear: Proceedings of the Seventh International Web Conference
WWW , .
Gar
Eugene Gar eld. New international professional society signals the maturing of sciento-
metrics and informetrics. The Scientist, , Aug . http:www.the-scientist.
library.upenn.eduyr augustissi_ .html.
Gof William Go man. A mathematical method for analyzing the growth of a scienti c
discipline. Journal of the ACM, :, April .
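
To make it easier to see exactly which characters get lost, here is a minimal sketch that compares one expected line with the corresponding extracted line using the standard-library difflib module. The helper name dropped_characters is hypothetical and only for illustration. It does not touch pdfminer and only reports what is missing. Characters that were substituted rather than dropped would also be reported, so treat the result as a rough diagnostic.

```python
import difflib


def dropped_characters(expected: str, actual: str) -> str:
    """Return the characters of `expected` that have no counterpart in `actual`.

    Hypothetical helper for diagnosis only; it relies purely on difflib.
    """
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" means the span vanished; "replace" means it was changed.
        if tag in ("delete", "replace"):
            dropped.append(expected[i1:i2])
    return "".join(dropped)


# One line from the expected output and the matching line from the real output.
expected = "discipline. Journal of the ACM, 18(2):173{185, April 1971."
actual = "discipline. Journal of the ACM, :, April ."

missing = dropped_characters(expected, actual)
print(sorted(set(missing)))
```

For this line, everything that disappears is a digit, a parenthesis, or a brace, while the letters and the rest of the punctuation survive. The same pattern shows up in the other lines, where '[', ']', '/' and the "ffi"/"fi" letter groups are lost as well.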