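I'm trying to extract a specific portion of text from a PDF file, and I've used the PyPDF2 library to do so. However, when I run the script below, the content I want to grab is printed awkwardly in the console, with the spaces between words missing.
This is what I've written so far: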
import io
import PyPDF2
import requests
URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
res = requests.get(URL)
f = io.BytesIO(res.content)
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(0).extractText()
print(contents)
Output I'm getting:
ACCESSHEALTHCTConnecticutAllPayersClaimsDatabaseDATASUBMISSIONGUIDE
December5,2013
Version1.2(withclarifications)
Output I want to grab:
ACCESS HEALTH CT
Connecticut All Payers Claims Database
DATA SUBMISSION GUIDE
December 5, 2013
Version 1.2 (with clarifications)
Answer 0 (score: 3)
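This is a problem with PyPDF2: it does not read the newline characters. As an alternative, you can use pdftotext. It is simple and clean, and you can loop over the pages or extract a single page.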
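A minimal sketch of that approach, assuming the pdftotext Python package (a poppler wrapper, installed with pip install pdftotext) and the URL from the question. The helper names extract_pages and main are hypothetical, and the third-party imports sit inside the functions so the snippet still loads where the packages are missing.

```python
import io


def extract_pages(pdf_bytes, page_numbers=None):
    """Return the text of the given zero-based pages (all pages if None)."""
    # Assumption: the third-party "pdftotext" package is installed.
    import pdftotext

    # A pdftotext.PDF object behaves like a sequence of page strings.
    pdf = pdftotext.PDF(io.BytesIO(pdf_bytes))
    if page_numbers is None:
        return list(pdf)
    return [pdf[i] for i in page_numbers]


def main():
    # Download the PDF from the question and print its first page.
    import requests

    url = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
    res = requests.get(url)
    print(extract_pages(res.content, [0])[0])
```

Calling main() fetches the file and prints page 0. How closely the spacing matches the desired output depends on how the PDF itself was produced.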
Answer 1 (score: 2)
If installing other packages causes dependency issues, I suggest PDFMiner.
You can install it for Python 3.7 by running
pip install pdfminer.six
I have tested it, and it works on Python 3.7.
The code to get page 0 is as follows:
import io
import requests
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
res = requests.get(URL)
fp = io.BytesIO(res.content)
rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
page_no = 0
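# Walk the pages and render only the requested one into retstr.
for pageNumber, page in enumerate(PDFPage.get_pages(fp)):
    if pageNumber == page_no:
        interpreter.process_page(page)
# Read back the text that TextConverter wrote for that page.
data = retstr.getvalue()
print(data.strip())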
The advantage of PDFMiner is that it reads your pages directly and focuses entirely on getting and analyzing text data.
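For a single page, the same result can probably be reached with less setup. The sketch below assumes a pdfminer.six release that provides pdfminer.high_level.extract_text; the helper name extract_page_text is hypothetical, and the import is done inside the function so the snippet loads even without the package.

```python
import io


def extract_page_text(pdf_bytes, page_no=0):
    """Return the stripped text of one zero-based page of a PDF."""
    # Assumption: pdfminer.six is installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_text

    # page_numbers takes a list of zero-indexed pages to extract.
    text = extract_text(io.BytesIO(pdf_bytes), page_numbers=[page_no])
    return text.strip()
```

With the response from the question's script, extract_page_text(res.content) would return the text of page 0 as one string.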