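I am trying to extract some text from a specific region of a PDF. The PDF has 10 sections, and I want to extract everything under the section 8 heading. I have the following code, which extracts all the text in the PDF along with its coordinates, but I don't know how to filter it so that, once I have the coordinates, it gives me only the region I want.
Could someone help me write the code, or point me to somewhere on this forum that covers this?
Here is my code: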
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Keep the file open while the pages are processed; get_pages() reads lazily.
with open('1234.pdf', 'rb') as fp:
    pages = PDFPage.get_pages(fp, check_extractable=False)
    for page in pages:
        print('Processing next page...')
        interpreter.process_page(page)
        layout = device.get_result()
        for lobj in layout:
            if isinstance(lobj, LTTextBox):
                # bbox is (x0, y0, x1, y1): take the left edge and the top edge
                x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
                print('At %r is text: %s' % ((x, y), text))
Here is part of the output I want to extract:
At (50.4, 605.326) is text: 8. ClAIMS
At (48.0, 574.222) is text: (a) Transport Claims
At (48.0, 557.182) is text:
At (48.0, 539.76088) is text: Cover:
At (48.0, 522.7208800000001) is text: Period:
At (48.0, 504.72088) is text: Type:
At (48.0, 487.44088) is text: Premium:
At (133.46, 556.942) is text:
At (133.46, 539.64088) is text:
At (133.46, 522.6008800000001) is text:
At (133.46, 504.60088) is text:
At (133.46, 487.32088) is text:
At (169.46, 539.64088) is text: Transport - Liabilities
At (169.46, 522.6008800000001) is text: 03/07/2017 until 31/03/2018
At (169.46, 504.60088) is text: Mutual
At (169.46, 487.32088) is text: Minimum and Deposit Premium of USD 400,000 per annum
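One possible direction, as a minimal sketch rather than a tested solution: collect every text box as a (page_index, x, y, text) tuple, sort the tuples into reading order, and keep everything from the box that starts with "8." up to the box that starts with "9.". The function name extract_section and the tuple layout are hypothetical, made up for illustration. The sketch assumes a single-column layout, that section headings start with the section number followed by a dot and a space (as in "8. ClAIMS" above), and that no line inside section 8 itself starts with "9. ".

```python
import re
from typing import List, Tuple

# Hypothetical record shape: (page_index, x, y, text), where x is bbox[0]
# (left edge) and y is bbox[3] (top edge) of an LTTextBox.
Record = Tuple[int, float, float, str]


def extract_section(records: List[Record], section_no: int = 8) -> List[Record]:
    """Keep the boxes from the 'section_no.' heading up to the next heading."""
    # Reading order: page ascending, then top to bottom (PDF y grows upward,
    # so larger y comes first), then left to right.
    ordered = sorted(records, key=lambda r: (r[0], -r[2], r[1]))

    # Assumption: headings look like "8. ..." and the next one like "9. ...".
    start_pat = re.compile(r"^\s*%d\.\s" % section_no)
    end_pat = re.compile(r"^\s*%d\.\s" % (section_no + 1))

    section = []
    inside = False
    for rec in ordered:
        text = rec[3]
        if not text.strip():
            continue  # drop whitespace-only boxes such as the empty ones above
        if not inside:
            if start_pat.match(text):
                inside = True
                section.append(rec)
        elif end_pat.match(text):
            break  # reached the next section heading
        else:
            section.append(rec)
    return section
```

To feed it from the loop in my code, the pages could be numbered with enumerate(pages), each box appended to a list as (page_no, lobj.bbox[0], lobj.bbox[3], lobj.get_text()) instead of being printed, and extract_section(records, 8) called once after the loop. If section 8 is the last section that matters, the end pattern simply never matches and everything after the heading is kept.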
Please help! This is driving me crazy.