I want to extract this text:
DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil
from a PDF file. I was able to extract some text between two markers using the code below:
import PyPDF2

pdfFileObj = open('SPE7M518T446E.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj1 = pdfReader.getPage(0)
pagecontent = pageObj1.extractText()

def between(value, a, b):
    # Find and validate before-part.
    pos_a = value.find(a)
    if pos_a == -1: return ""
    # Find and validate after part.
    pos_b = value.rfind(b)
    if pos_b == -1: return ""
    # Return middle part.
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b: return ""
    return value[adjusted_pos_a:pos_b]

desired = between(pagecontent,"5. ","8. ")
print(desired)
The code above outputs:
20
REQUEST FOR QUOTATIONSTHIS RFQ IS IS NOT A SMALL BUSINESS SET-ASIDE 4. CERT.FOR NAT. DEF. UNDER BDSA REG. 2 AND/OR DMS REG. 15. ISSUED BY7. DELIVERY 9. DESTINATION10. PLEASE FURNISH QUOTATIONS TO THE ISSUING OFFICE IN BLOCK 5 ON OR BEFORE CLOSE OF BUSINESS (Date)IMPORTANT: This is a request for information, and quotations furnished are not offers. If you are unable to quote, please so indicate on this form and return it to the address in Block 5. This request does not commit the Government to pay any costs incurred in the preparation of the submission of this quotation or to contract for supplies or services. Supplies are of domestic origin unless otherwise indicated by quoter. Any representations and/or certifications attached to this Request for Quotations must be completed by the quoter.11. SCHEDULE (See Continuation Sheets) 12. DISCOUNT FOR PROMPT PAYMENTd. CALENDAR DAYSNUMBERPERCENTAGE NOTE: Additional provisions and representations are are not attached.13. NAME AND ADDRESS OF QUOTERa. NAME OF QUOTER16. SIGNERAUTHORIZED FOR LOCAL REPRODUCTION Previous edition not useableSTANDARD FORM 18 (REV. 6-95) Prescribed by GSA-FAR (48 CFR) 53.215-1(a) SPE7M5-18-T-446E1. REQUEST NO.2018 APR 302. DATE ISSUED00739229623. REQUISITION/PURCHASE REQUEST NO.DO-C9RATINGDLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil175 DAYS ADO 6. DELIVER BY (Date)8. TO: c. CITYd. STATE b. STREET ADDRESS a. NAME OF CONSIGNEEe. ZIP CODE a. 10 CALENDAR DAYS (%)b. 20 CALENDAR DAYS (%) c. 30 CALENDAR DAYS (%)15. Date of Quotationa. NAME (Type or Print)
AREA CODEc. TITLE (Type or Print)d. CITY c. COUNTY b. STREET ADDRESSe. STATE f. ZIP CODESee Schedule2018 MAY 10NUMBERFOB DESTINATIONOTHER (See Schedule)CAGE b. TELEPHONE PAGE OF PAGES1
POC INFORMATION:
WHEN TECHNICAL DATA IS PROVIDED IT MUST BE OBTAINED AT:https://pcf1x.bsm.dla.mil/cfolders. DISCREPANCIES FOUND IN TECHNICAL DATA SHOULD SUBMIT
REQUEST TO THE DLA CUSTOMER SERVICE WEBSITE:https://www.pdmd.dla.mil/cs/
ALL OTHER QUESTIONS (SOLICITATION REQUIREMENTS, ITEM DESCRIPTION, AWARD CHOICE, ETC.), PLEASE CONTACT THE BUYER SHOWN ABOVE.
QUESTIONS REGARDING OPERATION OF THE DLA-BSM INTERNET BID BOARD SYSTEM SHOULD BE E-MAILED TO: DibbsBSM@dla.mil
FOR IMMEDIATE ASSISTANCE, PLEASE REFER TO THE FREQUENTLY ASKED QUESTIONS (FAQS) ON BSM DIBBS AT:
https://www.dibbs.bsm.dla.mil/Refs/help/DIBBSHelp.htm OR PHONE 1-855-DLA-0001 (1-855-352-0001).
MASTER SOLICITATION
THIS SOLICITATION INCORPORATES THE TERMS AND CONDITIONS SET FORTH IN THE DLA MASTER SOLICITATION FOR AUTOMATED SIMPLIFIED
ACQUISITIONS REVISION 46 (FEBRURARY 7, 2018) WHICH CAN BE FOUND ON THE WEB AT:
http://www.dla.mil/Portals/104/Documents/J7Acquisition/Master%20Solicitation%20Rev-46%20February-7-2018.pdf?ver=2018-02-08-063754-70
This solicitation incorporates technical/quality requirements (‚R™ or ‚I™ number in section B). The full text is in the DLA Technical and Quality Master List of Requirements at:
http://www.dla.mil/HQ/Acquisition/Offers/eprocurement.aspx The revisionof the TQ Master in effect on the award date controls.14. SIGNATURE OF PERSON AUTHORIZED TO SIGN QUOTATION 1 20
###################
ISSUED BY7. DELIVERY 9. DESTINATION10. PLEASE FURNISH QUOTATIONS TO THE ISSUING OFFICE IN BLOCK 5 ON OR BEFORE CLOSE OF BUSINESS (Date)IMPORTANT: This is a request for information, and quotations furnished are not offers. If you are unable to quote, please so indicate on this form and return it to the address in Block 5. This request does not commit the Government to pay any costs incurred in the preparation of the submission of this quotation or to contract for supplies or services. Supplies are of domestic origin unless otherwise indicated by quoter. Any representations and/or certifications attached to this Request for Quotations must be completed by the quoter.11. SCHEDULE (See Continuation Sheets) 12. DISCOUNT FOR PROMPT PAYMENTd. CALENDAR DAYSNUMBERPERCENTAGE NOTE: Additional provisions and representations are are not attached.13. NAME AND ADDRESS OF QUOTERa. NAME OF QUOTER16. SIGNERAUTHORIZED FOR LOCAL REPRODUCTION Previous edition not useableSTANDARD FORM 18 (REV. 6-95) Prescribed by GSA-FAR (48 CFR) 53.215-1(a) SPE7M5-18-T-446E1. REQUEST NO.2018 APR 302. DATE ISSUED00739229623. REQUISITION/PURCHASE REQUEST NO.DO-C9RATINGDLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil175 DAYS ADO 6. DELIVER BY (Date)
How can I extract the following text from the PDF file?
DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil
Answer 0 (score: 1)
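The PDF reader does not give you much room to interact with the data structure it returns. You can, however, add a new function to it that returns each text element as a separate item in a list. You can then at least extract the data between two items. This approach is still not foolproof, as you still need to decide on suitable termination cases: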
import itertools

import PyPDF2
from PyPDF2.pdf import PageObject, ContentStream, b_, TextStringObject


def extractTextList(self):
    # Return the page's text as a list, one item per text-showing operand.
    text_list = []
    content = self["/Contents"].getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    for operands, operator in content.operations:
        if operator == b_("Tj"):
            # Show a single text string.
            _text = operands[0]
            if isinstance(_text, TextStringObject) and len(_text.strip()):
                text_list.append(_text.strip())
        elif operator == b_("T*"):
            # Move to the next line; carries no text.
            pass
        elif operator == b_("'"):
            # Move to the next line and show a text string.
            _text = operands[0]
            if isinstance(_text, TextStringObject) and len(_text):
                text_list.append(_text)
        elif operator == b_('"'):
            # Set spacing, move to the next line and show the third operand.
            _text = operands[2]
            if isinstance(_text, TextStringObject) and len(_text):
                text_list.append(_text)
        elif operator == b_("TJ"):
            # Show an array of strings mixed with positioning numbers.
            for i in operands[0]:
                if isinstance(i, TextStringObject) and len(i):
                    text_list.append(i)
    return text_list


# Attach the new function to PyPDF2's page class as a method.
PageObject.extractTextList = extractTextList


def between(text_elements, drop_while, take_while):
    # Skip up to the start marker, keep up to the end marker, drop the start marker.
    return list(itertools.takewhile(take_while, itertools.dropwhile(drop_while, text_elements)))[1:]


pdfFileObj = open('SPE7M518T446E.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
page0 = pdfReader.getPage(0)
text_elements = page0.extractTextList()
lines = between(text_elements, lambda x: x != 'RATING', lambda x: 'DAYS' not in x)
print('\n'.join(lines))
This gives you the required lines, which are then joined into a single output as follows:
DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil
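The new function extractTextList() returns a list of the text elements found on the page, which I then process with itertools.dropwhile() and itertools.takewhile().
The between() function works in two stages. First, dropwhile() reads the list one string at a time and discards each one until the first test fails (i.e. until it finds RATING). It then starts passing elements on to takewhile(), which keeps yielding elements until it finds one containing the word DAYS. list() is used to build the filtered list. I then drop the first element (as it is the word RATING).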
In effect, this is an iterative way of slicing a list.
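To see the drop-then-take behaviour in isolation, here is a minimal sketch run on a hand-written list. The sample_elements values are made up for illustration (they only mimic the kind of list extractTextList() might return), and no PDF is read.

```python
import itertools

# Hypothetical sample data standing in for the list returned by extractTextList().
sample_elements = [
    "REQUEST NO.",
    "RATING",
    "DLA LAND AND MARITIME",
    "PO BOX 3990",
    "175 DAYS ADO",
    "DELIVER BY (Date)",
]


def between(text_elements, drop_while, take_while):
    # Skip elements while drop_while is true; the first element kept is the start marker.
    remaining = itertools.dropwhile(drop_while, text_elements)
    # Keep elements while take_while is true; stop at the first one that fails.
    kept = itertools.takewhile(take_while, remaining)
    # Drop the start marker itself.
    return list(kept)[1:]


lines = between(sample_elements, lambda x: x != "RATING", lambda x: "DAYS" not in x)
print(lines)
```

If the start marker never appears, dropwhile() consumes the whole list and the result is simply an empty list, so a missing marker does not raise an error.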
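Note: lambda is just another way of defining a function. Here each one takes a text element called x. The first returns True as long as x is not RATING, and the second returns True as long as the word DAYS does not appear anywhere in x. The two itertools functions call these lambda functions for each element in the list.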
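As a small illustration of that point, the two lambdas can be written as ordinary named functions with the same result. The function names and the sample list below are hypothetical and exist only for this sketch.

```python
import itertools


def is_before_start(x):
    # Same test as: lambda x: x != 'RATING'
    return x != "RATING"


def is_before_end(x):
    # Same test as: lambda x: 'DAYS' not in x
    return "DAYS" not in x


# Hypothetical sample data; not taken from a real extraction.
sample_elements = ["DO-C9", "RATING", "USA", "175 DAYS ADO", "8. TO:"]

with_named = list(
    itertools.takewhile(is_before_end, itertools.dropwhile(is_before_start, sample_elements))
)[1:]
with_lambdas = list(
    itertools.takewhile(
        lambda x: "DAYS" not in x,
        itertools.dropwhile(lambda x: x != "RATING", sample_elements),
    )
)[1:]

print(with_named == with_lambdas, with_named)
```

Named functions can be easier to reuse and test when the start and stop conditions grow beyond a single comparison.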