Extracting text from a formatted PDF using Python

Date: 2019-07-06 19:15:00

Tags: python python-3.x parsing pdf pypdf2

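I need to parse a formatted PDF to extract some fields. The PDF is here, and this imgur image shows what I need to parse. I have used PyPDF2 to get the text, but it returns raw text with none of the formatting.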

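import PyPDF2

# Open the PDF in binary mode and print the text of the first page.
# PdfFileReader/getPage/extractText are the legacy PyPDF2 names (pre-3.0).
with open('GPO-PLUMBOOK-2000-4-1.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())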

The output I get is as follows:

LEGISLATIVE BRANCHLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresARCHITECT OF THE CAPITOLAlan M. HantmanWashington, DCArchitect of the Capitol10 years02/02/07IIIEXPASLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGENERAL ACCOUNTING OFFICEDavid M. WalkerWashington, DCComptroller General of the United States11/09/1315 years$141,300OTPASVacant  Do...........Deputy Comptroller General of the United States..................OTXSLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGOVERNMENT PRINTING OFFICEMichael F. DiMarioWashington, DCPublic Printer............IIIEXPASRobert T. Mansker  Do...........Deputy Public Printer............IVEXXSFrancis J. Buckley, Jr.  Do...........Superintendent of Documents..................SLXSRobert G. Andary  Do...........Inspector General..................SLXSMary Beth Lawler  Do...........Staff Assistant............14OTSCLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresLIBRARY OF CONGRESSLIBRARIAN OF CONGRESSJames H. BillingtonWashington, DCLibrarian of Congress............IIIEXPASLIBRARY OF CONGRESS TRUST FUND BOARDJames H. 
Billington  Do...........Chairman (Ex-Officio)..................WCPASTed Stevens  Do...........Chairman of the Joint Committee of the Library (Ex-Officio)..................WCXSLawrence Summers  Do...........Member (Ex-Officio), Secretary of the Treasury..................WCPASDonald Hammond  Do...........Member (Designee for the Secretary of the Treasurer)..................WCXSCeil Pulitzer  Do...........Member5 years03/23/03......WCPASNajeeb Halaby  Do...........Member5 years08/31/05......WCPASJohn Kluge  Do...........Member5 years03/10/03......WCXSWayne Berman  Do...........Member5 years12/22/01......WCXSEdwin Cox  Do...........Member5 years03/31/04......WCXSJohn Henry  Do...........Member5 years12/22/03......WCXSDonald Jones  Do...........Member5 years10/08/02......WCXSJulie Finley  Do...........Member5 years06/29/01......WCXSBernard Rappaport  Do...........Member5 years12/22/01......WCXS(1)

I need to separate the data by column, for example the values under the Location column, and so on for the other columns.
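To make the problem concrete: the only consistent delimiter in the output above is the column header, which is repeated before each agency. The following is a minimal sketch, assuming that header is emitted identically every time; `HEADER`, `split_blocks`, and `sample` are names introduced here for illustration. It only separates the per-agency blocks and does not recover the individual columns.

```python
# Header text exactly as it appears, run together, in the PyPDF2 output above
HEADER = ("LocationPosition TitleName of IncumbentPayPlanType ofAppt."
          "Level,Grade, orPayTenureExpires")


def split_blocks(raw_text):
    """Split the flat page text into chunks, one per repeated table header."""
    parts = raw_text.split(HEADER)
    # parts[0] is whatever precedes the first header (the page title)
    return parts[0], [p.strip() for p in parts[1:]]


# Shortened sample copied from the output above
sample = (
    "LEGISLATIVE BRANCH" + HEADER
    + "ARCHITECT OF THE CAPITOLAlan M. HantmanWashington, DC"
    + "Architect of the Capitol10 years02/02/07IIIEXPAS" + HEADER
    + "GENERAL ACCOUNTING OFFICEDavid M. WalkerWashington, DC"
)

title, blocks = split_blocks(sample)
print(title)
for block in blocks:
    print(block)
```

Within each block the cell values are still fused together, which is why a table-aware extractor is needed to get real columns.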

1 Answer:

Answer 0: (score: 0)

Take a look at the tabula library (here is the GitHub). It returns a pandas DataFrame.

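import tabula  # from the tabula-py package; requires Java

# Read page 1 (newer tabula-py releases may return a list of DataFrames)
df = tabula.read_pdf("/home/michael/Downloads/GPO-PLUMBOOK-2000-4-1.pdf", pages=1)
df.dropna(inplace=True)  # drop rows that contain missing values
print(df[:2])  # show the first two rows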

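If you need to read other tables, or want to save time, you can also restrict which part of the PDF is used. That way you can read the PDF table and slice the data into the output you need.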
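Restricting the region can be sketched as follows. This is a rough example, not a tested recipe for this particular PDF: `area_in_points` and `read_region` are hypothetical helper names, and the region numbers in the example call are placeholders to adjust. It relies on the `area` argument of `tabula.read_pdf`, which takes `[top, left, bottom, right]` in PDF points (72 points per inch).

```python
def area_in_points(top_in, left_in, bottom_in, right_in):
    """Convert a page region given in inches to PDF points (72 per inch)."""
    return [v * 72.0 for v in (top_in, left_in, bottom_in, right_in)]


def read_region(path, page, area):
    """Read the tables found in one region of one page with tabula-py.

    `area` is [top, left, bottom, right] in PDF points.
    """
    import tabula  # imported lazily; needs the tabula-py package and Java

    # multiple_tables=True asks for a list of DataFrames, one per table found
    return tabula.read_pdf(path, pages=page, area=area, multiple_tables=True)


# Example call (hypothetical region; adjust the numbers to your table):
# tables = read_region("GPO-PLUMBOOK-2000-4-1.pdf", 1,
#                      area_in_points(1.0, 0.5, 10.0, 8.0))
# print(tables[0][:2])
```

Narrowing the area means tabula parses less of the page, and it keeps titles or footnotes outside the region out of the resulting DataFrame.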