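I have to parse a formatted PDF to extract some fields. The PDF is here. This imgur shows what I need to parse. I have used PyPDF2 to get the text, but it returns raw text with no formatting at all.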
import PyPDF2

# Open the PDF in binary mode and print the raw text of the first page
with open('GPO-PLUMBOOK-2000-4-1.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())
The output I get is as follows:
LEGISLATIVE BRANCHLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresARCHITECT OF THE CAPITOLAlan M. HantmanWashington, DCArchitect of the Capitol10 years02/02/07IIIEXPASLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGENERAL ACCOUNTING OFFICEDavid M. WalkerWashington, DCComptroller General of the United States11/09/1315 years$141,300OTPASVacant Do...........Deputy Comptroller General of the United States..................OTXSLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGOVERNMENT PRINTING OFFICEMichael F. DiMarioWashington, DCPublic Printer............IIIEXPASRobert T. Mansker Do...........Deputy Public Printer............IVEXXSFrancis J. Buckley, Jr. Do...........Superintendent of Documents..................SLXSRobert G. Andary Do...........Inspector General..................SLXSMary Beth Lawler Do...........Staff Assistant............14OTSCLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresLIBRARY OF CONGRESSLIBRARIAN OF CONGRESSJames H. BillingtonWashington, DCLibrarian of Congress............IIIEXPASLIBRARY OF CONGRESS TRUST FUND BOARDJames H. 
Billington Do...........Chairman (Ex-Officio)..................WCPASTed Stevens Do...........Chairman of the Joint Committee of the Library (Ex-Officio)..................WCXSLawrence Summers Do...........Member (Ex-Officio), Secretary of the Treasury..................WCPASDonald Hammond Do...........Member (Designee for the Secretary of the Treasurer)..................WCXSCeil Pulitzer Do...........Member5 years03/23/03......WCPASNajeeb Halaby Do...........Member5 years08/31/05......WCPASJohn Kluge Do...........Member5 years03/10/03......WCXSWayne Berman Do...........Member5 years12/22/01......WCXSEdwin Cox Do...........Member5 years03/31/04......WCXSJohn Henry Do...........Member5 years12/22/03......WCXSDonald Jones Do...........Member5 years10/08/02......WCXSJulie Finley Do...........Member5 years06/29/01......WCXSBernard Rappaport Do...........Member5 years12/22/01......WCXS(1)
I need to separate the data by column, for example the data under the Location column, and so on.
Answer 0 (score: 0)
Take a look at the tabula library (here on GitHub). It returns a pandas DataFrame.
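import tabula  # installed as the tabula-py package; requires Java
# Assumes read_pdf returns a single DataFrame; newer tabula-py releases return a list, so take its first element there
df = tabula.read_pdf("/home/michael/Downloads/GPO-PLUMBOOK-2000-4-1.pdf", pages=1)
df.dropna(inplace=True)
print(df[:2])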
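If you need to read other tables or want to save time, you can also restrict which part of the PDF is used. This way I was able to read the PDF table and slice the data into the output I needed.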
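As a rough sketch of restricting the region: tabula-py's read_pdf accepts a pages argument and an area argument given as [top, left, bottom, right] in PDF points. The helper names inches_to_area and read_region below are hypothetical, and the rectangle in the usage comment is a made-up placeholder that would have to be measured on the actual page.

```python
POINTS_PER_INCH = 72  # PDF coordinates are measured in points, 72 per inch


def inches_to_area(top, left, bottom, right):
    """Convert a rectangle in inches to [top, left, bottom, right] in points."""
    return [float(v) * POINTS_PER_INCH for v in (top, left, bottom, right)]


def read_region(pdf_path, page, area_inches):
    """Read only one rectangle of one page (hypothetical helper, not part of tabula)."""
    import tabula  # imported lazily; provided by the tabula-py package, needs Java

    return tabula.read_pdf(pdf_path, pages=page, area=inches_to_area(*area_inches))


# Example call; the rectangle is a placeholder, not measured from the real PDF:
# tables = read_region("GPO-PLUMBOOK-2000-4-1.pdf", 1, (1.0, 0.5, 10.0, 8.0))
```

Depending on the installed tabula-py version, the result is either a single DataFrame or a list of DataFrames, so check its type before selecting a column such as Location.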