Iterating over multiple lists from top to bottom

Time: 2018-09-05 05:56:07

Tags: regex python-3.x loops lxml itertools

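I need help iterating over multiple lists built from XPath expressions and parsing the data in the order it appears in the document, from top to bottom.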

I converted a PDF file to an XML file; here is the link to the PDFXML.

I tried this code in Python:

import itertools

from lxml import etree, html

# Read the XML produced from the PDF.
with open('Document 2.xml') as f:
    source = f.read()

asd = html.fromstring(source.encode('utf8'))

# Each XPath call returns its own independent list; zip_longest pairs the
# lists by index and pads the shorter ones with None.
for i in itertools.zip_longest(asd.xpath('//text[@left="54" and @font="6" and @height="28"]//b//text()'),
                               asd.xpath('//text[@left="54" and @font="10" and @height="25"]//b//text()'),
                               asd.xpath('//text[@left="54" and @font="12" and @height="18"]//b//text()'),
                               asd.xpath('//text[@left="54" and @font="0" and @height="15"]//b//preceding-sibling::text()'),
                               asd.xpath('//text[@left="54" and @font="0" and @height="15"]//b//text()')):
    print(i)

My output is:

('Management', 'Executive Committee', 'Management', 'Managing Partner ', 'Charles P. Adams, Jr.')
('Practice Groups', 'Associates', 'Management', 'Chairman ', 'Victor H. Lott, Jr.')
('Administration', 'Claims Counsel and Loss Prevention', 'Management', 'Partner ', 'Holmes S. Adams')
('U.S. Offices', 'Clients', 'Management', 'Partner ', 'John M. Duck')
(None, 'Community Service/Pro Bono', 'Management', 'Partner ', 'William B. Gaudet')
(None, 'Financial', 'Management', 'Partner ', 'M. Ann Huckstep')
(None, 'Governance', 'Management', 'Partner ', 'Francis V. Liantonio, Jr.')
(None, 'Hiring Committee', None, 'Partner ', 'Deborah B. Rouen')
(None, 'Human Resources', None, 'Partner ', 'Charles A. Cerise, Jr.')
(None, 'Liaison Partner for Administration', None, 'Partner ', 'Martin A. Stern')
(None, 'Litigation', None, 'Liaison Partner ', 'William J. Kelly, III')
(None, 'Special Business Services', None, 'Partner ', 'Robert A. Vosbein')
(None, 'Transactions and Corporate Advisory', None, 'Partner ', 'Edward J. Rice, Jr.')
(None, 'Adams and Reese/Lange Simpson', None, 'Partner ', 'Glen M. Pilie')
(None, 'Mobile, AL Office', None, 'Partner ', 'William J. Kelly, III')
(None, 'Washington, DC Office', None, 'Partner ', 'Leslie A. Lanusse')
(None, 'Baton Rouge, LA Office', None, 'Partner ', 'James W. Young, Jr.')
(None, 'New Orleans, LA Office', None, 'Practice Group Leader ', 'Mark R. Beebe')
(None, 'Jackson, MS Office', None, 'Partner ', 'Thomas G. O’Brien')
(None, 'Houston, TX Office', None, 'Partner ', 'Stephen A. Rowe')
(None, None, None, 'Practice Group Leader ', 'Robin B. Cheatham')
(None, None, None, 'Practice Group Leader ', 'Craig G. Townsend')
(None, None, None, 'Practice Group Leader ', 'Mark W. Coffin')
(None, None, None, 'Practice Group Leader ', 'Powell G. Ogletree, Jr.')
(None, None, None, 'Chief Administrative Officer ', 'Robert M. Shofstahl')
(None, None, None, 'Chief Financial Officer ', 'Paul J. Lassalle')
(None, None, None, 'Chief Information Officer ', 'David G. Erwin, Jr.')
(None, None, None, 'Chief Marketing Officer ', 'Ann M. Wallace')
(None, None, None, 'Human Resources Director ', 'Linda Soileau')
(None, None, None, 'Purchasing Manager ', 'Morris Green')
(None, None, None, 'Recruiting Manager ', 'Ami D. Orr')
(None, None, None, 'Special Events Coordinator ', 'Teresa Lauga')
(None, None, None, 'Manager of Library Services ', 'Amy Schwarzenbach')
(None, None, None, 'Partner-in-Charge ', 'Joe A. Joseph')
(None, None, None, 'Partner-in-Charge ', 'W. David Johnson')
(None, None, None, 'Partner-in-Charge ', 'B. Jeffrey Brooks')
(None, None, None, 'Partner-in-Charge ', 'Claire Babineaux-Fontenot')
(None, None, None, 'Partner-in-Charge ', 'Edwin C. Laizer')
(None, None, None, 'Partner-in-Charge ', 'Mark C. Surprenant')
(None, None, None, 'Partner-in-Charge ', 'A. Jerry Sheldon')
(None, None, None, 'Partner-in-Charge ', 'Walter J. Cicack')
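
As far as I can tell, the None values above come from how itertools.zip_longest works: it pairs items purely by position in each list and pads the shorter lists with None, so it knows nothing about where each item sat in the document. A minimal sketch with made-up placeholder lists (sections and people are hypothetical names, not data from the XML):

```python
from itertools import zip_longest

# Placeholder lists standing in for two of the XPath results.
sections = ["Section A", "Section B"]
people = ["Person 1", "Person 2", "Person 3"]

# Items are matched by index only; the shorter list is padded with None.
paired = list(zip_longest(sections, people))
print(paired)
```

Here "Person 2" is paired with "Section B" simply because both are at index 1, whether or not that person appears under that section in the document.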

But I need output like this: GOOGLE_SPREADSHEET

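So basically, I want the XPath to run from top to bottom and return None for any of the sections at indices [0, 1, 2] that has no entry at that position. If you look at the PDF file, you will see what I am trying to achieve.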

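I am not sure I have explained my requirement properly; please comment if anything is unclear. I am also open to a solution that uses regular expressions.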
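
To make the idea of "running from top to bottom" more concrete, here is a rough sketch of a single pass over the text elements in document order, tracking the most recent heading at each level. Several things here are assumptions: the sample XML is invented and only mimics the left, font and height attributes from my XPath expressions; LEVELS, PERSON and collect_rows are hypothetical names; and resetting the lower levels to None when a higher-level heading appears is my guess at the desired behaviour. It uses the standard library's xml.etree.ElementTree so that it runs on its own; lxml.etree elements offer the same iter, get and find methods.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the attribute pattern follows the real file.
SAMPLE = """<pdf2xml><page number="1">
<text left="54" font="6" height="28"><b>Section A</b></text>
<text left="54" font="10" height="25"><b>Group A1</b></text>
<text left="54" font="0" height="15">Role 1 <b>Person 1</b></text>
<text left="54" font="10" height="25"><b>Group A2</b></text>
<text left="54" font="12" height="18"><b>Subgroup A2a</b></text>
<text left="54" font="0" height="15">Role 2 <b>Person 2</b></text>
<text left="54" font="6" height="28"><b>Section B</b></text>
<text left="54" font="0" height="15">Role 3 <b>Person 3</b></text>
</page></pdf2xml>"""

# (font, height) pairs mapped to a heading level (assumed hierarchy).
LEVELS = {("6", "28"): 0, ("10", "25"): 1, ("12", "18"): 2}
PERSON = ("0", "15")


def collect_rows(root):
    """Walk <text> elements in document order and build one row per person."""
    current = [None, None, None]  # latest heading seen at each level
    rows = []
    for el in root.iter("text"):
        if el.get("left") != "54":
            continue
        bold = el.find(".//b")
        if bold is None:
            continue
        bold_text = "".join(bold.itertext()).strip()
        key = (el.get("font"), el.get("height"))
        if key in LEVELS:
            level = LEVELS[key]
            current[level] = bold_text
            # A new heading invalidates everything nested below it.
            for lower in range(level + 1, len(current)):
                current[lower] = None
        elif key == PERSON:
            role = (el.text or "").strip()  # text before the <b> element
            rows.append((*current, role, bold_text))
    return rows


for row in collect_rows(ET.fromstring(SAMPLE)):
    print(row)
```

With this approach each person row carries the headings that were in effect at that point in the document, instead of being matched to headings by list index.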

0 answers:

No answers yet.