Question

I am trying to scrape the page http://apps.leg.wa.gov/wac/default.aspx?cite=296-17A&full=true to get output of the following form:

'6903-03' : u'Aerial spraying, seeding, crop dusting, or firefighting' ,
'6510-00' : u'Domestic servants/home care assistants employed in or about the private residence
of a home owner' ,
'1407-00' : u'Bus companies' ,

I am using Scrapy, with the following XPath:

response.xpath('//*[@id="ctl00_ContentPlaceHolder1_dlSectionContent"]/span/div/span/text()').extract()

It works, but it also returns some unwanted lines, such as these:

u'which also provide farm kill operations away from the custom meat shop',
u'Farm kill operations',
u'only',
u'no farm kill',
u'only',
u'4302-16 Farm kill',
u'exclusively',
u'only; ',
u'only',
u'no farm kill',
u'including farm kill',

One approach I have considered is to run a regular expression over each line and keep only the lines that match the pattern u'(?:\d{2}){2}-(?:\d{1}){2} [A-Za-z ]*'
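As a rough sketch of that regex idea (not the code from the question), the filter below keeps only strings that start with a class code and builds the code-to-description mapping shown at the top. It assumes each extracted string has the form "NNNN-NN description"; the names `CODE_RE`, `to_mapping` and `sample` are made up for illustration. The pattern `\d{4}-\d{2}` is a shorter way of writing `(?:\d{2}){2}-(?:\d{1}){2}`.

```python
import re

# Four digits, a hyphen, two digits, whitespace, then the description.
CODE_RE = re.compile(r"(\d{4}-\d{2})\s+(.+)", re.DOTALL)


def to_mapping(lines):
    """Map class code -> description for lines that start with a class code."""
    mapping = {}
    for line in lines:
        match = CODE_RE.match(line.strip())
        if match:
            mapping[match.group(1)] = match.group(2).strip()
    return mapping


# A few strings taken from the wanted and unwanted output in the question.
sample = [
    "6903-03 Aerial spraying, seeding, crop dusting, or firefighting",
    "only",
    "no farm kill",
    "4302-16 Farm kill",
    "1407-00 Bus companies",
]
result = to_mapping(sample)
```

Note that a pattern-only filter still lets '4302-16 Farm kill' through, even though it appears in the unwanted list above, so the regex alone may not be enough.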

Is there a better or cleaner way to identify such spans?

PS: The spans do not have any class attribute; they only have a style. I am not sure whether I can use the style to identify the spans I need.

Answer 1

The XPath can be made more specific by including the h3 tag, so that it can refer to the h3's next sibling div:

'//*[@id="ctl00_ContentPlaceHolder1_dlSectionContent"]/descendant::h3/following-sibling::div[1]/span/text()'
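To illustrate what the `descendant::h3/following-sibling::div[1]/span` step selects, here is a minimal standard-library sketch. The markup in `FRAGMENT` is a hypothetical, well-formed stand-in, not the real page, and `heading_spans` is an illustrative helper. `xml.etree.ElementTree` supports only a subset of XPath and has no `following-sibling` axis, so the sibling step is done by hand here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page markup; the real page is not reproduced.
FRAGMENT = """
<span id="ctl00_ContentPlaceHolder1_dlSectionContent">
  <span>
    <h3>Section heading</h3>
    <div><span style="font-weight:bold">0101-00 Land clearing</span></div>
    <div>Body text with an emphasised word <span style="text-decoration:underline">only</span></div>
  </span>
</span>
"""


def heading_spans(container):
    """Text of spans in the first div that follows each h3 under container."""
    texts = []
    for parent in container.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag != "h3":
                continue
            # Emulates following-sibling::div[1]: first later sibling that is a div.
            first_div = next((s for s in children[i + 1:] if s.tag == "div"), None)
            if first_div is not None:
                texts.extend(s.text for s in first_div.findall("span") if s.text)
    return texts


root = ET.fromstring(FRAGMENT.strip())
# Broad path from the question: every span inside every div.
broad = [s.text for s in root.findall("span/div/span")]
# Narrow path from this answer: only spans in the div right after an h3.
narrow = heading_spans(root)
```

On this fragment the broad path also picks up the stray "only" span, while the narrow one keeps just the line with the class code. In Scrapy, the XPath string above would simply be passed to response.xpath(...).extract() as in the question.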

It can be tested on Linux / Cygwin with:

xmllint --recover --html --xpath '//*[@id="ctl00_ContentPlaceHolder1_dlSectionContent"]/descendant::h3/following-sibling::div[1]/span' ~/tmp/test.html| sed -re 's%<span style=[^>]+>([^<]+)</span>%\1\n%g' | less

Sample output:

0101-00 Land clearing: Highway, street and road construction, N.O.C.
0103-09 Drilling or blasting: N.O.C.
0104-12 Dredging, N.O.C.

