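I've written a script in Python that uses CSS selectors to parse some names and phone numbers from a web page. The script doesn't give me the result I expect; instead, some information I don't want also comes through. How can I correct my selectors so that they pick up only the names and phone numbers and nothing else? For your consideration, I've pasted a link to the HTML at the bottom. Thanks in advance.
Here is what I wrote: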
from lxml.html import fromstring

# `html` holds the page source as a string (see the linked file below)
root = fromstring(html)
for tree in root.cssselect(".cbFormTableEvenRow"):
    try:
        name = tree.cssselect(".cbFormDataCell span.cbFormData")[0].text
    except IndexError:  # no matching element in this row
        name = ""
    try:
        phone = tree.cssselect(".cbFormLabel:contains('Phone Number')+td.cbFormDataCell .cbFormData")[0].text
    except IndexError:  # no matching element in this row
        phone = ""
    print(name, phone)
The result I expect:
JAYMES CARTER (402)499-8846
The result I get:
1840390831
RESIDENTIAL
JAYMES CARTER (402)499-8846
None
My valuation jumped by almost $60,000 in one year. There are multiple comparable properties nearby that are much lower than my $194,300 evaluation, and a lot closer to my 2016 year evaluation of $134,400.
Link to the HTML file:
https://www.dropbox.com/s/64apg5cjpssd3hb/html_table.html?dl=0
Answer 0 (score: 1)
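Find the tr element that is the grandparent of the span whose text is "Phone Number". From there, get the td elements for the items you want, then follow the hierarchy of those elements down to their text.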
>>> from lxml.html import fromstring
>>> root = fromstring(open('html_table.html').read())
>>> grand_parent = root.xpath('.//td[contains(text(),"Phone Number")]/..')[0]
>>> grand_parent.xpath('td[1]/span/text()')[0]
'JAYMES CARTER'
>>> grand_parent.xpath('td[5]/span/text()')[0]
'(402)499-8846'
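The same idea can be wrapped in a loop so that only rows containing a "Phone Number" label are visited. This is a minimal sketch, not a drop-in fix: SAMPLE_HTML is a made-up stand-in for the linked file, extract_contacts is a hypothetical helper name, and it assumes the label text sits directly inside a td with the phone value in the next td.

```python
from lxml.html import fromstring

# Hypothetical markup that imitates the class names from the question;
# the real file may be laid out differently.
SAMPLE_HTML = """<html><body><table>
<tr class="cbFormTableEvenRow">
  <td class="cbFormLabelCell">Parcel ID</td>
  <td class="cbFormDataCell"><span class="cbFormData">1840390831</span></td>
</tr>
<tr class="cbFormTableEvenRow">
  <td class="cbFormDataCell"><span class="cbFormData">JAYMES CARTER</span></td>
  <td class="cbFormDataCell"><span class="cbFormData">&nbsp;</span></td>
  <td class="cbFormLabelCell">Phone Number</td>
  <td class="cbFormDataCell"><span class="cbFormData">(402)499-8846</span></td>
</tr>
</table></body></html>"""

LABEL = 'td[contains(text(), "Phone Number")]'


def extract_contacts(html):
    """Return (name, phone) pairs from rows that carry a Phone Number label."""
    root = fromstring(html)
    contacts = []
    # The predicate keeps only rows that have the label cell,
    # so unrelated rows never reach the loop body.
    for row in root.xpath('.//tr[' + LABEL + ']'):
        name = row.xpath('td[1]/span/text()')
        # The phone value is assumed to be in the cell right after the label.
        phone = row.xpath(LABEL + '/following-sibling::td[1]/span/text()')
        contacts.append((
            name[0].strip() if name else "",
            phone[0].strip() if phone else "",
        ))
    return contacts


for name, phone in extract_contacts(SAMPLE_HTML):
    print(name, phone)
```

Selecting the rows by their label, instead of by the shared cbFormTableEvenRow class, is what keeps the unrelated rows out of the output.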
Addendum in response to a comment:
>>> grand_parent.xpath('.//span[@class="cbFormData"]/text()')
['JAYMES CARTER', '\xa0', '(402)499-8846']
>>> items = grand_parent.xpath('.//span[@class="cbFormData"]/text()')
>>> [_.replace('\xa0', '').strip() for _ in items]
['JAYMES CARTER', '', '(402)499-8846']
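If the empty placeholder is not wanted either, the blanks can be dropped after cleaning. A small sketch, assuming the row yields exactly a name and a phone number once the blanks are gone; clean_items is a hypothetical helper, and the list below is copied from the output above.

```python
def clean_items(items):
    """Strip whitespace (including non-breaking spaces) and drop empty strings."""
    cleaned = (item.replace("\xa0", " ").strip() for item in items)
    return [item for item in cleaned if item]


# Values copied from the xpath result shown above.
items = ['JAYMES CARTER', '\xa0', '(402)499-8846']

# Unpacking raises ValueError if the row does not hold exactly two values.
name, phone = clean_items(items)
print(name, phone)
```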