I have a series of paragraphs that I'm trying to parse with XPath. The HTML is formatted as follows:
<div id="content_third">
<h3>Title1</h3>
<p>
<strong>District</strong>
John Q Public <br>
Susie B Private
<p>
<p>
<strong>District</strong>
Anna C Public <br>
Bob J Private
<p>
<h3>Title1</h3>
<p>
<strong>District</strong>
John Q Public <br>
Susie B Private
<p>
<p>
<strong>District</strong>
Anna C Public <br>
Bob J Private
<p>
</div>
I'm setting up an initial loop like this:
titles = tree.xpath('//*[@id="content_third"]/h3')
for num in range(len(titles)):
Then an inner loop:
district_races = tree.xpath('//*[@id="content_third"]/p[count(preceding-sibling::h3)={0}]'.format(num))
for index in range(len(district_races)):
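On each pass, I want to select only the "District" text inside the strong element. I tried the following, but it returns empty lists, except for one list that contains all of the districts: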
zone = tree.xpath('//*[@id="content_third"]/p[count(preceding-sibling::h3)={0}]/strong[{1}]/text()'.format(num, index))
Gotta love those unformatted state election web pages.
Answer 0 (score: 2)
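I assume each District is a placeholder for an actual name. To get every district in a simpler way than what you're attempting, just extract the text from each strong inside each p: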
h = """<div id="content_third">
<h3>Title1</h3>
<p>
<strong>District</strong>
John Q Public <br>
Susie B Private
<p>
<p>
<strong>District</strong>
Anna C Public <br>
Bob J Private
<p>
<h3>Title1</h3>
<p>
<strong>District</strong>
John Q Public <br>
Susie B Private
<p>
<p>
<strong>District</strong>
Anna C Public <br>
Bob J Private
<p>
</div>"""
from lxml import html
tree = html.fromstring(h)
print(tree.xpath('//*[@id="content_third"]/p/strong/text()'))
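If the districts need to stay grouped under their h3 title, as the loops in the question attempt, the following is a minimal sketch of one way to do it. It probably helps to note why the original expression returns mostly empty lists: XPath positions are 1-based, so strong[0] never matches, and every p here has at least one preceding h3, so a count of 0 never matches either. Also, strong[n] picks the n-th strong inside each p, not the n-th p. The sample markup below is an assumption: it is a trimmed, well-formed version of the page with made-up distinct names (District 1, Title2, and so on), and districts_by_title is a hypothetical helper name, not part of lxml.

```python
from lxml import html

# Assumed sample: well-formed and with distinct placeholder names so the
# grouping is visible. The real page uses different (and messier) markup.
SAMPLE = """<div id="content_third">
<h3>Title1</h3>
<p><strong>District 1</strong> John Q Public <br> Susie B Private</p>
<p><strong>District 2</strong> Anna C Public <br> Bob J Private</p>
<h3>Title2</h3>
<p><strong>District 3</strong> John Q Public <br> Susie B Private</p>
</div>"""


def districts_by_title(markup):
    """Return a list of (title, [district, ...]) pairs (hypothetical helper)."""
    tree = html.fromstring(markup)
    grouped = []
    headings = tree.xpath('//*[@id="content_third"]/h3')
    # Start at 1: a p in the first section has exactly one preceding h3.
    for position, heading in enumerate(headings, start=1):
        # Relative query from this h3: sibling p elements whose number of
        # preceding h3 siblings equals this heading's position.
        districts = heading.xpath(
            'following-sibling::p[count(preceding-sibling::h3) = $n]'
            '/strong/text()',
            n=position,
        )
        title = heading.text_content().strip()
        grouped.append((title, [d.strip() for d in districts]))
    return grouped


result = districts_by_title(SAMPLE)
print(result)
```

Iterating over the h3 elements themselves, rather than over range(len(titles)), avoids rebuilding an absolute path with str.format on every pass, and passing the position as the XPath variable $n keeps the expression constant.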