Question

I have a series of paragraphs that I am trying to parse with XPath. The HTML is formatted like this:

<div id="content_third">
 <h3>Title1</h3>
 <p>
  <strong>District</strong>
  John Q Public <br>
  Susie B Private 
 <p>
 <p>
  <strong>District</strong>
  Anna C Public <br>
  Bob J Private 
 <p>
 <h3>Title1</h3>
 <p>
  <strong>District</strong>
  John Q Public <br>
  Susie B Private 
 <p>
 <p>
  <strong>District</strong>
  Anna C Public <br>
  Bob J Private 
 <p>
</div>

I am setting up an initial loop like this:

titles = tree.xpath('//*[@id="content_third"]/h3')
for num in range(len(titles)):

Then an inner loop:

district_races = tree.xpath('//*[@id="content_third"]/p[count(preceding-sibling::h3)={0}]'.format(num))
for index in range(len(district_races)):

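On each pass through the loop, I want to select only the "District" text inside the strong element. I tried the following, but it returns empty lists, except for one list that contains every district: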

zone = tree.xpath('//*[@id="content_third"]/p[count(preceding-sibling::h3)={0}]/strong[{1}]/text()'.format(num, index))

Gotta love those unformatted state election web pages.

Answer 1

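I am assuming each "District" is a placeholder for an actual name. To get every district far more simply than what you are attempting, just extract the text from each strong inside each p: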

h = """<div id="content_third">
 <h3>Title1</h3>
 <p>
  <strong>District</strong>
  John Q Public <br>
  Susie B Private
 <p>
 <p>
  <strong>District</strong>
  Anna C Public <br>
  Bob J Private
 <p>
 <h3>Title1</h3>
 <p>
  <strong>District</strong>
  John Q Public <br>
  Susie B Private
 <p>
 <p>
  <strong>District</strong>
  Anna C Public <br>
  Bob J Private
 <p>
</div>"""

from lxml import html

tree = html.fromstring(h)

print(tree.xpath('//*[@id="content_third"]/p/strong/text()'))
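If the districts also need to stay grouped under their h3 title, as the loops in the question attempt, something along these lines may work. It is only a sketch: the sample markup, the distinct titles and district names, and the helper name districts_by_title are made up for illustration, and the paragraphs are closed with </p> to keep the example unambiguous. Two things are worth keeping in mind. XPath positions start at 1, so strong[0] never matches anything. And strong[1] means "the first strong inside each p", not "the strong in the n-th p", which is probably why the original attempt returned one list with every district and empty lists otherwise.

```python
from lxml import html

# Hypothetical sample: same layout as the question, but with distinct
# titles and district names so the grouping is visible.
sample = """<div id="content_third">
 <h3>Title1</h3>
 <p><strong>District 1</strong> John Q Public <br> Susie B Private</p>
 <p><strong>District 2</strong> Anna C Public <br> Bob J Private</p>
 <h3>Title2</h3>
 <p><strong>District 3</strong> John Q Public <br> Susie B Private</p>
</div>"""

tree = html.fromstring(sample)


def districts_by_title(root):
    """Return a list of (title, [district, ...]) pairs, one per h3."""
    results = []
    headings = root.xpath('//*[@id="content_third"]/h3')
    # Start counting at 1: a p that follows the first h3 has exactly
    # one preceding h3 sibling, a p after the second has two, and so on.
    for position, heading in enumerate(headings, start=1):
        paragraphs = heading.xpath(
            'following-sibling::p[count(preceding-sibling::h3) = $n]',
            n=position,
        )
        # Query each paragraph relative to itself, and skip paragraphs
        # that contain no strong element.
        districts = [
            p.xpath('string(strong[1])').strip()
            for p in paragraphs
            if p.xpath('strong')
        ]
        results.append((heading.text_content().strip(), districts))
    return results


for title, districts in districts_by_title(tree):
    print(title, districts)
```

Running each inner query relative to the paragraph element, rather than building one long absolute path with format(), keeps the per-title and per-paragraph steps separate and avoids the off-by-one on the index.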

python xpath: loop through paragraphs and grab <strong>