How do I get the text between heading (h3) tags?

Date: 2018-09-26 10:08:30

Tags: python html xml xpath

We have data like this:

<h3>title1</h3>
<p> paragraph 1</p>
<p> paragraph 2</p>
<p> paragraph 3</p>
<h3>title2</h3>
<p> paragraph 4</p>
<p> paragraph 5</p>
<table>
    <tr>
        <td>data1</td>
        <td>data2</td>
    </tr>
</table>
<h3>title3</h3>
<p> paragraph 6</p>
<p> paragraph 7</p>
<p> paragraph 8</p>
<p> paragraph 9</p>
<h3>title4</h3>
<p> paragraph 10</p>
<p> paragraph 11</p>
<p> paragraph 12</p>

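How can I get the data between the h3 elements, i.e.

  1. [paragraph 1, paragraph 2, paragraph 3]

  2. [paragraph 4, paragraph 5, data1, data2]

  3. [paragraph 6, paragraph 7, paragraph 8, paragraph 9]

  4. [paragraph 10, paragraph 11, paragraph 12]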

I tried the following XPath expressions:

  1. hdoc.xpath('h3[contains(.,"title1")]//following-sibling::*[following::*[self::h3]]//text()')

  2. hdoc.xpath('h3[contains(.,"title2")]//following-sibling::*[following::*[self::h3]]//text()')

2 answers:

Answer 0: (score: 1)

Try something like this:

hdoc.xpath("//p[./preceding-sibling::h3[contains(text(),'title1')] and ./following-sibling::h3[contains(text(),'title2')]]/text()")

hdoc.xpath("//p[./preceding-sibling::h3[contains(text(),'title2')] and ./following-sibling::h3[contains(text(),'title3')]]/text()")

hdoc.xpath("//p[./preceding-sibling::h3[contains(text(),'title3')] and ./following-sibling::h3[contains(text(),'title4')]]/text()")

hdoc.xpath("//p[./preceding-sibling::h3[contains(text(),'title4')] and not(./following-sibling::h3)]/text()")

If you don't want to rely on the text of each h3, you can instead count the h3 elements that precede each element:

# For elements between title1 and title2
hdoc.xpath('//p[count(preceding-sibling::h3)=1]/text() | //table[count(preceding-sibling::h3)=1]//td/text()')

# For elements between title2 and title3
hdoc.xpath('//p[count(preceding-sibling::h3)=2]/text() | //table[count(preceding-sibling::h3)=2]//td/text()')
...
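
The count-based queries above all follow one pattern, so they can be generated in a loop. The following is only a minimal sketch, assuming the document is parsed with lxml (the question does not say how `hdoc` was built) and that each paragraph is closed with `</p>`. `SAMPLE` and `texts_by_section` are hypothetical names used for illustration, and the expression selects every non-h3 sibling instead of listing `p` and `table` separately:

```python
from lxml import html

# Shortened copy of the sample data, wrapped in a <div> so it has a single root.
SAMPLE = """
<div>
<h3>title1</h3>
<p> paragraph 1</p>
<p> paragraph 2</p>
<p> paragraph 3</p>
<h3>title2</h3>
<p> paragraph 4</p>
<p> paragraph 5</p>
<table>
    <tr>
        <td>data1</td>
        <td>data2</td>
    </tr>
</table>
<h3>title3</h3>
<p> paragraph 6</p>
</div>
"""


def texts_by_section(root):
    """Map each h3 title to the non-blank text found before the next h3."""
    sections = {}
    for index, heading in enumerate(root.xpath('//h3'), start=1):
        # A sibling with exactly `index` h3 elements before it sits between
        # the index-th h3 and the next one (or the end of the parent).
        nodes = heading.xpath(
            'following-sibling::*[not(self::h3)]'
            '[count(preceding-sibling::h3)=$n]//text()',
            n=index,
        )
        # Drop whitespace-only text nodes and trim the rest.
        sections[heading.text_content().strip()] = [
            t.strip() for t in nodes if t.strip()
        ]
    return sections


if __name__ == "__main__":
    for title, texts in texts_by_section(html.fromstring(SAMPLE)).items():
        print(title, texts)
```

Because the section boundary comes from the h3 count, the last section needs no special case for a missing closing h3.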

Answer 1: (score: 0)

This XPath,

//text()[    preceding::h3[. = 'title1'] 
         and following::h3[. = 'title2']]

will select all text nodes between the h3 elements with the given string values.
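
As a rough illustration of how this expression could be run from Python, here is a sketch that assumes lxml and passes the two titles in as XPath variables. `doc` and `text_between` are hypothetical names, and the sample is a shortened copy of the question's data with each paragraph closed by `</p>`. The expression also matches the whitespace-only text nodes between elements, so the sketch filters those out:

```python
from lxml import html

# Shortened copy of the sample data, wrapped in a <div> so it has a single root.
doc = html.fromstring("""
<div>
<h3>title1</h3>
<p> paragraph 1</p>
<h3>title2</h3>
<p> paragraph 4</p>
<p> paragraph 5</p>
<table><tr><td>data1</td><td>data2</td></tr></table>
<h3>title3</h3>
<p> paragraph 6</p>
</div>
""")


def text_between(root, start_title, end_title):
    """Return the non-blank text nodes located between two h3 elements."""
    nodes = root.xpath(
        '//text()[preceding::h3[. = $start] and following::h3[. = $end]]',
        start=start_title,
        end=end_title,
    )
    # Drop whitespace-only text nodes and trim the rest.
    return [t.strip() for t in nodes if t.strip()]


if __name__ == "__main__":
    print(text_between(doc, "title2", "title3"))
```

Since the expression selects text nodes rather than `p` elements, the table cells are picked up without a separate query. The last section has no following h3, so it would need a different condition.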