How do I find the text inside specific tags with lxml and Python?

Asked: 2015-01-03 14:17:37

Tags: python python-3.x lxml.html

Suppose the HTML source looks like this:

some other content here
<div class="box">
    <h5>this is another one title</h5>
    <p>text paragraph 1 here</p>
    <p>text paragraph 2 here</p>
    <p>text paragraph n here</p>
</div>
<div class="box">
    <h5>specific title</h5>
    <p>text paragraph 1 here</p>
    <p>text paragraph 2 here</p>
    <p>text paragraph 3 here</p>
    <p>text paragraph 4 here</p>
    <small>some specific character:here are some character</small>
</div>
<div class="box">
    <h5>this is another tow title</h5>
    <p>text paragraph 1 here</p>
    <p>text paragraph 2 here</p>
    <p>text paragraph n here</p>
</div>
some other content here

The output I want is:

specific title

text paragraph 1 here
text paragraph 2 here
text paragraph 3 here
text paragraph 4 here

I want to get the specific title and the text of its paragraphs, and I want to do it with Python's lxml. How can I do that?

1 answer:

Answer 0 (score: 1)

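Use the XPath expression .//h5[text()="specific title"]/following-sibling::p/text(). It selects the text of every p element that is a following sibling of the h5 element whose text is "specific title":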

>>> import lxml.html
>>>
>>> s = '''
... <html>
... some other content here
    ...
... <div class="box">
... <h5>specific title</h5>
... <p>text paragraph 1 here</p>
... <p>text paragraph 2 here</p>
... <p>text paragraph 3 here</p>
... <p>text paragraph 4 here</p>
... <small>some specific character:here are some character</small>
... </div>
... <div class="box">
... <h5>this is another tow title</h5>
    ...
... </div>
... some other content here
... </html>
... '''
>>>
>>> root = lxml.html.fromstring(s)
>>> root.xpath('.//h5[text()="specific title"]/following-sibling::p/text()')
['text paragraph 1 here', 'text paragraph 2 here', 'text paragraph 3 here',
 'text paragraph 4 here']
>>> print('\n'.join(root.xpath(
...     './/h5[text()="specific title"]/following-sibling::p/text()')))
text paragraph 1 here
text paragraph 2 here
text paragraph 3 here
text paragraph 4 here
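
The question also asks for the title itself, which the expression above does not return. A minimal sketch of one way to get both follows. The helper name extract_box and the trimmed-down HTML constant are assumptions made for illustration, not part of the original answer. It matches the heading with normalize-space() so that stray whitespace around the title does not break the comparison, and passes the title as an XPath variable to avoid quoting problems.

```python
import lxml.html

# Shortened copy of the question's markup (assumed sample data).
HTML = '''
<html><body>
<div class="box">
    <h5>this is another one title</h5>
    <p>text paragraph 1 here</p>
</div>
<div class="box">
    <h5>specific title</h5>
    <p>text paragraph 1 here</p>
    <p>text paragraph 2 here</p>
    <p>text paragraph 3 here</p>
    <p>text paragraph 4 here</p>
    <small>some specific character:here are some character</small>
</div>
</body></html>
'''


def extract_box(html, title):
    """Return (heading text, [paragraph texts]) for the first <h5> whose
    whitespace-normalized text equals `title`, or None if nothing matches.

    Hypothetical helper for illustration only.
    """
    root = lxml.html.fromstring(html)
    # $title is an XPath variable; lxml fills it from the keyword argument.
    headings = root.xpath('.//h5[normalize-space()=$title]', title=title)
    if not headings:
        return None
    h5 = headings[0]
    # following-sibling only looks at later elements under the same parent,
    # so paragraphs from other boxes are not picked up.
    paragraphs = [p.text_content().strip()
                  for p in h5.xpath('following-sibling::p')]
    return h5.text_content().strip(), paragraphs


result = extract_box(HTML, 'specific title')
if result is not None:
    heading, paragraphs = result
    print(heading)
    print()
    print('\n'.join(paragraphs))
```

With the sample above this prints the heading, a blank line, and then the four paragraph lines. text_content() is used instead of text() so that a paragraph containing inline markup such as <b> would still yield its full text.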