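How do I select these elements from the following horrible HTML using XPath and lxml?

Time: 2010-11-19 16:48:36

Tags: python html xpath lxml

I want to use lxml and some clever XPath to select the following strings from this HTML. The strings will change, but the surrounding HTML will not.

I need...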

  • 19/11/2010
  • AAAAAA/01
  • Normal
  • United Kingdom
  • This description may contains <bold>html</bold> but i still need all of it!

...from

...
<p>
    <strong>Date:</strong> 19/11/2010<br>
    <strong>Ref:</strong> AAAAAA/01<br>
    <b>Type:</b> Normal<br>
    <b>Country:</b> United Kingdom<br>
</p>
<hr>
<p>
    <br>
    <b>1. Title:</b> The Title<br>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br>
    <b>3. Date:</b> 25th October<br>
...

</p>

...

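So far I have only come up with using a regular expression and re:match to try to pull the strings out, but even that doesn't work: it only gets me the innerHTML of, for example, the {{1}} node.

Is there a way to do this without post-processing the strings with regular expressions?

Thanks :)

2 answers:

Answer 0 (score: 2)

Very ugly! With this well-formed input: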

<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
<hr/>
<p>
    <br/>
    <b>1. Title:</b> The Title<br/>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/>
    <b>3. Date:</b> 25th October<br/>
</p>
</html>

The simplest case:

/html/p/strong[.='Date:']/following-sibling::text()[1]

Evaluates to:

 19/11/2010
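A minimal sketch of running this expression from Python, assuming the well-formed input above is parsed with `lxml.etree`. The names `DOC`, `nodes` and `date` are illustrative and not part of the original answer.

```python
from lxml import etree

# Well-formed sample input copied from this answer.
DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
<hr/>
<p>
    <br/>
    <b>1. Title:</b> The Title<br/>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/>
    <b>3. Date:</b> 25th October<br/>
</p>
</html>"""

root = etree.fromstring(DOC)

# xpath() returns a list; text nodes come back as str-like objects.
nodes = root.xpath("/html/p/strong[.='Date:']/following-sibling::text()[1]")

# The raw text node keeps its leading space, so strip it.
date = nodes[0].strip()
print(date)
```

For the original markup with bare `<br>` tags, an HTML parser such as `lxml.html` would probably be needed instead, and the leading steps of the path may have to be adjusted to the tree it builds.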

All of them in one:

/html/p/*[self::strong[.='Date:' or .='Ref:']|
          self::b[.='Type:' or .='Country:']]
         /following-sibling::text()[1]
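As a rough illustration, the combined expression can be evaluated with lxml as follows. This is a sketch under the assumption that the input is the well-formed document from this answer; `DOC`, `EXPR` and `values` are made-up names.

```python
from lxml import etree

# Well-formed sample input copied from this answer.
DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
<hr/>
<p>
    <br/>
    <b>1. Title:</b> The Title<br/>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/>
    <b>3. Date:</b> 25th October<br/>
</p>
</html>"""

# Whitespace between XPath tokens is allowed, so the expression
# can keep the multi-line layout used in the answer.
EXPR = """
/html/p/*[self::strong[.='Date:' or .='Ref:']|
          self::b[.='Type:' or .='Country:']]
         /following-sibling::text()[1]
"""

root = etree.fromstring(DOC)

# One text node per matched label; strip the surrounding whitespace.
values = [text.strip() for text in root.xpath(EXPR)]
print(values)
```

Note that `3. Date:` in the second paragraph is not matched, because the predicate compares the whole string value of the element with `Date:`.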

The complex one:

/html/p/node()[preceding-sibling::b[1][.='2. Description: ']]
              [following-sibling::b[1][.='3. Date:']]
              [not(self::br)]

Answer 1 (score: 0)

This is not difficult.

Given this XML document:

<html> 
<p> 
    <strong>Date:</strong> 19/11/2010<br/> 
    <strong>Ref:</strong> AAAAAA/01<br/> 
    <b>Type:</b> Normal<br/> 
    <b>Country:</b> United Kingdom<br/> 
</p> 
<hr/> 
<p> 
    <br/> 
    <b>1. Title:</b> The Title<br/> 
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/> 
    <b>3. Date:</b> 25th October<br/> 
</p> 
</html> 
  

I need...

  • 19/11/2010
  • AAAAAA/01
  • Normal
  • United Kingdom

This XPath expression selects all of the above text nodes:

/*/p[1]/text()
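One thing to be aware of when using this from lxml: the expression also returns the whitespace-only text nodes that sit between the tags, so some filtering is usually wanted. A small sketch, assuming the document above (`DOC`, `values` and `non_blank` are illustrative names):

```python
from lxml import etree

# Sample XML document copied from this answer.
DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
<hr/>
<p>
    <br/>
    <b>1. Title:</b> The Title<br/>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/>
    <b>3. Date:</b> 25th October<br/>
</p>
</html>"""

root = etree.fromstring(DOC)

# Option 1: filter in Python, dropping nodes that are only whitespace.
values = [t.strip() for t in root.xpath("/*/p[1]/text()") if t.strip()]

# Option 2: filter in XPath; normalize-space() yields an empty string
# (false in a predicate) for whitespace-only text nodes.
non_blank = root.xpath("/*/p[1]/text()[normalize-space()]")

print(values)
```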
  
      
  • This description may contains <bold>html</bold> but i still need all of it!

Use this:

/*/p[2]/b[2]/following-sibling::node()
                 [count(.|/*/p[2]/b[2]/following-sibling::br[1]/preceding-sibling::node()) 
                = 
                  count((/*/p[2]/b[2]/following-sibling::br[1]/preceding-sibling::node()))
                 ]