XPath (again): extracting a single string when an element is optional

Time: 2016-05-22 07:51:06

Tags: python xpath browser

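I have a list of these <td> elements and am using a list comprehension to process them all at once. I want to extract the text "v 11/4" in both cases, i.e. with or without the <sup>, and it must come out as a single element (per row).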

Example 1

<td>
<b class="black">2</b>/6 <a href="/some/link"onclick=
"returnHtml.popup(this," title="whateveryoulike">(ABL TTTTTSSSSSS
11)</a>v 11/4</td>

Example 2

<td>
<b class="black">2</b>/6 <a href="/some/link"onclick=
"returnHtml.popup(this," title="whateveryoulike">(ABL TTTTTSSSSSS
11)</a>v<sup>1</sup> 11/4</td>

Any ideas?

1 answer:

Answer 0 (score: 1)

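One possible way to identify the text "v 11/4" consistently in both <td> examples is to concatenate all direct child text nodes of the <td> that come after the <a>. The "1" inside <sup> is a child of <sup>, not a direct child of the <td>, so it is left out. Here is a sample implementation using lxml.html: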

>>> from lxml import html
>>> raw = '''<tr>
... <td>
... <b class="black">2</b>/6 <a href="/some/link" onclick=
... "returnHtml.popup(this," title="whateveryoulike">(ABL TTTTTSSSSSS
... 11)</a>v 11/4</td>
... <td>
... <b class="black">2</b>/6 <a href="/some/link" onclick=
... "returnHtml.popup(this," title="whateveryoulike">(ABL TTTTTSSSSSS
... 11)</a>v<sup>1</sup> 11/4</td>
... </tr>'''
>>> root = html.fromstring(raw)
>>> # For each <td>, join the text nodes that follow <a> as siblings, then trim whitespace.
>>> result = [''.join(td.xpath("a/following-sibling::text()")).strip()
...           for td in root.xpath("//td")]
>>> result
['v 11/4', 'v 11/4']
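
The same idea can also be expressed without XPath, which may help show why the approach works. The following is only a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml, and it is not part of the original answer. It assumes the cells are well-formed XML, so the sample markup is simplified, and the names ROW, text_after_anchor and tail_result are made up for illustration. In the ElementTree model, the text that follows an element's end tag is stored in that element's .tail, so joining the tails of <a> and of every later sibling gives the direct text of the <td> after <a>.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed versions of the two cells (attributes trimmed; an assumption).
ROW = (
    '<tr>'
    '<td><b class="black">2</b>/6 <a href="/some/link">(ABL 11)</a>v 11/4</td>'
    '<td><b class="black">2</b>/6 <a href="/some/link">(ABL 11)</a>v<sup>1</sup> 11/4</td>'
    '</tr>'
)


def text_after_anchor(td):
    """Join the text that directly follows the first <a> child and each later sibling."""
    children = list(td)
    for index, child in enumerate(children):
        if child.tag == "a":
            # .tail is the text between an element's end tag and the next sibling,
            # so the content of <sup> itself is never included.
            return "".join(sib.tail or "" for sib in children[index:]).strip()
    return ""  # no <a> child in this cell


row = ET.fromstring(ROW)
tail_result = [text_after_anchor(td) for td in row.iter("td")]
print(tail_result)
```

For real-world HTML that is not well-formed, the lxml.html version above is likely the more practical choice, since ElementTree's parser rejects malformed markup.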