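I have a list of <td> elements like the ones below, and I'm using a list comprehension to process them all in one go.
I want to extract the text "v 11/4" in both cases, that is, with or without the <sup>.
For each row, it has to be extracted into a single element.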
Example 1:
<td>
<b class="black">2</b>/6 <a href="/some/link"onclick=
"returnHtml.popup(this," title="whateveryoulike">(ABL TTTTTSSSSSS
11)</a>v 11/4</td>
Example 2:
<td>
<b class="black">2</b>/6 <a href="/some/link"onclick=
"returnHtml.popup(this," title="whateveryoulike">(ABL TTTTTSSSSSS
11)</a>v<sup>1</sup> 11/4</td>
Any ideas?
Answer 0 (score: 1)
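One way to identify the text "v 11/4" consistently in both <td> examples is to take "the concatenation of all direct child text nodes of <td> that come after the link (<a>)". In the second example the "1" sits inside <sup>, so it is not a direct child text node of <td> and is skipped, while the " 11/4" that follows </sup> is kept. Here is an example implementation using lxml.html: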
>>> from lxml import html
>>> raw = '''<tr>
... <td>
... <b class="black">2</b>/6 <a href="/some/link" onclick=
... "returnHtml.popup(this," title="whateveryoulike">(ABL TTTTTSSSSSS
... 11)</a>v 11/4</td>
... <td>
... <b class="black">2</b>/6 <a href="/some/link" onclick=
... "returnHtml.popup(this," title="whateveryoulike">(ABL TTTTTSSSSSS
... 11)</a>v<sup>1</sup> 11/4</td>
... </tr>'''
>>> root = html.fromstring(raw)
>>> # For each <td>, join the text nodes that follow <a> as siblings, then trim whitespace
>>> result = [''.join(td.xpath("a/following-sibling::text()")).strip()
...           for td in root.xpath("//td")]
>>> result
['v 11/4', 'v 11/4']
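
If lxml is not available, the same rule (keep only the text directly inside <td> that comes after the link has closed) can be sketched with the standard library's html.parser. This is a different technique from the XPath query above, not a drop-in equivalent. The class name TextAfterLink, the helper text_after_link, and the trimmed sample markup are made up for illustration, and the sketch assumes every child tag inside <td> is explicitly closed (a bare <br> would throw off the depth tracking).

```python
from html.parser import HTMLParser


class TextAfterLink(HTMLParser):
    """Hypothetical helper: per <td>, collect direct-child text after </a>."""

    def __init__(self):
        super().__init__()
        self.results = []        # one cleaned string per <td>
        self._parts = None       # text pieces of the current <td>; None outside a <td>
        self._stack = []         # child tags currently open inside the <td>
        self._seen_link = False  # True once a top-level <a> has been closed

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts, self._stack, self._seen_link = [], [], False
        elif self._parts is not None:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        if tag == "td":
            self.results.append("".join(self._parts).strip())
            self._parts = None
        elif self._stack:
            self._stack.pop()
            if tag == "a" and not self._stack:
                self._seen_link = True

    def handle_data(self, data):
        # Keep text only when no child tag is open (so it belongs to <td>
        # itself) and the link has already been closed.
        if self._parts is not None and self._seen_link and not self._stack:
            self._parts.append(data)


def text_after_link(markup):
    parser = TextAfterLink()
    parser.feed(markup)
    parser.close()
    return parser.results


# Trimmed version of the two cells from the question (onclick/title omitted).
sample = '''<tr>
<td><b class="black">2</b>/6 <a href="/some/link">(ABL 11)</a>v 11/4</td>
<td><b class="black">2</b>/6 <a href="/some/link">(ABL 11)</a>v<sup>1</sup> 11/4</td>
</tr>'''

results = text_after_link(sample)
print(results)  # ['v 11/4', 'v 11/4']
```

The XPath version is shorter and more tolerant of messy markup, so this is probably only worth reaching for when adding a dependency is not an option.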