I have the following HTML and want to leave out the text that belongs to the a href tags:
<td>BetaShares Managed Risk Global Share Fund</td>
<td class="text-center"><a href="/asx/wrld" target="_blank">WRLD</a></td>
<td class="text-center">0.39%</td>
<td class="text-center">N/A</td>
<td>A broadly diversified portfolio of global shares - <a href="http://www.betashares.com.au/products/name/managed-risk-global-share-fund" target="_blank">Link</a></td>
</tr><tr><td><img alt="iShares Logo" src="/sites/default/files/etfs/logos/ishares-logo-icon.png" /></td>
<td>iShares Core MSCI World All Cap</td>
<td class="text-center"><a href="/asx/iwld" target="_blank">IWLD</a></td>
<td class="text-center">0.16%</td>
<td class="text-center">MSCI World Investible Market</td>
<td>Covers large, mid and small-capitalisation stocks across developed markets which comply with MSCI's size, liquidity, and free-float criteria - <a href="https://www.blackrock.com/au/intermediaries/products/283117/" target="_blank">Link</a></td>
</tr><tr><td><img alt="iShares Logo" src="/sites/default/files/etfs/logos/ishares-logo-icon.png" /></td>
The output I want is:
BetaShares Managed Risk Global Share Fund,WRLD,iShares Core MSCI World All Cap,IWLD
I tried:
output = tree.xpath('//td[not(@class)][not(contains(.,"href"))]/text()')
But it returns unwanted results:
BetaShares Managed Risk Global Share Fund,A broadly diversified portfolio of global shares -,iShares Core MSCI World All Cap,MSCI World Investible Market
Answer 0 (score: 0)
Try the following XPath to get the required text nodes:
//tr/td[not(@class) and not(a)]/text() | //tr/td/a[not(../preceding-sibling::td[@class])]/text()
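This should return only the text nodes of td elements that have no class attribute and no link child, plus the text of the first link in each table row (the link whose parent td has no preceding sibling td with a class attribute).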
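To see what each half of the union contributes, here is a minimal sketch using lxml. It assumes the fragment from the question is wrapped in a complete table with closed rows (the table and tr wrappers and the names HTML, names, tickers and combined are added here for illustration only).

```python
from lxml import html

# Assumption: the question's fragment, wrapped in <table>/<tr> so both rows are complete.
HTML = """
<table>
<tr>
<td>BetaShares Managed Risk Global Share Fund</td>
<td class="text-center"><a href="/asx/wrld" target="_blank">WRLD</a></td>
<td class="text-center">0.39%</td>
<td class="text-center">N/A</td>
<td>A broadly diversified portfolio of global shares - <a href="http://www.betashares.com.au/products/name/managed-risk-global-share-fund" target="_blank">Link</a></td>
</tr>
<tr>
<td><img alt="iShares Logo" src="/sites/default/files/etfs/logos/ishares-logo-icon.png" /></td>
<td>iShares Core MSCI World All Cap</td>
<td class="text-center"><a href="/asx/iwld" target="_blank">IWLD</a></td>
<td class="text-center">0.16%</td>
<td class="text-center">MSCI World Investible Market</td>
<td>Covers large, mid and small-capitalisation stocks across developed markets which comply with MSCI's size, liquidity, and free-float criteria - <a href="https://www.blackrock.com/au/intermediaries/products/283117/" target="_blank">Link</a></td>
</tr>
</table>
""".strip()

tree = html.fromstring(HTML)

# First branch: td cells with no class attribute and no <a> child (the fund names).
names = tree.xpath('//tr/td[not(@class) and not(a)]/text()')

# Second branch: links whose parent td has no earlier sibling td carrying a class
# attribute (the ticker links, but not the trailing "Link" anchors).
tickers = tree.xpath('//tr/td/a[not(../preceding-sibling::td[@class])]/text()')

# Union of both branches; lxml returns the nodes in document order.
combined = tree.xpath(
    '//tr/td[not(@class) and not(a)]/text()'
    ' | //tr/td/a[not(../preceding-sibling::td[@class])]/text()'
)

print(names)
print(tickers)
print(combined)
```

The logo cell has no class and no link, so it matches the first branch, but it contains no text node and therefore adds nothing to the result.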
Since tree.xpath() returns a list of text nodes, you can get the required text as a single string like this:
output = ", ".join(tree.xpath('//tr/td[not(@class) and not(a)]/text() | //tr/td/a[not(../preceding-sibling::td[@class])]/text()'))
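As for why the original attempt did not filter out the description cell, a likely explanation is that contains(., "href") tests the string value of the td, which is only its concatenated text content. Attribute names such as href are never part of that string, so the predicate is true for every cell. The sketch below illustrates this on a single row; the wrapping table and tr and the variable names are assumptions made for the example.

```python
from lxml import html

# Assumption: one row from the question, reduced to three cells and wrapped in a table.
ROW = """<table><tr>
<td>BetaShares Managed Risk Global Share Fund</td>
<td class="text-center"><a href="/asx/wrld" target="_blank">WRLD</a></td>
<td>A broadly diversified portfolio of global shares - <a href="http://www.betashares.com.au/products/name/managed-risk-global-share-fund" target="_blank">Link</a></td>
</tr></table>"""

tree = html.fromstring(ROW)

# The string value of the description cell is its text plus the link text, no markup.
description_td = tree.xpath('//td[not(@class)]')[1]
string_value = description_td.xpath('string(.)')

# No cell's string value contains "href", so not(contains(., "href")) never excludes anything.
cells_mentioning_href = tree.xpath('//td[contains(., "href")]')

# The original expression therefore still returns the description text.
attempt = [t.strip() for t in tree.xpath('//td[not(@class)][not(contains(.,"href"))]/text()')]

print(string_value)
print(cells_mentioning_href)
print(attempt)
```

Testing for the child element itself, as not(a) does in the expression above, is what actually separates cells with links from cells without them.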