Question

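I have read all the other related topics, but I can't combine them into a working solution. I simply can't read the full content of a paragraph. If it contains a <span>, I don't get the enclosed text at all, and if it contains an <a href> link, I only get the link text without the actual URL. Can anyone help?

Example: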

from lxml import etree

tree = etree.HTML('<div id="classy"><span>this is </span><p>Some text, then this link: <a href="http://www.violentpower.com/" target="_blank" rel="nofollow"> Insane website</a> and some more text here.</p></div>')
_result = tree.xpath('//div[@id="classy"]//descendant::p')

for article in _result:
    # Serialize each matched <p> element and print it inside the loop
    _output = etree.tostring(article, pretty_print=True, encoding="unicode")
    print(_output)

I would expect this result:

this is Some text, then this link: http://www.violentpower.com/ Insane website and some more text here.

...but I get this:
<p>Some text, then this link: Insane website and some more text here.</p>

Answer 1

If you select

result = tree.xpath("//div[@id = 'classy']//text() | //div[@id = 'classy']//@href")

you get a list

['this is ', 'Some text, then this link: ', 'http://www.violentpower.com/', ' Insane website', ' and some more text here.']

You can then merge all the strings into a single string with

''.join(result)
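
Putting the two steps together, here is a minimal sketch, assuming lxml is installed. The helper name `text_with_links` and the XPath variable `$i` are illustrative choices, not part of the original answer; the XPath expression itself is the union shown above.

```python
from lxml import etree

HTML = ('<div id="classy"><span>this is </span><p>Some text, then this link: '
        '<a href="http://www.violentpower.com/" target="_blank" rel="nofollow">'
        ' Insane website</a> and some more text here.</p></div>')


def text_with_links(markup, div_id):
    """Hypothetical helper: join every text node and href value under a div."""
    tree = etree.HTML(markup)
    # Union of all descendant text nodes and all href attributes.
    # The div id is passed in as the XPath variable $i.
    parts = tree.xpath('//div[@id=$i]//text() | //div[@id=$i]//@href', i=div_id)
    # The node-set comes back in document order, so a plain join is enough.
    return ''.join(parts)


result = text_with_links(HTML, 'classy')
print(result)
```

Because an attribute node sorts after its own element but before that element's children, the URL lands directly in front of the link text.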

How to use XPath to get all the text from a <p> element, including any href links and their link text
