Here is the HTML:
<p class="myParagraph">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
<a href="http://google.it" class="small-link" target="_blank">
<span class="tco-ellipsis"></span>
<span class="invisible">https://</span>
<span class="js-display-url">google.it</span>
<span class="invisible">lpage/events/?ref=page_internal&mt_nav=0&locale2=it_IT</span>
<span class="tco-ellipsis">
<span class="invisible"> </span>…
</span>
</a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>
If I use tree.xpath('//p/text()')
it only returns
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
instead of
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
I also tried tree.xpath('string(//p)'), as suggested here.
How can I get both the full paragraph text and the href? The a element is not always present.
Answer 0 (score: 0)
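xpath('//p/text()')
returns a list of strings, one for each text node that is a direct child of p. Join those strings to get the desired result.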
from lxml import html
doc = """<p class="myParagraph">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
<a href="http://google.it" class="small-link" target="_blank">
<span class="tco-ellipsis"></span>
<span class="invisible">https://</span>
<span class="js-display-url">google.it</span>
<span class="invisible">lpage/events/?ref=page_internal&mt_nav=0&locale2=it_IT</span>
<span class="tco-ellipsis">
<span class="invisible"> </span>…
</span>
</a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>"""
root = html.fromstring(doc)
# text() yields the paragraph's direct text nodes as a list; join them into one string.
print("".join(root.xpath("//p/text()")))
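The question also asks for the href, and notes that the a element is not always present. A minimal sketch of one way to handle both, assuming lxml is installed: the helper name paragraph_and_href and the shortened markup are illustrative and not part of the original post. Selecting @href returns a list, which is simply empty when the paragraph has no link.

```python
from lxml import html

# Hypothetical, shortened version of the paragraph from the question.
doc = """<p class="myParagraph">
Lorem ipsum dolor sit amet
<a href="http://google.it" class="small-link">
<span class="js-display-url">google.it</span>
</a> ornare, suscipit nisl eget
</p>"""


def paragraph_and_href(p):
    """Return (text, href) for a <p> element; href is None if it has no <a> child."""
    # ./text() selects only the text nodes that are direct children of <p>,
    # so the text inside the link's spans is skipped.
    pieces = p.xpath("./text()")
    # Collapse the whitespace left over from line breaks in the markup.
    text = " ".join(" ".join(pieces).split())
    # ./a/@href yields a (possibly empty) list, so a missing <a> is not an error.
    hrefs = p.xpath("./a/@href")
    return text, (hrefs[0] if hrefs else None)


root = html.fromstring(doc)
for p in root.xpath("//p"):
    print(paragraph_and_href(p))
```

Using relative paths (./text() and ./a/@href) keeps each result tied to its own paragraph, which matters once a page contains more than one p element.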