Paragraph with an href inside

Date: 2019-08-29 07:39:04

Tags: xpath web-scraping lxml

This is the HTML:

<p class="myParagraph">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
  <a href="http://google.it" class="small-link" target="_blank">
    <span class="tco-ellipsis"></span>
    <span class="invisible">https://</span>
    <span class="js-display-url">google.it</span>
    <span class="invisible">lpage/events/?ref=page_internal&amp;mt_nav=0&amp;locale2=it_IT</span>
    <span class="tco-ellipsis">
      <span class="invisible">&nbsp;</span>…
    </span>
  </a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>

If I use tree.xpath('//p/text()'), it only returns

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo

instead of

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium

I also tried tree.xpath('string(//p)'). How can I get both the full paragraph and the href? The paragraph does not always contain an a element.

1 answer:

Answer 0 (score: 0)

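xpath('//p/text()') returns a list of strings, one for each text node that is a direct child of the p element. Join these strings to get the desired result.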

from lxml import html

doc = """<p class="myParagraph">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
  <a href="http://google.it" class="small-link" target="_blank">
    <span class="tco-ellipsis"></span>
    <span class="invisible">https://</span>
    <span class="js-display-url">google.it</span>
    <span class="invisible">lpage/events/?ref=page_internal&amp;mt_nav=0&amp;locale2=it_IT</span>
    <span class="tco-ellipsis">
      <span class="invisible">&nbsp;</span>…
    </span>
  </a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>"""

root = html.fromstring(doc)
print("".join(root.xpath("//p/text()")))
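
The question also asks for the href, which the join above does not return. One possible approach, as a minimal sketch: iterate over each p element and query its direct text nodes and any a/@href values relative to it. The helper name paragraph_and_links, the wrapping div, and the second link-free paragraph are assumptions added for illustration and are not part of the original markup.

```python
from lxml import html

# Shortened version of the question's markup, wrapped in a <div> and
# extended with a second, link-free paragraph (both are illustrative additions).
doc = """<div>
<p class="myParagraph">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
  <a href="http://google.it" class="small-link" target="_blank">
    <span class="invisible">https://</span>
    <span class="js-display-url">google.it</span>
  </a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>
<p>No link here.</p>
</div>"""


def paragraph_and_links(p):
    """Return (text, hrefs) for one <p> element (hypothetical helper)."""
    # ./text() selects only text nodes that are direct children of <p>,
    # so the text inside the link's <span> elements is skipped.
    pieces = p.xpath("./text()")
    # Collapse newlines and indentation into single spaces.
    text = " ".join(" ".join(pieces).split())
    # Yields an empty list when the paragraph has no <a> element.
    hrefs = p.xpath(".//a/@href")
    return text, hrefs


root = html.fromstring(doc)
results = [paragraph_and_links(p) for p in root.xpath("//p")]

for text, hrefs in results:
    print(text)
    print(hrefs)
```

Because the href query returns a list, paragraphs without a link need no special handling. If the link's visible text should be part of the paragraph as well, p.text_content() could be used in place of ./text().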