Scraping text from a node, excluding the text of the last child

Time: 2017-07-31 05:15:32

Tags: python web-scraping beautifulsoup

I am trying to scrape quotes from Goodreads. I only need the quote itself, not the author's name.

Here is the HTML source.

<div class="quoteText">
      “Don't cry because it's over, smile because it happened.”
  <br>  ―
    <a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>

I tried the following, but the result includes the author information.

quotes = [quote.text.strip() for quote in soup.findAll('div', {'class':'quoteText'})]

I also tried using contents[0], but that fails for multi-line quotes. See below:

<div class="quoteText">
      “You've gotta dance like there's nobody watching,
<br>
Love like you'll never be hurt,
<br>
Sing like there's nobody listening,
<br>
And live like it's heaven on earth.”
  <br>  ―
    <a class="authorOrTitle" href="/author/show/1744830.William_W_Purkey">William W. Purkey</a>
</div>

1 Answer:

Answer 0: (score: 1)

This is a simple problem. quote.text.strip() gives you '“Don't cry because it's over, smile because it happened.”\n ―\n Dr. Seuss', so you can split that string on \n and keep only the quote. Example: [quote.text.strip().split("\n")[0] for quote in soup.findAll("div", {"class":"quoteText"})]
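Note that split("\n")[0] keeps only the first line, so for the multi-line quote in the question it would return just "“You've gotta dance like there's nobody watching,". A minimal sketch that handles both cases is shown here. It assumes the author is always introduced by the "―" character seen in the sample markup, and cuts the text at the last occurrence of it. The helper name extract_quote and the SEPARATOR constant are hypothetical, and html.parser is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (one single-line and one multi-line quote).
html = """
<div class="quoteText">
      “Don't cry because it's over, smile because it happened.”
  <br>  ―
    <a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>
<div class="quoteText">
      “You've gotta dance like there's nobody watching,
<br>
Love like you'll never be hurt,
<br>
Sing like there's nobody listening,
<br>
And live like it's heaven on earth.”
  <br>  ―
    <a class="authorOrTitle" href="/author/show/1744830.William_W_Purkey">William W. Purkey</a>
</div>
"""

# Assumption: this character always precedes the author link, as in the sample.
SEPARATOR = "―"


def extract_quote(div):
    """Hypothetical helper: return the quote text of one div, without the author."""
    text = div.get_text()
    # Keep everything before the last separator; use the full text if there is none.
    before, sep, _ = text.rpartition(SEPARATOR)
    body = before if sep else text
    # Trim indentation and drop the blank lines left behind by the <br> tags.
    lines = [line.strip() for line in body.splitlines()]
    return "\n".join(line for line in lines if line)


soup = BeautifulSoup(html, "html.parser")
quotes = [extract_quote(div) for div in soup.find_all("div", class_="quoteText")]

for quote in quotes:
    print(quote)
    print("---")
```

Cutting at the separator rather than at the first newline keeps every line of a multi-line quote. If a page used a different character before the author, the separator would need to be adjusted.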

If you don't want the quotation marks (i.e. “ and ”), you can replace them with "" using .replace().
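As a small illustration of removing the quotation marks, the sketch below uses the sample quote from the question as a hard-coded string; in practice the string would come from the scraped text. The variable names are hypothetical. It shows chained .replace() calls, plus str.strip() as an alternative when only the leading and trailing marks should go.

```python
# Sample string taken from the question; normally this would be the scraped quote.
quote = "“Don't cry because it's over, smile because it happened.”"

# Option 1: remove every curly quotation mark with chained .replace() calls.
no_marks = quote.replace("“", "").replace("”", "")

# Option 2: strip only the marks at the start and end, leaving any inside untouched.
trimmed = quote.strip("“”")

print(no_marks)
print(trimmed)
```

The two options give the same result here; they differ only when a quote contains nested curly quotation marks, which .replace() would also remove.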