Question

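I am scraping a web page with BeautifulSoup. In one case I want to skip the contents of a specific tag and get the contents of the other tags. In the HTML below, I don't want the contents of the div tag, but I can't work out how to exclude it. Please help.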

HTML code:

<blockquote class="messagetext">
    <div style="margin: 5px; float: right;">
        unwanted text .....
    </div>
    Text..............
    <a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
    <a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
    <a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
    ,text
</blockquote>

I tried this:

content = soup.find('blockquote',attrs={'class':'messagetext'}).text

But this also returns the unwanted text inside the div tag.

Answer 1

Use the clear() method, which empties a tag's contents, as follows:

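from bs4 import BeautifulSoup

# html_doc holds the HTML shown in the question
soup = BeautifulSoup(html_doc, 'html.parser')
content = soup.find('blockquote', attrs={'class': 'messagetext'})

# Empty every descendant div so its text no longer appears in content.text
for tag in content.findChildren():
    if tag.name == 'div':
        tag.clear()

print(content.text)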

This produces:

Text..............
text 
text
text
   ,text
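
A possible variation on the same idea, offered as a sketch rather than the answerer's method: instead of emptying the div with clear(), decompose() removes the tag from the tree altogether, and get_text() with a separator and strip=True collapses the leftover whitespace. The function name text_without_divs is hypothetical, and html_doc is simply the snippet from the question.

```python
from bs4 import BeautifulSoup

# The HTML from the question, reproduced so the example runs on its own.
html_doc = """
<blockquote class="messagetext">
    <div style="margin: 5px; float: right;">
        unwanted text .....
    </div>
    Text..............
    <a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
    <a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
    <a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
    ,text
</blockquote>
"""


def text_without_divs(html):
    """Return the blockquote text with every nested div removed.

    Hypothetical helper name, used here for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("blockquote", attrs={"class": "messagetext"})
    # decompose() deletes the tag and everything inside it from the tree.
    for div in content.find_all("div"):
        div.decompose()
    # Join the remaining text fragments with single spaces and trim each one.
    return content.get_text(" ", strip=True)


result = text_without_divs(html_doc)
print(result)
```

Note that both clear() and decompose() modify the parsed tree in place, so if the div is needed later it may be safer to work on a freshly parsed copy.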
