Question

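I need to scrape plain text from about 2,000 websites. They share no common page structure, so I expect it will be hard to handle them all with a single script.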

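As a first pass, I went through some trial and error with BeautifulSoup. So far I have managed to scrape some plain text by looking at what sits between certain tags (<p> and all heading tags), using the following: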

soup.findAll(['p', re.compile('h[0-9]'), 'title'])

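However, some pages contain an RSS / news feed whose text I do not want. From what I can see in the page source, it is wrapped in a div with a particular CSS class. So my question is whether I can tell the command above not to scrape text that is enclosed in a div of a certain class.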

Answer 1

You can use a function as a filter:

`import re

def my_filter(tag):
    # In bs4 'class' is a list of names; .get() avoids a KeyError when the parent has no class attribute
    wanted = tag.name in ('p', 'title') or re.match('h[0-9]', tag.name)
    return bool(wanted) and 'certain_div_class' not in tag.parent.get('class', [])

my_tags = soup.findAll(my_filter)`
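The filter above only inspects the direct parent. If the feed's div wraps the text several levels up (for example div > ul > li > p), a variant that checks every ancestor may fit better. The following is a minimal sketch, assuming bs4 with the built-in `html.parser`; the sample HTML, the class name `news_feed`, and the helper name `outside_excluded_div` are hypothetical stand-ins, not part of the original question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page; 'news_feed' stands in for the real div class.
HTML = """
<html><head><title>Demo</title></head><body>
<h1>Heading</h1>
<p>Keep me</p>
<div class="news_feed sidebar"><ul><li><p>Skip me</p></li></ul><h2>Feed</h2></div>
</body></html>
"""

# Tag names we want to collect: <p>, <title>, and <h0>..<h9>.
WANTED = re.compile(r'^(p|title|h[0-9])$')

def outside_excluded_div(tag, excluded_class='news_feed'):
    """Return True for wanted tags with no <div class=excluded_class> ancestor."""
    if not WANTED.match(tag.name):
        return False
    # find_parent walks up the tree; class_ matches any one of the div's classes.
    return tag.find_parent('div', class_=excluded_class) is None

soup = BeautifulSoup(HTML, 'html.parser')
texts = [t.get_text(strip=True) for t in soup.find_all(outside_excluded_div)]
print(texts)
```

For the sample page, this collects the title, the top-level heading, and the first paragraph, and skips both the nested paragraph and the heading inside the feed div. If the unwanted divs are never needed at all, another option is to call `decompose()` on each of them before searching, which removes them from the tree.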
