BFS algorithm for a web crawler in Python using Beautiful Soup?

Time: 2016-02-12 18:54:27

Tags: python linux algorithm beautifulsoup

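I have to create my own web crawler (for educational purposes) that crawls every Bulgarian website (.bg domain), or as many as possible, and returns the server each one is running on, using the curl -I command in a Linux shell or a library. As a starting point I am using a large database-like website that contains links to many other websites.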


I really don't know where to start, so I'm interested in a general algorithm for solving this problem with Beautiful Soup, for URLs such as example.bg/xyz/xyz/...

1 answer:

Answer 0: (score: 0)

As you say, you'll need a graph traversal algorithm such as BFS or DFS. I would start by working out how to adapt one of these algorithms to your purpose, which basically means marking each website as visited so that you never process it twice. I don't know whether you are familiar with them, so here is a link for reference: http://www.geeksforgeeks.org/depth-first-traversal-for-a-graph/
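To make the "mark as visited" idea concrete, here is a minimal BFS sketch, not a finished crawler. The names bfs_crawl, get_links, max_sites and the toy_web dictionary are hypothetical; get_links stands in for whatever code downloads a page and returns its links, and treating each hostname as one "site" is an assumption about what the question needs.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl(start_url, get_links, max_sites=1000):
    """Breadth-first traversal that visits each host at most once.

    get_links is a hypothetical callable: URL -> iterable of absolute URLs.
    """
    visited = {urlparse(start_url).hostname}  # hosts already queued
    queue = deque([start_url])                # FIFO queue gives BFS order
    order = []                                # URLs in the order they were processed

    while queue and len(order) < max_sites:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            host = urlparse(link).hostname
            # Follow only .bg hosts that have not been seen before.
            if host and host.endswith(".bg") and host not in visited:
                visited.add(host)
                queue.append(link)
    return order


# Toy link graph standing in for real pages (made-up URLs).
toy_web = {
    "http://a.bg/": ["http://b.bg/x", "http://c.bg/", "http://a.bg/about"],
    "http://b.bg/x": ["http://c.bg/y", "http://d.com/"],
    "http://c.bg/": ["http://a.bg/"],
}
print(bfs_crawl("http://a.bg/", lambda u: toy_web.get(u, [])))
```

Swapping popleft() for pop() would turn the same loop into a DFS. The visited set is what stops the crawl from looping when sites link back to each other.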

Secondly, you can use Beautiful Soup to pull the data of interest (here, the links) out of each HTML page.
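As a rough illustration of that second step, the sketch below collects the links from one page with Beautiful Soup. The function name extract_links and the sample HTML are made up for the example. It assumes the bs4 package is installed and that the HTML has already been downloaded by some other means.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute http(s) URLs for every <a href=...> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # Resolve relative links such as "/about" against the page URL.
        absolute = urljoin(base_url, a["href"])
        # Skip mailto:, javascript: and other non-web schemes.
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


sample = (
    '<a href="/about">About</a> '
    '<a href="http://other.bg/">Other</a> '
    '<a href="mailto:x@example.bg">Mail</a>'
)
print(extract_links(sample, "http://example.bg/"))
```

A function like this can supply the neighbours of each node in the graph traversal. Reading the server name (what curl -I shows in the Server header) is a separate step that is not covered here.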