How do I make this web crawler infinite?

Posted: 2015-08-18 11:54:52

Tags: python web-scraping beautifulsoup web-crawler python-requests

This is the code I'm trying to write: a web crawler that loops over a list of links, where the first link is the seed, every link found on a page is appended to the list, and the for loop keeps working through the growing list. For some reason the script always stops after appending and printing about 150 links.

import requests
from bs4 import BeautifulSoup
import urllib.request

links = ['http://example.com']
def spider(max_pages):
    page = 1
    number = 1
    while page <= max_pages:
        try:
            for LINK in links:
                url = LINK
                source_code = requests.get(url)
                plain_text = source_code.text
                soup = BeautifulSoup(plain_text, "html.parser")
                for link in soup.findAll("a"):
                    try:
                        href = link.get("href")
                        if href.startswith("http"):
                            if href not in links:
                                number += 1
                                links.append(href)
                                print("{}: {}".format(number, href))
                    except:
                        pass

        except Exception as e:
            print(e)

while True:
    spider(10000)

What can I do to make it run forever?

2 answers:

Answer 0 (score: 2)

That error occurs when you hit an <a> element that has no href attribute. You should check that the link actually has an href before calling startswith on it.

Answer 1 (score: 0)

Samir Chahine,

Your code fails because the href variable is None in:

href = link.get("href")

So add an extra check:

if (href is not None) and href.startswith("http://")

Please convert that logic into Python code.

Try to debug using print statements, like:

href = link.get("href")
print("href: {}".format(href))  # href may be None, so avoid string concatenation here
if href is not None and href.startswith("http"):
    print("Condition passed 1")
    if href not in links:
        print("Condition passed 2")
        number += 1
        links.append(href)
        print("{}: {}".format(number, href))