Scrapy CrawlSpider not following links

Asked: 2020-05-23 12:49:10

Tags: scrapy

I am trying to build a crawler that crawls an entire website and outputs a list of all the domains that site links to (without duplicates).

I came up with the following code:

import scrapy
from crawler.items import CrawlerItem
from crawler.functions import urlToDomain
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class domainSpider(CrawlSpider):
    global allDomains
    allDomains = []
    name = "domainSpider"
    allowed_domains = ["example.com"]
    start_urls = [
        "https://example.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback='parse', follow=True),
    )

    def parse(self, response):

        urls = response.xpath("//a/@href").extract()
        # normalize the different URL formats (protocol-relative vs. absolute)
        urlsOk = []
        for elt in urls:
            if elt[:2] == "//":  # protocol-relative (external) link
                urlsOk.append(elt)
            elif elt[:4] == "http":  # absolute link
                urlsOk.append(elt)

        domaines = list(set([urlToDomain(x) for x in urlsOk]))
        item = CrawlerItem()
        item["domaines"] = []
        item["url"] = response.url
        for elt in domaines:
            if elt not in allDomains:
                item['domaines'].append(elt)
                allDomains.append(elt)
                yield {'domaines': elt}

This works exactly as expected as far as retrieving the domains goes, but it stops crawling (finishes) after the first page.

1 answer:

Answer 0: (score: 1)

I was overriding CrawlSpider's built-in parse method, which caused the bug: CrawlSpider uses parse internally to apply the rules and follow links, so overriding it stops the crawl after the start URL.

The solution is to rename the callback method from parse to anything else.
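
For reference, a minimal sketch of the spider with the callback renamed (parse_page is an arbitrary name). It assumes the same urlToDomain helper from the question and keeps the set of seen domains on the spider instance instead of a global:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from crawler.functions import urlToDomain  # helper from the question


class domainSpider(CrawlSpider):
    name = "domainSpider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    # the callback must not be named "parse": CrawlSpider uses parse
    # internally to apply the rules, so overriding it breaks link following
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.allDomains = set()  # domains already seen, no duplicates

    def parse_page(self, response):
        urls = response.xpath("//a/@href").extract()
        # keep only absolute and protocol-relative links
        urlsOk = [u for u in urls if u[:2] == "//" or u[:4] == "http"]

        for domain in set(urlToDomain(u) for u in urlsOk):
            if domain not in self.allDomains:
                self.allDomains.add(domain)
                yield {'domaines': domain}

It can then be run as usual, e.g. scrapy crawl domainSpider -o domains.json, to dump the yielded domains to a file.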
