Question

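I am using a Scrapy CrawlSpider to crawl a website. It uses the parse function to get information from the start URL. It also has a rule to crawl any link containing "pageNumber=" and to get information from those pages.

My problem: if I override the parse function, the spider does not follow/execute the rule. If I comment out the parse function, the rule is followed.

I know you are not supposed to use the parse function as the callback in a rule, and I am not, but both my parse function and my rule callback call the same function. Could that be causing the problem?

What is going wrong, and how do I get the spider to follow/execute the rule?

My code: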

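import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# BusinessFinderItem is the project's own Item class (its import is not shown in the original).

class BusinessFinderSpider(CrawlSpider):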

    name = "Business_Finder"
    allowed_domains = ["yellowpages.com.au"]
    start_urls = ["http://www.yellowpages.com.au/search/listings?clue=abc&locationClue=5000&selectedViewMode=list&eventType=sort&sortBy=distance"]
    rules = [Rule(LinkExtractor(allow=['/search/listings.*pageNumber=']), callback="parse_sub_page")]

    # If I comment out the function below, the rules are followed.
    def parse(self, response):
        return self.parse_business_list_page(response)

    def parse_sub_page(self, response):
        return self.parse_business_list_page(response)

    def parse_business_list_page(self, response):

        businesses = []
        business_divs = response.xpath("//div[contains(@class, 'cell') and contains(@class, 'in-area-cell') and contains(@class, 'middle-cell')]")
        main_industry = re.search("(clue=)(.*?(?=&))", response.url).group(2)

        for business_div in business_divs:
            business = BusinessFinderItem()
            business["name"] = business_div.xpath(".//a[@class='listing-name']/text()").extract()
            ...
            businesses.append(business)

        return businesses

Answer 1

There are two main types of spider: Spider (called BaseSpider in older Scrapy releases) and CrawlSpider.

If you override parse() in a CrawlSpider, the rules are effectively disabled, because CrawlSpider uses parse() itself to apply them.

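To use your own parse(), try inheriting from Spider instead of CrawlSpider. When you do that, you have to extract the URLs manually, for example with an XPath expression, and then yield a Request for each URL with a callback function. In the callback you receive the response and extract the content.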

Spider does not follow rules if parse is overridden
