Question

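I'm writing a spider (CrawlSpider) for an online store. Per the client's requirements, I need to write two rules: one to determine which pages contain items, and another to extract the items.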

I already have both rules working independently:

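If I set start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] and comment out the Rule and the code for parse_category, parse_item extracts every item.
Conversely, if I set start_urls = "http://www.example.com" and comment out the Rule and the code for parse_item, parse_category returns every link that has items to extract, i.e. it returns www.example.com/books.php and www.example.com/movies.php.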

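My problem is that I don't know how to merge the two modules, so that with start_urls = "http://www.example.com", parse_category extracts www.example.com/books.php and www.example.com/movies.php and feeds those links to parse_item, where I actually extract each item's information.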

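I need to do it this way rather than just using start_urls = ["www.example.com/books.php", "www.example.com/movies.php"], because if a new category is added in the future (e.g. www.example.com/music.php), the spider would not detect it automatically and would have to be edited by hand. Not a big deal, but the client doesn't want that.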

Answer 1

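CrawlSpider rules don't work the way you want; you'll need to implement the logic yourself. A Rule with no callback defaults to follow=True, because the idea is to keep collecting links (no items) while the rules are followed; a Rule that has a callback defaults to follow=False. See the documentation.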

You could try something like this:

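import scrapy
from scrapy.linkextractors import LinkExtractor


# Minimal stand-ins for the project's item classes (an assumption).
class StoreCategory(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()


class StoreItem(scrapy.Item):
    pass  # declare the item's fields here


class StoreSpider(scrapy.Spider):
    # A plain Spider is used here: there are no rules, and CrawlSpider
    # keeps its own parsing logic for responses without a callback.
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # Queue category pages; the pattern is a placeholder.
        category_le = LinkExtractor(allow="something for categories")
        for link in category_le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_category)
        # Queue item pages; the pattern is a placeholder.
        item_le = LinkExtractor(allow="something for items")
        for link in item_le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_item)

    def parse_category(self, response):
        # Placeholders: replace with code that determines whether the current
        # page is a category (or just something else) and reads its name.
        is_category, name = False, None
        if is_category:
            category = StoreCategory()
            category["name"] = name
            category["url"] = response.url
            yield category
        # Keep looking for category and item links on this page as well.
        yield from self.parse(response)

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item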

Answer 2

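I didn't use parse_category; instead I used restrict_css in the LinkExtractor to get the links I want, and it seems to be feeding the second Rule with the extracted links, so my question is answered. It ended up like this: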

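import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StoreItem(scrapy.Item):
    pass  # stand-in for the project's item class; declare its fields here


class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # Rule 1: follow the category links found inside #movies and #books.
        Rule(LinkExtractor(restrict_css=("#movies", "#books"))),
        # Rule 2: send every other extracted link to parse_item.
        Rule(LinkExtractor(), callback="parse_item"),
    )

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item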

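It still can't detect newly added categories (and there's no clear pattern to use in restrict_css without picking up other garbage), but at least it meets the client's requirement: 2 rules, one for extracting the category links and another for extracting the item data.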
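If the category URLs did happen to share a recognizable shape, a URL pattern might pick up new categories where CSS selectors cannot. The sketch below is only an illustration under that assumption: the pattern (a single top-level .php page), the about.html page, and the ?id=42 item URL are all hypothetical and not taken from the real site.

```python
import re

# Hypothetical pattern: a category is a single top-level .php page.
CATEGORY_PATTERN = re.compile(r"^https?://www\.example\.com/[a-z]+\.php$")

candidate_urls = [
    "http://www.example.com/books.php",
    "http://www.example.com/movies.php",
    "http://www.example.com/music.php",        # a category added later
    "http://www.example.com/about.html",       # not a category
    "http://www.example.com/books.php?id=42",  # hypothetical item page
]

# Keep only the URLs that look like category pages.
category_urls = [url for url in candidate_urls if CATEGORY_PATTERN.search(url)]
print(category_urls)
```

Since LinkExtractor matches its allow patterns against each link's absolute URL, the same pattern string could be tried as allow= in the first Rule in place of restrict_css.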

How to feed a spider with links crawled within the spider?
