Question

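I am using Scrapy to extract some data from a website, say "myproject.com". Here is the logic:

Go to the home page, where there are some categorylist links that are used to build the second wave of links.
The second-round links are usually the first page of each category. The other pages within a category follow the same regular expression pattern, wholesale/something/something/request or wholesale/pagenumber. I want to keep crawling along these patterns while storing the raw HTML in my item objects.

I tested these two steps separately using scrapy parse, and both of them worked.

First, I tried: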

scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules

I could see that it built the outlinks successfully. Then I tested one of the built outlinks:

scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules

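The rule seems to be correct, and it generates an item with the HTML stored in it.

However, when I tried to chain these two steps together using the depth argument, I saw that it crawled the outlinks but no items were generated.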

scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2

Here is the pseudocode:

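from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Assumed location of the item class; adjust to the real project layout.
from myproject.items import MyprojectItem


class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    # Raw strings keep the regex backslashes from being read as string escapes.
    rules = (
        Rule(LinkExtractor(allow=(r'/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=(r'/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        # Build the second wave of requests from a category page.
        try:
            soup = BeautifulSoup(response.body, 'html.parser')
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except Exception:
            pass

    def parse_pricing(self, response):
        # Store the raw HTML of a pricing page in an item.
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except Exception:
            item['mystatus'] = 'failed'
        return item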
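
As a quick sanity check outside Scrapy, the two allow patterns from the rules above can be tried against the URLs used in the scrapy parse commands. This is only a sketch using the standard re module. It assumes the link extractor applies each pattern as a regex search against the full URL, and the helper matching_callbacks is a hypothetical name, not part of Scrapy.

```python
import re

# Patterns copied from the spider's rules, paired with their callback names.
RULE_PATTERNS = [
    (re.compile(r'/categorylist/\w+'), 'parse_category'),
    (re.compile(r'/wholesale/\w+/(?:wholesale|request)/\d+'), 'parse_pricing'),
]


def matching_callbacks(url):
    """Return the callback names of every rule whose pattern is found in url.

    Hypothetical helper for illustration only; it is not a Scrapy function.
    """
    return [name for pattern, name in RULE_PATTERNS if pattern.search(url)]


if __name__ == '__main__':
    for url in (
        'http://www.myproject.com/categorylist/cat_a',
        'http://www.myproject.com/wholesale/cat_a/request/1',
    ):
        print(url, '->', matching_callbacks(url))
```

If both URLs map to the expected callback name, the patterns themselves are probably not the reason the chained run yields no items.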

Any suggestions are greatly appreciated!

Answer 1

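I had assumed that the new Request objects I built would be run against the rules and then handled by the callback defined in the matching Rule. However, after reading the documentation for Request, I found that the callback is handled differently.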

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

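callback (callable) - the function that will be called with the response of this request (once it is downloaded) as its first parameter. For more information, see Passing additional data to callback functions below. If a Request does not specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.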

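So the callback has to be set explicitly on each request built in parse_category:

...
# Route the response to parse_pricing instead of the default parse().
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...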

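In other words, even if the URLs I built match the second rule, they will not be passed to parse_pricing unless the callback is set on the Request itself. Hope this is helpful to other people.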
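
The behavior described above can be illustrated with a toy model in plain Python. It is not Scrapy's implementation: ToyRequest, ToySpider, and dispatch are hypothetical names that only mimic the documented rule that a request without a callback falls back to the spider's parse() method.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request: a URL plus an optional callback."""
    url: str
    callback: Optional[Callable[[str], str]] = None


class ToySpider:
    """Hypothetical spider with a default handler and one named handler."""

    def parse(self, url):
        # Default handler, used when a request carries no callback.
        return 'parse'

    def parse_pricing(self, url):
        return 'parse_pricing'


def dispatch(spider, request):
    """Call the request's own callback, falling back to spider.parse."""
    handler = request.callback or spider.parse
    return handler(request.url)


spider = ToySpider()
url = 'http://www.myproject.com/wholesale/cat_a/request/1'

# No callback: whether the URL matches a rule pattern plays no role here.
without_callback = dispatch(spider, ToyRequest(url))

# Explicit callback: the response goes where the request says.
with_callback = dispatch(spider, ToyRequest(url, callback=spider.parse_pricing))

print(without_callback, with_callback)
```

In this model the URL is never compared against any pattern, which mirrors the point of the answer: the rules decide callbacks only for links the spider extracts itself, not for requests built by hand.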
