I'm trying to parse a forum using this rule:
rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item', follow=True),)
I tried several variants (with and without the r prefix at the start, with and without the trailing $ in the pattern, and so on), but every time Scrapy generates links ending with an equals sign, even though there is no = in the source links and nothing else matching the page- pattern.
Here is an example of the extracted links (parse_start_url is also used, so the start URL is in there too; and yes, I tried removing it, which didn't help):
[<GET http://www.example.com/index.php?threads/topic.0000/>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-2=>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-3=>]
If I open these links in a browser or fetch them in the Scrapy shell, I get the wrong pages with nothing to parse, but removing those same characters fixes the problem.
So why does this happen, and how can I deal with it?
Edit 1 (additional information):
Edit 2:
The spider's code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request


class BmwclubSpider(CrawlSpider):
    name = "bmwclub"
    allowed_domains = ["www.bmwclub.ru"]
    start_urls = []
    start_url_objects = []

    rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item'),)

    def parse_start_url(self, response):
        return Request(url=response.url, callback=self.parse_item, meta={'site_url': response.url})

    def parse_item(self, response):
        return []
The command used to collect the links:
scrapy parse http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/ --noitems --spider bmwclub
Command output:
>>> STATUS DEPTH LEVEL 1 <<<
# Requests -----------------------------------------------------------------
[<GET http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-2=>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-3=>]
Answer 0 (score: 1):
This happens because of URL canonicalization.
You can disable it on the LinkExtractor:
rules = (
    Rule(LinkExtractor(allow=(r'page-\d+$',), canonicalize=False), callback='parse_item'),
)
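
For reference, the trailing = can be reproduced outside Scrapy with w3lib.url.canonicalize_url, the helper the LinkExtractor applies when canonicalization is enabled (w3lib is installed alongside Scrapy as a dependency). A minimal sketch, using the forum URL from the question:

from w3lib.url import canonicalize_url

url = ('http://www.bmwclub.ru/index.php'
       '?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/page-2')

# Everything after '?' is parsed as a query string, so the whole
# path-like part becomes a single key with an empty value.
# Re-serializing it as key=value percent-encodes the slashes and
# appends the trailing '=':
print(canonicalize_url(url))
# http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-2=

With canonicalize=False the extractor keeps the URLs exactly as they appear in the page markup, so the forum's ?threads/... links stay intact.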