I'm trying to parse a forum using this rule:
rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item', follow=True),)
I tried several variants (with and without the r prefix at the start, with and without the trailing $ in the pattern, and so on), but every time Scrapy generates links ending with an equals sign, even though there is no = in the source links and nothing else matching the page- pattern.
Here is an example of the extracted links (parse_start_url is also used, so the start URL is in there too; and yes, I tried removing it, which didn't help):
[<GET http://www.example.com/index.php?threads/topic.0000/>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-2=>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-3=>]
If I open these links in a browser or fetch them in the Scrapy shell, I get the wrong pages with nothing to parse, but removing those same characters fixes the problem.
So why does this happen, and how can I deal with it?
Edit 1 (additional information):
Edit 2:
The spider's code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request


class BmwclubSpider(CrawlSpider):
    name = "bmwclub"
    allowed_domains = ["www.bmwclub.ru"]
    start_urls = []
    start_url_objects = []

    rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item'),)

    def parse_start_url(self, response):
        return Request(url=response.url, callback=self.parse_item, meta={'site_url': response.url})

    def parse_item(self, response):
        return []
The command used to collect the links:
scrapy parse http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/ --noitems --spider bmwclub
Command output:
>>> STATUS DEPTH LEVEL 1 <<<
# Requests -----------------------------------------------------------------
[<GET http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-2=>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-3=>]
Answer 0 (score: 1):
This happens because of URL canonicalization.
You can disable it on the LinkExtractor:
rules = (
    Rule(LinkExtractor(allow=(r'page-\d+$',), canonicalize=False), callback='parse_item'),
)
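
For reference, the trailing = can be reproduced outside Scrapy with w3lib.url.canonicalize_url, the helper the LinkExtractor applies when canonicalization is enabled (w3lib is installed alongside Scrapy as a dependency). A minimal sketch, using the forum URL from the question:

from w3lib.url import canonicalize_url

url = ('http://www.bmwclub.ru/index.php'
       '?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/page-2')

# Everything after '?' is parsed as a query string, so the whole
# path-like part becomes a single key with an empty value.
# Re-serializing it as key=value percent-encodes the slashes and
# appends the trailing '=':
print(canonicalize_url(url))
# http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-2=

With canonicalize=False the extractor keeps the URLs exactly as they appear in the page markup, so the forum's ?threads/... links stay intact.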