Question

The URL I should end up following is: http://www.lecture-en-ligne.com/towerofgod/168/0/0/1.html

Scrapy gets the relative URL from the page source:

<a class="table" href="../../towerofgod/168/0/0/1.html">Lire en ligne</a>

And then it crawls badly, treating the "../../" as part of the next URL...

Should I use a custom process_value to transform the doubly relative URLs obtained from LxmlLinkExtractor?
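A minimal sketch of what such a process_value callable might look like. The function name `drop_parent_segments` is hypothetical, and the sketch assumes the extractor hands the callable an absolute URL that still contains the leftover ".." segments, as in the log output:

```python
from urllib.parse import urlsplit


def drop_parent_segments(url):
    """Resolve '.' and '..' in the path of an absolute URL, never climbing above the root."""
    parts = urlsplit(url)
    kept = []
    for segment in parts.path.split("/"):
        if segment == "..":
            # Pop the previous directory, but keep the leading "" that
            # stands for the root, so surplus ".." segments are discarded.
            if len(kept) > 1:
                kept.pop()
        elif segment != ".":
            kept.append(segment)
    return parts._replace(path="/".join(kept) or "/").geturl()


print(drop_parent_segments(
    "http://www.lecture-en-ligne.com/../../towerofgod/168/0/0/1.html"))
```

If that assumption holds, the function could be passed as process_value=drop_parent_segments to the LxmlLinkExtractor in the spider's rule. It only cleans up the symptom; it does not explain why the surplus ".." segments appear.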

Does Scrapy handle relative URLs properly? I mean, is this its expected behavior?

2014-12-06 17:20:05+0100 [togspider] DEBUG: Crawled (200) <GET http://www.lecture-en-ligne.com/manga/towerofgod/> (referer: None)

2014-12-06 17:20:05+0100 [togspider] DEBUG: Retrying <GET http://www.lecture-en-ligne.com/../../towerofgod/160/0/0/1.html> (failed 1 times): 400 Bad Request

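# On older Scrapy releases these classes live under scrapy.contrib instead.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class TogSpider(CrawlSpider):
    name = "togspider"
    allowed_domains = ["lecture-en-ligne.com"]
    start_urls = ["http://www.lecture-en-ligne.com/manga/towerofgod/"]

    # Extract the links matched by the XPath and pass each response to parse_chapter.
    rules = (
        Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                               restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a'),
             callback='parse_chapter'),
    )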

Answer 1

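The problem is that the page's HTML base element is set incorrectly. That element is supposed to specify the base URL for all relative links in the page: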

<base href="http://www.lecture-en-ligne.com/"/>

Scrapy respects that, which is why the links are formed this way.
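To illustrate why the base element matters here, the following sketch counts how many leading ".." segments of a link climb past the root of whichever base URL is in effect. The helper name `segments_above_root` is hypothetical and the code is only an illustration of the reasoning, not Scrapy's own resolution logic:

```python
from urllib.parse import urlsplit


def segments_above_root(base_url, href):
    """Count the leading '..' segments of href that climb past the root of base_url."""
    # Directories in the base path; the last piece is the file name (or empty).
    base_dirs = [s for s in urlsplit(base_url).path.split("/")[:-1] if s]
    ups = 0
    for segment in href.split("/"):
        if segment != "..":
            break
        ups += 1
    return max(0, ups - len(base_dirs))


href = "../../towerofgod/168/0/0/1.html"

# Resolved against the page URL, both ".." segments are consumed by
# the /manga/towerofgod/ directories.
print(segments_above_root("http://www.lecture-en-ligne.com/manga/towerofgod/", href))

# Resolved against the declared <base href>, there is nothing to climb out of,
# so both ".." segments are left over.
print(segments_above_root("http://www.lecture-en-ligne.com/", href))
```

With the page URL as the base, nothing is left over. With the declared base, two ".." segments remain, which matches the URL in the retry log. What happens to leftover segments depends on the URL-joining implementation: RFC 3986 says to drop them, while the log shows they were kept in this case.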
