Scrapy SgmlLinkExtractor rule does not crawl all defined links

Time: 2014-05-03 14:01:22

Tags: python web-scraping scrapy

I want to crawl all links that have one of the following formats:

http://example.com/index.php/comments/XXXXX
http://example.com/XXX1/index.php/comments/XXXXX
http://example.com/XXX2/index.php/comments/XXXX
http://example.com/XXX3/index.php/comments/XXXX

I defined the following rule:

start_urls = ['http://example.com/']

rules = [Rule(SgmlLinkExtractor(allow=[r'\w+/index.php/comments/\w+']), callback='parse_blogpost', follow=True)]

However, the crawler seems to visit only links like this one (http://example.com/index.php/comments/XXXXX) and never links like this one (http://example.com/XXX1/index.php/comments/XXXXX).
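One way to narrow this down is to test the pattern against the four URL shapes outside Scrapy. The sketch below is only a sanity check, not a fix. It assumes the link extractor applies each allow pattern as a regex search over the absolute URL (worth confirming against the Scrapy documentation). The URLs are the placeholders from the question, and `matched_part` is a hypothetical helper name.

```python
import re

# The allow pattern from the rule above, unchanged.
# Note that the unescaped '.' matches any character, not only a literal dot.
ALLOW = re.compile(r'\w+/index.php/comments/\w+')

# Placeholder URLs copied from the question.
URLS = [
    'http://example.com/index.php/comments/XXXXX',
    'http://example.com/XXX1/index.php/comments/XXXXX',
    'http://example.com/XXX2/index.php/comments/XXXX',
    'http://example.com/XXX3/index.php/comments/XXXX',
]


def matched_part(url):
    """Return the substring the pattern matches, or None if nothing matches.

    Assumption: re.search mirrors how the link extractor applies `allow`.
    """
    m = ALLOW.search(url)
    return m.group(0) if m else None


# Map each URL to the part of it that the pattern matched.
results = {url: matched_part(url) for url in URLS}

for url, part in results.items():
    print(url, '->', part)
```

Under that assumption, all four URLs match: in the first one `\w+` consumes the `com` of `example.com`, and in the others it consumes the `XXX1`, `XXX2` or `XXX3` prefix. If so, the pattern itself is probably not what excludes the prefixed links, and it may be worth checking whether those links actually appear in the HTML of the pages the spider reaches.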

Any help would be greatly appreciated!

0 answers:

No answers yet.