Scrapy Follow & scrape next page

Asked: 2015-03-02 08:47:17

Tags: python python-2.7 web-scraping scrapy

The problem I'm having is that none of my Scrapy spiders will crawl a website; each one scrapes a single page and stops. My understanding is that the rules member variable is responsible for following links, but I can't get it to follow any. I have been following the documentation here: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

What could I be missing that keeps my spider from crawling?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector

from Example.items import ExItem

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ac.uk"]
    start_urls = (
        'http://www.example.ac.uk',
    )

    rules = (
        Rule(LinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )
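For context, allow=("",) in the spider above is effectively "allow everything": LinkExtractor's allow patterns are regular expressions matched against each URL, and an empty pattern matches any string. The quick stdlib sketch below (no Scrapy needed; the URLs are invented examples) illustrates the filtering behaviour:

```python
import re

urls = [
    "http://www.example.ac.uk/",
    "http://www.example.ac.uk/course-finder?page=2",
]

# An empty regex matches every string, so allow=("",) filters nothing out.
assert all(re.search("", url) for url in urls)

# By contrast, a pattern like allow=('course-finder',) keeps only URLs
# containing that substring.
kept = [u for u in urls if re.search("course-finder", u)]
print(kept)
```

So the allow pattern alone is not what stops the spider from following links.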

1 Answer:

Answer 0 (score: 3)

Replace your rules with this:

rules = (
    Rule(LinkExtractor(allow=('course-finder',),
                       restrict_xpaths=('//div[@class="pagination"]',)),
         callback='parse_items', follow=True),
)
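A rough way to see what restrict_xpaths=('//div[@class="pagination"]',) does: only links found inside the matching pagination container are considered for following. The stdlib sketch below (no Scrapy required; the HTML snippet and class name are invented to mirror the answer's XPath) mimics that filtering with html.parser:

```python
from html.parser import HTMLParser

class PaginationLinkExtractor(HTMLParser):
    """Collect <a href> links, but only those inside <div class="pagination">."""

    def __init__(self):
        super().__init__()
        self.depth_in_pagination = 0  # > 0 while inside the pagination div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter (or nest deeper inside) the pagination container.
            if self.depth_in_pagination or attrs.get("class") == "pagination":
                self.depth_in_pagination += 1
        elif tag == "a" and self.depth_in_pagination and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth_in_pagination:
            self.depth_in_pagination -= 1

html = """
<div class="content"><a href="/course-finder/item-1">Item 1</a></div>
<div class="pagination">
  <a href="/course-finder?page=2">2</a>
  <a href="/course-finder?page=3">3</a>
</div>
"""

parser = PaginationLinkExtractor()
parser.feed(html)
print(parser.links)  # only links inside the pagination div survive
```

The link in the content div is ignored; only the two pagination links are extracted, which is why restricting the extractor to the pagination region makes the spider follow "next page" links without wandering into every item link on the page.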