Question

I created a sample website with 5 pages:

index.html
  info.html
  countries.html
    france.html
    germany.html

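The index page contains links to info and countries. The countries page links to france and germany. The france and germany pages have some p tags that I want to scrape.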

I run a simple web server to host this site at http://localhost:8080.
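For anyone reproducing this setup, here is a minimal sketch of one way to serve such a directory, assuming Python 3.7+ and its standard-library `http.server`. The post does not say which server was actually used, and the helper names `make_server` and `serve_in_background` are hypothetical.

```python
import functools
import http.server
import threading


def make_server(directory=".", port=8080):
    """Build an HTTP server that serves static files from `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("localhost", port), handler)


def serve_in_background(server):
    """Run the server in a daemon thread so the caller is not blocked."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread


# Usage (blocks until interrupted; "site" is a hypothetical folder name):
#   make_server(directory="site", port=8080).serve_forever()
```

Passing `port=0` lets the operating system pick a free port, which can then be read from `server.server_address[1]`.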

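I created a scrapy project with a crawl spider, shown below. I would like scrapy to crawl all links, starting from the index. I will create a proper item class, loaders, etc. to scrape the data later, but I can't get scrapy to follow the links.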

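# Scrapy 0.x import paths; from Scrapy 1.0 on these live in
# scrapy.spiders and scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ServerSpider(CrawlSpider):
    name = "server"
    allowed_domains = ["localhost:8080"]
    start_urls = [
        'http://localhost:8080/index.html'
    ]

    # Follow every extracted link and pass each response to parse_links.
    rules = [
        Rule(LinkExtractor(allow='.*'), follow=True, callback='parse_links')
    ]

    def parse_links(self, response):
        # Marker output to confirm that a followed page reached the callback.
        print('>>> parse_links')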

How should I modify the spider?
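One possible cause, offered as a sketch and not a confirmed diagnosis: `allowed_domains` in the spider above includes a port, while offsite filtering typically compares against the URL's hostname, which never contains the port. The snippet below approximates that check with the standard library only. The functions `build_host_regex` and `is_allowed` are hypothetical stand-ins and not Scrapy's actual API.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Approximate the host pattern an offsite filter builds from allowed_domains."""
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def is_allowed(url, allowed_domains):
    """Return True if the URL's hostname (port excluded) matches an allowed domain."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


url = "http://localhost:8080/info.html"
print(urlparse(url).hostname)                # localhost
print(is_allowed(url, ["localhost:8080"]))   # False
print(is_allowed(url, ["localhost"]))        # True
```

If this is indeed what happens here, the change to try in the spider is `allowed_domains = ["localhost"]`, leaving the port in `start_urls` as it is.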

Scrapy CrawlSpider solution for a local site

0 answers: