I am following this tutorial to learn Scrapy, but I've run into a very strange problem: the spider only extracts the URL from start_urls and puts it in data.json. Here is the code I am using:
import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ArticleItem(scrapy.Item):
    url = scrapy.Field()

class ScholarSpider(scrapy.Spider):
    name = "scholar"
    allowed_domains = ["mininova.org/"]
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(LinkExtractor(allow=['/tor/13278067'], deny=['http://www.mininova.org/today']), 'parse')]

    def parse(self, response):
        article = ArticleItem()
        article['url'] = response.url
        return article
For development purposes, I am trying to extract only URLs, and specifically a single URL ending in /tor/13278067. I run the spider like this:
$ scrapy crawl scholar -o data.json
But what I find in data.json is:
[{"url": "http://www.mininova.org/today"}]
Answer 0 (score: 1)
To use rules, you need to inherit from CrawlSpider instead of Spider. However, when inheriting from CrawlSpider, you must not override the parse() method; use a different name for the callback instead. So basically, this is what you need:
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor

class ArticleItem(scrapy.Item):
    url = scrapy.Field()

# inherit from CrawlSpider instead of Spider
class ScholarSpider(CrawlSpider):
    name = "scholar"
    allowed_domains = ["mininova.org"]
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_url')]

    # do not name the callback parse(), because CrawlSpider
    # needs that method for its normal operation
    def parse_url(self, response):
        article = ArticleItem()
        article['url'] = response.url
        return article