Question

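I wrote a spider that finds the next_page URL on the current page, follows it and scrapes that page, then looks for the next_page URL on that page and scrapes it, and so on. It works fine. The only problem is that it skips scraping the page listed in start_urls and always starts scraping from the next page. It should start scraping from the current page, i.e. the one in start_urls, and then follow the next-page links. I know I'm missing something. Please help me understand what is wrong with the spider below.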

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/category"
    ]
    # Follow every "next_page" link and send each followed page to parse_item.
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
             follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

Answer 1

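Try renaming the parse_item() function to parse_start_url() and renaming the callback in the rule to match. parse_start_url() is an undocumented method that CrawlSpider calls for the responses to the start URLs, so overriding it gives you the behavior you want.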

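Alternatively, you can keep the name parse_item and override the method with the following assignment in the class body: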

parse_start_url = parse_item

With that, your code might look like this:

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/category"
    ]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
             follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        # process your item here
        pass

    # Reuse parse_item for the start URLs as well.
    parse_start_url = parse_item
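
To see why the one-line assignment is enough, here is a minimal sketch in plain Python that runs without Scrapy. The names BaseCrawler, MyCrawler, PlainCrawler and handle_start_page are hypothetical stand-ins invented for this illustration. They are not Scrapy's actual implementation. The sketch only assumes, as described above, that the base class routes start-page responses through a parse_start_url hook that returns nothing by default.

```python
class BaseCrawler:
    """Toy stand-in for CrawlSpider (hypothetical, not Scrapy's real code)."""

    def parse_start_url(self, response):
        # Default hook: produce no items for the start page.
        return []

    def handle_start_page(self, response):
        # The base class always goes through the hook for start pages.
        return list(self.parse_start_url(response))


class PlainCrawler(BaseCrawler):
    def parse_item(self, response):
        return [{"page": response}]


class MyCrawler(BaseCrawler):
    def parse_item(self, response):
        return [{"page": response}]

    # Class-level alias: both names now refer to the same function,
    # so the inherited default hook is overridden.
    parse_start_url = parse_item


print(PlainCrawler().handle_start_page("start"))  # start page yields no items
print(MyCrawler().handle_start_page("start"))     # start page goes through parse_item
```

The assignment has to sit inside the class body, after parse_item is defined. Placed at module level it would only create an unrelated global name and the base-class hook would stay in effect.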

How do I include the current page in Scrapy?
