Scrapy fails to crawl recursively when two rules are set

Posted: 2017-04-07 20:29:05

Tags: python scrapy

I've written a script in Scrapy to crawl a website recursively, but for some reason it doesn't. I've tested the xpaths in Sublime and they work fine, so at this point I can't figure out what I'm doing wrong.

"items.py" includes:

import scrapy

class CraigpItem(scrapy.Item):
    Name = scrapy.Field()
    Grading = scrapy.Field()
    Address = scrapy.Field()
    Phone = scrapy.Field()
    Website = scrapy.Field()

The spider, named "craigsp.py", includes:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CraigspSpider(CrawlSpider):
    name = "craigsp"
    allowed_domains = ["craigperler.com"]
    start_urls = ['https://www.americangemsociety.org/en/find-a-jeweler']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//area')),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]'),
             callback='parse_items'),
    ]

    def parse_items(self, response):
        page = response.xpath('//div[@class="page__content"]')
        for titles in page:
            AA = titles.xpath('.//h1[@class="page__heading"]/text()').extract()
            BB = titles.xpath('.//p[@class="appraiser__grading"]/strong/text()').extract()
            CC = titles.xpath('.//p[@class="appraiser__hours"]/text()').extract()
            DD = titles.xpath('.//p[@class="appraiser__phone"]/text()').extract()
            EE = titles.xpath('.//p[@class="appraiser__website"]/a[@class="appraiser__link"]/@href').extract()
            yield {'Name': AA, 'Grading': BB, 'Address': CC, 'Phone': DD, 'Website': EE}

The command I'm running is:

scrapy crawl craigsp -o items.csv

I hope someone can point me in the right direction.

1 answer:

Answer 0 (score: 2)

> Filtered offsite request

This error means that a URL queued by Scrapy failed the allowed_domains check.

You have:

allowed_domains = ["craigperler.com"]

Your spider is trying to crawl http://www.americangemsociety.org. You need to add that domain to the allowed_domains list, or remove the setting entirely.
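To see why the start URL gets filtered, here is a minimal sketch of the kind of check Scrapy's offsite filter performs (a simplified illustration, not Scrapy's actual implementation — the helper `is_allowed` is hypothetical):

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """Simplified offsite check: a URL passes if its host equals an
    allowed domain or is a subdomain of one (e.g. "www." prefixes)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# With the original setting, the start URL is filtered out:
print(is_allowed("https://www.americangemsociety.org/en/find-a-jeweler",
                 ["craigperler.com"]))            # False
# After adding the site's domain, it passes:
print(is_allowed("https://www.americangemsociety.org/en/find-a-jeweler",
                 ["americangemsociety.org"]))     # True
```

So either `allowed_domains = ["americangemsociety.org"]` (subdomains such as `www.` are covered automatically) or dropping the attribute altogether lets the crawl proceed.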