I have written a script in Scrapy to crawl a website recursively, but for some reason it is unable to do so. I have tested the XPaths in Sublime and they work fine, so at this point I can't figure out what I'm doing wrong.
"items.py" contains:
import scrapy

class CraigpItem(scrapy.Item):
    Name = scrapy.Field()
    Grading = scrapy.Field()
    Address = scrapy.Field()
    Phone = scrapy.Field()
    Website = scrapy.Field()
The spider, named "craigsp.py", contains:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CraigspSpider(CrawlSpider):
    name = "craigsp"
    allowed_domains = ["craigperler.com"]
    start_urls = ['https://www.americangemsociety.org/en/find-a-jeweler']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//area')),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]'), callback='parse_items'),
    ]

    def parse_items(self, response):
        page = response.xpath('//div[@class="page__content"]')
        for titles in page:
            AA = titles.xpath('.//h1[@class="page__heading"]/text()').extract()
            BB = titles.xpath('.//p[@class="appraiser__grading"]/strong/text()').extract()
            CC = titles.xpath('.//p[@class="appraiser__hours"]/text()').extract()
            DD = titles.xpath('.//p[@class="appraiser__phone"]/text()').extract()
            EE = titles.xpath('.//p[@class="appraiser__website"]/a[@class="appraiser__link"]/@href').extract()
            yield {'Name': AA, 'Grading': BB, 'Address': CC, 'Phone': DD, 'Website': EE}
The command I am running is:
scrapy crawl craigsp -o items.csv
I hope someone can point me in the right direction.
Answer (score: 2)
Filtered offsite request

This error means that a URL queued by Scrapy did not pass the allowed_domains setting.
You have:
allowed_domains = ["craigperler.com"]
But your spider is trying to crawl http://www.americangemsociety.org. You need to either add that domain to the allowed_domains list or get rid of this setting entirely.
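To see why the requests get filtered, here is a minimal sketch of the kind of suffix check Scrapy's offsite filtering performs against allowed_domains (an illustration only, not Scrapy's actual implementation; the function name is made up for this example):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Rough approximation of the offsite check: a request is kept only
    # if its host equals, or is a subdomain of, an allowed domain.
    host = urlparse(url).netloc.lower()
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

# With the question's setting, the start URL itself is off-site:
print(is_offsite("https://www.americangemsociety.org/en/find-a-jeweler",
                 ["craigperler.com"]))  # True -> request filtered

# Adding the crawled site's domain lets the request through:
print(is_offsite("https://www.americangemsociety.org/en/find-a-jeweler",
                 ["craigperler.com", "americangemsociety.org"]))  # False
```

So the one-line fix in the spider would be along the lines of `allowed_domains = ["americangemsociety.org"]`, or simply deleting the attribute so nothing is filtered.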