I'm new to Scrapy and I can't seem to figure out why I'm running into this problem. I wrote this spider from a simple tutorial and then added Splash, which is up and running.
Here is the code:
livros.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from olx.items import OlxItem
from scrapy_splash import SplashRequest


class LivrosSpider(CrawlSpider):
    name = 'livros'
    allowed_domains = ['www.olx.pt']
    start_urls = ['https://www.olx.pt/lazer/livros-revistas/historia/']

    rules = (
        Rule(LinkExtractor(allow=(), restrict_css=('.pageNextPrev',)),
             callback="parse_item",
             follow=True),
    )

    def parse_item(self, response):
        item_links = response.css('.large > .detailsLink::attr(href)').extract()
        for a in item_links:
            yield SplashRequest(a, callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        title = response.css('h1::text').extract()[0].strip()
        price = response.css('.pricelabel > strong::text').extract()[0]
        item = OlxItem()
        item['title'] = title
        item['price'] = price
        item['url'] = response.url
        yield item
items.py
import scrapy


class OlxItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
settings.py
BOT_NAME = 'olx'

SPIDER_MODULES = ['olx.spiders']
NEWSPIDER_MODULE = 'olx.spiders'

FEED_URI = 'data/%(name)s/%(time)s.json'
FEED_FORMAT = 'json'

ROBOTSTXT_OBEY = True

# Scrapy-Splash settings
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Here is what I keep getting in the terminal:
2018-05-15 07:47:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=7> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=6)
2018-05-15 07:47:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=8> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=7)
2018-05-15 07:47:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=9> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=8)
2018-05-15 07:47:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=10> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=9)
2018-05-15 07:47:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=11> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=10)
2018-05-15 07:47:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=12> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=11)
At the end, the program is supposed to save the data to a JSON file, but the file is always blank. Can you help me figure out what I'm missing?
Answer 0 (score: 0)
The following change worked for me (.x-large instead of .large):
def parse_item(self, response):
    item_links = response.css('.x-large > .detailsLink::attr(href)').extract()
    for a in item_links:
        yield SplashRequest(a, callback=self.parse_detail_page)