Scrapy with Splash still gives DEBUG: Crawled (200)

Posted: 2018-05-15 13:45:32

Tags: python scrapy scrapy-splash

I'm new to Scrapy and I can't seem to figure out why I get this output when running my code. I wrote the spider from a simple tutorial and then added Splash, which is up and running.
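
For context, Splash is assumed here to be started the usual way, via its official Docker image on the default port (any locally reachable Splash instance works the same):

docker run -it -p 8050:8050 scrapinghub/splash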

Here is the code:

livros.py

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from olx.items import OlxItem
from scrapy_splash import SplashRequest

class LivrosSpider(CrawlSpider):
    name = 'livros'
    allowed_domains = ['www.olx.pt']
    start_urls = ['https://www.olx.pt/lazer/livros-revistas/historia/']

    rules = (
        # Follow the pagination links and hand each listing page to parse_item
        Rule(LinkExtractor(allow=(), restrict_css=('.pageNextPrev',)),
             callback="parse_item",
             follow=True),)

    def parse_item(self, response):
        # Collect the links to the individual ads and render each one with Splash
        item_links = response.css('.large > .detailsLink::attr(href)').extract()
        for a in item_links:
            yield SplashRequest(a, callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        # Pull the title and price out of the rendered ad page
        title = response.css('h1::text').extract()[0].strip()
        price = response.css('.pricelabel > strong::text').extract()[0]

        item = OlxItem()
        item['title'] = title
        item['price'] = price
        item['url'] = response.url
        yield item
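
One note on the SplashRequest calls above: they use scrapy-splash's default render.html endpoint with no extra arguments. If the detail pages needed time to render, Splash arguments could be passed through args; a minimal sketch (the 0.5s wait is an arbitrary illustration, not part of the original spider):

yield SplashRequest(a, callback=self.parse_detail_page,
                    args={'wait': 0.5})  # ask Splash to wait 0.5s before returning the rendered HTML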

items.py

import scrapy

class OlxItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

settings.py

BOT_NAME = 'olx'

SPIDER_MODULES = ['olx.spiders']
NEWSPIDER_MODULE = 'olx.spiders'

FEED_URI = 'data/%(name)s/%(time)s.json'
FEED_FORMAT = 'json'

ROBOTSTXT_OBEY = True

# scrapy-splash settings
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
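
The middleware orders, dupefilter, and SPLASH_URL above follow the scrapy-splash README. The README additionally recommends a Splash-aware cache storage whenever Scrapy's HTTP cache is enabled; it is not enabled here, so this line is shown only for completeness:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'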

Here is the output I keep getting in the terminal:

2018-05-15 07:47:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=7> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=6)
2018-05-15 07:47:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=8> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=7)
2018-05-15 07:47:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=9> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=8)
2018-05-15 07:47:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=10> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=9)
2018-05-15 07:47:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=11> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=10)
2018-05-15 07:47:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=12> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=11)

At the end the program should save the data to a JSON file, but the file always comes out blank. Can you help me figure out what I'm missing?

1 Answer:

Answer 0 (score: 0)

The following change works for me: .x-large instead of .large

def parse_item(self, response):
    item_links = response.css('.x-large > .detailsLink::attr(href)').extract()
    for a in item_links:
        yield SplashRequest(a, callback=self.parse_detail_page)
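
A quick way to confirm which class the listing links actually carry is to test the selector in a Scrapy shell before running the full crawl, roughly:

scrapy shell 'https://www.olx.pt/lazer/livros-revistas/historia/'
>>> response.css('.x-large > .detailsLink::attr(href)').extract()[:3]

If this returns hrefs the selector is correct; with .large it comes back empty, which is why the feed file stayed blank.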