Scrapy is not crawling any URLs

Time: 2019-02-27 17:53:08

Tags: python scrapy-spider

I ran my code in the scrapy shell to test my XPath and everything seemed to work, but I can't see why the crawl reports 0 pages. Here is the log output:

    2019-02-27 18:04:47 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jumia)
    2019-02-27 18:04:47 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 2.7.15+ (default, Nov 28 2018, 16:27:22) - [GCC 8.2.0], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Linux-4.19.0-kali1-amd64-x86_64-with-Kali-kali-rolling-kali-rolling
    2019-02-27 18:04:47 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jumia.spiders', 'SPIDER_MODULES': ['jumia.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'jumia'}
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled item pipelines: []
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Spider opened
    2019-02-27 18:04:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-02-27 18:04:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6029
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-02-27 18:04:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 2, 27, 17, 4, 47, 950397),
     'log_count/DEBUG': 1,
     'log_count/INFO': 7,
     'memusage/max': 53383168,
     'memusage/startup': 53383168,
     'start_time': datetime.datetime(2019, 2, 27, 17, 4, 47, 947520)}
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Spider closed (finished)

Here is my spider code:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose
    from scrapy.loader.processors import TakeFirst
    from jumia.items import JumiaItem


    class ProductDetails(scrapy.Spider):
        name = "jumiaProject"
        start_url = ["https://www.jumia.com.ng/computing/hp/"]

        def parse(self, response):
            search_results = response.css('section.products.-mabaya > div')

            for product in search_results:
                product_loader = ItemLoader(item=JumiaItem(), selector=product)
                product_loader.add_css('brand', 'h2.title > span.brand::text')
                product_loader.add_css('name', 'h2.title > span.name::text')
                product_loader.add_css('link', 'a.link::attr(href)')
                yield product_loader.load_item()

Here is my items.py:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy
    from scrapy.loader.processors import MapCompose


    class JumiatesteItem(scrapy.Item):
        # define the fields for your item here like:
        name  = scrapy.Field()
        brand = scrapy.Field()
        price = scrapy.Field()
        link  = scrapy.Field()

1 Answer:

Answer 0 (score: 0)

The correct attribute name in a spider is start_urls, not start_url. Because of the wrong name, Scrapy does not detect any URLs to crawl.
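
A minimal sketch of the corrected spider, keeping the selectors and the JumiaItem import exactly as posted (only the attribute is renamed, and the unused loader-processor imports are dropped):

    import scrapy
    from scrapy.loader import ItemLoader
    from jumia.items import JumiaItem


    class ProductDetails(scrapy.Spider):
        name = "jumiaProject"
        # start_urls (plural) is the attribute Scrapy reads to schedule the
        # initial requests; a misspelled start_url is silently ignored, so the
        # spider opens and immediately closes with 0 pages crawled.
        start_urls = ["https://www.jumia.com.ng/computing/hp/"]

        def parse(self, response):
            # Same selectors as in the question: one loader per product card.
            for product in response.css('section.products.-mabaya > div'):
                loader = ItemLoader(item=JumiaItem(), selector=product)
                loader.add_css('brand', 'h2.title > span.brand::text')
                loader.add_css('name', 'h2.title > span.name::text')
                loader.add_css('link', 'a.link::attr(href)')
                yield loader.load_item()

After the rename, running scrapy crawl jumiaProject should log a GET for the start URL instead of closing immediately. Note also that the posted items.py defines JumiatesteItem while the spider imports JumiaItem; the import has to match whatever class name actually exists in jumia/items.py, otherwise the spider will fail with an ImportError before it crawls anything.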