Scrapy CrawlSpider not crawling

Time: 2018-07-24 21:36:46

Tags: python web-scraping scrapy scrapy-spider

I have read a lot about Scrapy here and on other sites and I still can't solve this problem, so I'm asking you :P Hopefully someone can help me.

I want to log in on the main client-area page, then parse all the categories, then parse all the products, saving each product's title, category, quantity and price.

My code:

# -*- coding: utf-8 -*-

import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class combatzone_spider(CrawlSpider):
    name = 'combatzone_spider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        Rule(LinkExtractor(allow=r'/category.php?id=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'goods.php?id=\d+'),follow=True,callback='parse_items'),
    )

def init_request(self):
    logging.info("You are in initRequest")
    return Request(url=self,callback=self.login)

def login(self,response):
    logging.info("You are in login")
    return scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

def check_login_response(self,response):
    logging.info("You are in checkLogin")
    if "Hola,XXXX" in response.body:
        self.log("Succesfully logged in.")
        return self.initialized()
    else:
        self.log("Something wrong in login.")

def parse_items(self,response):
    logging.info("You are in item")
    item = scrapy.loader.ItemLoader(article(),response)
    item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
    item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
    item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
    item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
    yield item.load_item()

When I run the spider with scrapy crawl in the terminal I get:

(SCRAPY) pi@raspberry:~/SCRAPY/combatzone/combatzone/spiders $ scrapy crawl combatzone_spider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders.init import InitSpider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders.init` is deprecated, use `scrapy.spiders.init` instead
  from scrapy.contrib.spiders.init import InitSpider
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: combatzone)
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.13 (default, Nov 24 2017, 17:33:09) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Linux-4.9.0-6-686-i686-with-debian-9.5
2018-07-24 22:14:53 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'combatzone.spiders', 'SPIDER_MODULES': ['combatzone.spiders'], 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'combatzone'}
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-07-24 22:14:53 [scrapy.core.engine] INFO: Spider opened
2018-07-24 22:14:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 22:14:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7152,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 24, 21, 14, 54, 410938),
 'log_count/INFO': 7,
 'memusage/max': 36139008,
 'memusage/startup': 36139008,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 24, 21, 14, 53, 998619)}
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Spider closed (finished)

The spider doesn't seem to be doing anything. Why could that be? Thanks a lot mates :D

1 Answer:

Answer 0 (score: 1)

There are 2 problems:

  • The first is the regular expressions: the "?" has to be escaped, because it is a regex metacharacter. For example, /category.php?id=\d+ should be changed to /category.php\?id=\d+ (note the \?); see the short sketch after this list.
  • The second is that all the methods should be indented, otherwise they are not defined inside the combatzone_spider class.
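
As an illustration, here is a minimal sketch of just those two fixes applied to the spider from the question (same class and rules as in the question, login left out for now):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class combatzone_spider(CrawlSpider):
    name = 'combatzone_spider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        # "?" is a regex metacharacter, so it must be escaped as "\?"
        Rule(LinkExtractor(allow=r'/category.php\?id=\d+'), follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'), follow=True),
        Rule(LinkExtractor(allow=r'goods.php\?id=\d+'), follow=True, callback='parse_items'),
    )

    # indented by four spaces, so it is now a method of the class
    def parse_items(self, response):
        self.logger.info("You are in item")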

As for the login, I tried to get your code to work but failed. I usually override start_requests so that the spider logs in before it starts crawling.

The code is as follows:

# -*- coding: utf-8 -*-

import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class CombatZoneSpider(CrawlSpider):
    name = 'CombatZoneSpider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        # escape "?"
        Rule(LinkExtractor(allow=r'category.php\?id=\d+'),follow=False),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=False),
        Rule(LinkExtractor(allow=r'goods.php\?id=\d+'),follow=False,callback='parse_items'),
    )

    def parse_items(self,response):
        logging.info("You are in item")

        # This is used to print the results
        selector = scrapy.Selector(response=response)
        res = selector.xpath("/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()").extract()
        self.logger.info(res)

        # item = scrapy.loader.ItemLoader(article(),response)
        # item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
        # item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
        # item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
        # item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
        # yield item.load_item()

    # login part
    # I didn't test whether the login works because I don't have an account, but it will print something to the console.

    def start_requests(self):
        logging.info("You are in initRequest")
        return [scrapy.Request(url="http://www.combatzone.es/areadeclientes/user.php",callback=self.login)]

    def login(self,response):
        logging.info("You are in login")

        # generate the start_urls again:
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

        # yield scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

    # def check_login_response(self,response):
    #     logging.info("You are in checkLogin")
    #     if "Hola,XXXX" in response.body:
    #         self.log("Succesfully logged in.")
    #         return self.initialized()
    #     else:
    #         self.log("Something wrong in login.")
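
For completeness, here is a hedged sketch of how the commented-out login flow above could be wired back in. The user.php URL, the ECS_LOGINFORM form name, the XXXX/YYYY credentials and the "Hola,XXXX" greeting all come from the question and are untested assumptions, and the class name is made up for this sketch:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CombatZoneLoginSketch(CrawlSpider):
    name = 'combatzone_login_sketch'  # hypothetical name, only for this sketch
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        # rules from the question, with the "?" escaped
        Rule(LinkExtractor(allow=r'category.php\?id=\d+'), follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'), follow=True),
        Rule(LinkExtractor(allow=r'goods.php\?id=\d+'), follow=True, callback='parse_items'),
    )

    def start_requests(self):
        # fetch the login page first instead of the start_urls
        yield scrapy.Request('http://www.combatzone.es/areadeclientes/user.php',
                             callback=self.login)

    def login(self, response):
        # submit the login form; form name and credentials are taken from the question
        yield scrapy.FormRequest.from_response(
            response,
            formname='ECS_LOGINFORM',
            formdata={'username': 'XXXX', 'password': 'YYYY'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # the greeting text is an assumption taken from the question
        if 'Hola,XXXX' in response.text:
            self.logger.info('Successfully logged in.')
            # start the normal CrawlSpider run only once the session cookie is set;
            # requests without a callback go through CrawlSpider.parse and the rules
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True)
        else:
            self.logger.error('Login appears to have failed.')

    def parse_items(self, response):
        self.logger.info('Product page: %s', response.url)

If the login succeeds, the three rules then take over and parse_items is called for each goods.php page, where the ItemLoader code from the question can be reinstated.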