I've read a lot about Scrapy here and on other sites, but I can't solve this problem, so I'm asking you :P I hope someone can help me.
I want to log in on the main client page, then follow every category, then every product, and save each product's title, category, quantity and price.
My code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging
class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class combatzone_spider(CrawlSpider):
    name = 'combatzone_spider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        Rule(LinkExtractor(allow=r'/category.php?id=\d+'), follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'), follow=True),
        Rule(LinkExtractor(allow=r'goods.php?id=\d+'), follow=True, callback='parse_items'),
    )

    def init_request(self):
        logging.info("You are in initRequest")
        return Request(url=self, callback=self.login)

    def login(self, response):
        logging.info("You are in login")
        return scrapy.FormRequest.from_response(response, formname='ECS_LOGINFORM', formdata={'username': 'XXXX', 'password': 'YYYY'}, callback=self.check_login_response)

    def check_login_response(self, response):
        logging.info("You are in checkLogin")
        if "Hola,XXXX" in response.body:
            self.log("Succesfully logged in.")
            return self.initialized()
        else:
            self.log("Something wrong in login.")

    def parse_items(self, response):
        logging.info("You are in item")
        item = scrapy.loader.ItemLoader(article(), response)
        item.add_xpath('category', '/html/body/div[3]/div[2]/div[2]/a[2]/text()')
        item.add_xpath('title', '/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
        item.add_xpath('quantity', '//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
        item.add_xpath('price', '//*[@id="ECS_RANKPRICE_2"]/text()')
        yield item.load_item()
When I run the spider from the terminal I get:
(SCRAPY)pi@raspberry:~/SCRAPY/combatzone/combatzone/spiders$ scrapy crawl combatzone_spider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders.init import InitSpider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders.init` is deprecated, use `scrapy.spiders.init` instead
  from scrapy.contrib.spiders.init import InitSpider
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: combatzone)
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.13 (default, Nov 24 2017, 17:33:09) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.3, Platform Linux-4.9.0-6-686-i686-with-debian-9.5
2018-07-24 22:14:53 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'combatzone.spiders', 'SPIDER_MODULES': ['combatzone.spiders'], 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'combatzone'}
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-07-24 22:14:53 [scrapy.core.engine] INFO: Spider opened
2018-07-24 22:14:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 22:14:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7152,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 24, 21, 14, 54, 410938),
 'log_count/INFO': 7,
 'memusage/max': 36139008,
 'memusage/startup': 36139008,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 24, 21, 14, 53, 998619)}
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Spider closed (finished)
The spider doesn't seem to do any work. Why is that? Thanks a lot, mates :D
Answer 0 (score: 1)
There are two problems:

/category.php?id=\d+ should be changed to /category.php\?id=\d+ (note the "\?"). In a regular expression an unescaped ? is a quantifier meaning "zero or one of the previous character", so the pattern never matches the literal question mark in the query string and the LinkExtractor extracts nothing.

Regarding the login, I tried to get your code working but failed. I usually override start_requests so the spider logs in before it starts crawling.
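The escaping issue can be reproduced with Python's `re` module alone; the URL below is a made-up example in the style of the site's links:

```python
import re

url = "/category.php?id=12"

# Unescaped "?" is a quantifier: "p?" means "optional p", so this
# pattern would match "/category.phid=12" but never the real URL.
broken = re.compile(r'/category.php?id=\d+')
# Escaped "\?" matches a literal question mark.
fixed = re.compile(r'/category.php\?id=\d+')

print(broken.search(url))  # None: the literal "?" is never matched
print(fixed.search(url))   # a match object
```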
The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class CombatZoneSpider(CrawlSpider):
    name = 'CombatZoneSpider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        # escape "?"
        Rule(LinkExtractor(allow=r'category.php\?id=\d+'), follow=False),
        Rule(LinkExtractor(allow=r'&page=\d+'), follow=False),
        Rule(LinkExtractor(allow=r'goods.php\?id=\d+'), follow=False, callback='parse_items'),
    )

    def parse_items(self, response):
        logging.info("You are in item")
        # This is used to print the results
        selector = scrapy.Selector(response=response)
        res = selector.xpath("/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()").extract()
        self.logger.info(res)
        # item = scrapy.loader.ItemLoader(article(), response)
        # item.add_xpath('category', '/html/body/div[3]/div[2]/div[2]/a[2]/text()')
        # item.add_xpath('title', '/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
        # item.add_xpath('quantity', '//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
        # item.add_xpath('price', '//*[@id="ECS_RANKPRICE_2"]/text()')
        # yield item.load_item()

    # login part
    # I didn't test whether the login actually works because I have no account,
    # but these callbacks will print something to the console.
    def start_requests(self):
        logging.info("You are in initRequest")
        return [scrapy.Request(url="http://www.combatzone.es/areadeclientes/user.php", callback=self.login)]

    def login(self, response):
        logging.info("You are in login")
        # generate the start_urls again:
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
        # yield scrapy.FormRequest.from_response(response, formname='ECS_LOGINFORM', formdata={'username': 'XXXX', 'password': 'YYYY'}, callback=self.check_login_response)

    # def check_login_response(self, response):
    #     logging.info("You are in checkLogin")
    #     if "Hola,XXXX" in response.body:
    #         self.log("Succesfully logged in.")
    #         return self.initialized()
    #     else:
    #         self.log("Something wrong in login.")
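To see what the corrected rules will and won't extract, you can test the three patterns against some sample hrefs with plain `re` (the links below are hypothetical, in the style of the shop's pages; I've also escaped the dot, which the patterns above leave unescaped but which `.` happens to match anyway):

```python
import re

# Hypothetical links in the style of the shop's pages (made-up ids):
links = [
    "http://www.combatzone.es/areadeclientes/category.php?id=5",
    "http://www.combatzone.es/areadeclientes/category.php?id=5&page=2",
    "http://www.combatzone.es/areadeclientes/goods.php?id=123",
    "http://www.combatzone.es/areadeclientes/user.php",
]

patterns = {
    "category": re.compile(r'category\.php\?id=\d+'),
    "page": re.compile(r'&page=\d+'),
    "goods": re.compile(r'goods\.php\?id=\d+'),
}

for link in links:
    matched = [name for name, p in patterns.items() if p.search(link)]
    print(link, "->", matched)  # user.php matches none of the rules
```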