After the items are generated, the scraping does not stop.
Usually I scrape two pages at a time: the first to collect the links, the second for the content.
But now I have to scrape a third page.
Is it possible to detect the moment when the spider is no longer scraping anything?
def parse(self, response):
    self.logger.info('response.url %s', response.url)
    while True:
        try:
            posts = response.xpath(self.ITEM_RESULT)
            num = 0
            for post in posts:
                item = SpiderItem()
                # self.logger.info('post ciclo %s', post.xpath(self.URL_CSS).extract()[num])
                item['url'] = sito + post.xpath(self.URL_CSS).extract()[num]
                num += 1
                if item['url']:
                    # self.logger.info('post ciclo %s', item['url'])
                    yield Request(url=item['url'], callback=self.parse_ad, meta={'item': item})
        except IndexError:  # no more links on the page
            break
def parse_ad(self, response):
    item = response.meta['item']
    single_ad = Selector(response)
    img_url = ''
    img_url_two = ''
    self.logger.info('RESPONSE URL PARSE = %s', response.url)
    nexta = response.xpath('//div[contains(@class, "anuRefBox")]/b').extract_first()
    nexta = comm.controls.eta(nexta)
    link_frame = 'http://www.xxxx.com/xxx-xxx/?id=' + nexta
    try:
        item['description'] = single_ad.xpath(self.DESCRIPTION_CSS).extract_first()
        if item['description']:
            item['description'] = comm.controls.descrizione(item['description'])
    except IndexError:
        item['description'] = ''
    today = datetime.date.today()
    now = today.strftime('%Y-%m-%d')
    item['date'] = now
    if link_frame:
        # self.logger.info('post ciclo %s', item['url'])
        # sleep(2)
        yield Request(url=link_frame, callback=self.parse_frame_ad, meta={'item': item}, dont_filter=True)
def parse_frame_ad(self, response):
    if b'ConnectionWrongStateError' in response.body:  # response.body is bytes
        raise scrapy.exceptions.CloseSpider('ConnectionWrongStateError')
    item = response.meta['item']
    single_frame_ad = Selector(response)
    self.logger.info('RESPONSE FRAME URL PARSE = %s', response.url)
    ......
    item['session_path'] = now
    # self.logger.info('images %s', images)
    self.logger.info('PATH %s', item['session_path'])
    if item:
        yield item
    else:
        raise DropItem("Missing item %s" % item)
def close(self, spider):
    today = datetime.datetime.now()
    now = today.strftime('%Y-%m-%d %H_%M_%S')
    self.logger.info('SCRAPY FINE %s', now)
Now the spider does not stop. It seems to be waiting for something:

2019-09-12 10:07:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 1 items/min)
2019-09-12 10:08:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)
2019-09-12 10:09:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)
2019-09-12 10:10:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)
2019-09-12 10:11:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)
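In case it is relevant: as a stopgap I could bound a stall like this with Scrapy's CloseSpider extension settings, which shut the spider down after a time or item limit (a sketch; the values below are arbitrary examples, not recommendations):

```python
# settings.py -- sketch; the values are arbitrary examples
# Close the spider if it has been running longer than 600 seconds
CLOSESPIDER_TIMEOUT = 600
# Close the spider once 1000 items have been scraped
CLOSESPIDER_ITEMCOUNT = 1000
```

But I would prefer to understand why it keeps waiting rather than kill it on a timer.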