Scrapy does not stop

Date: 2019-09-12 10:34:04

Tags: python-2.7 scrapy centos7

After yielding the items, Scrapy does not stop.

Usually I scrape two pages per item: the first to take the links, the second for the content.

But now I have to scrape a third page.

Is it possible to detect that it is no longer scraping anything?
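(Not part of the original question, but relevant to it: Scrapy ships a built-in CloseSpider extension whose settings can bound a crawl that hangs. A minimal sketch for settings.py or a spider's custom_settings; the numeric values are illustrative, not recommendations.)

```python
# Sketch: settings for Scrapy's built-in CloseSpider extension.
# These setting names are real Scrapy settings; the values are examples.
CLOSESPIDER_TIMEOUT = 3600     # close the spider after it has been open this many seconds
CLOSESPIDER_ITEMCOUNT = 1000   # close after scraping this many items
CLOSESPIDER_PAGECOUNT = 500    # close after crawling this many responses
CLOSESPIDER_ERRORCOUNT = 10    # close after this many errors

# Separately, DOWNLOAD_TIMEOUT bounds how long a single request may wait
# before the downloader gives up (Scrapy's default is 180 seconds).
DOWNLOAD_TIMEOUT = 60
```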

def parse(self, response):
    self.logger.info('response.url %s', response.url)

    while True:
        try:
            posts = response.xpath(self.ITEM_RESULT)
            num = 0

            for post in posts:
                item = SpiderItem()
                # self.logger.info('post ciclo %s', post.xpath(self.URL_CSS).extract()[num])
                item['url'] = sito + post.xpath(self.URL_CSS).extract()[num]
                num += 1
                if item['url']:
                    # self.logger.info('post ciclo %s', item['url'])

                    yield Request(url=item['url'], callback=self.parse_ad, meta={'item': item})
        except IndexError:
            # extract()[num] ran past the end of the post list: all links queued
            break
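(Aside, not from the question: the counter-based indexing in parse, extract()[num] with a manually incremented num inside while True plus a bare except, is fragile because it relies on an eventual IndexError to terminate. A plain loop that extracts the URL from each post individually needs no counter and no exception. A minimal sketch with stand-in data; posts here are plain dicts, whereas in the real spider they would be selectors and the lookup would be post.xpath(self.URL_CSS).extract_first().)

```python
# Sketch of the link-extraction loop without the counter/while-True pattern.
# `posts` stands in for the selector list; `sito` for the site prefix used
# in the original code. Both are hypothetical here.
sito = 'http://www.example.com'

posts = [
    {'url': '/ad/1'},
    {'url': '/ad/2'},
    {'url': None},      # a post with no link is simply skipped
]

def extract_links(posts, prefix):
    """Yield one absolute URL per post; terminates when the list is exhausted."""
    for post in posts:
        relative = post.get('url')   # in Scrapy: post.xpath(URL_CSS).extract_first()
        if relative:
            yield prefix + relative

links = list(extract_links(posts, sito))
```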

def parse_ad(self, response):
    item = response.meta['item']
    single_ad = Selector(response)

    img_url = ''
    img_url_two = ''

    self.logger.info('RESPONSE URL PARSE = %s', response.url)

    nexta = response.xpath('//div[contains(@class, "anuRefBox")]/b').extract_first()
    nexta = comm.controls.eta(nexta)

    link_frame = 'http://www.xxxx.com/xxx-xxx/?id=' + nexta
    try:
        item['description'] = single_ad.xpath(self.DESCRIPTION_CSS).extract_first()
        if item['description']:
            item['description'] = comm.controls.descrizione(item['description'])
    except IndexError:
        item['description'] = ''

    today = datetime.date.today()
    now = today.strftime('%Y-%m-%d')
    item['date'] = now

    if link_frame:
        # self.logger.info('post ciclo %s', item['url'])
        # sleep(2)
        yield Request(url=link_frame, callback=self.parse_frame_ad, meta={'item': item}, dont_filter=True)

def parse_frame_ad(self, response):
    if 'ConnectionWrongStateError' in response.body:
        raise scrapy.exceptions.CloseSpider('ConnectionWrongStateError')

    item = response.meta['item']
    single_frame_ad = Selector(response)
    self.logger.info('RESPONSE FRAME URL PARSE = %s', response.url)

......

    item['session_path'] = now
    # self.logger.info('images %s', images)
    self.logger.info('PATH %s', item['session_path'])

    if item:
        yield item
    else:
        raise DropItem("Missing item %s" % item)

def close(self, spider):
    today = datetime.datetime.now()
    now = today.strftime('%Y-%m-%d %H_%M_%S')
    self.logger.info('SCRAPY FINE %s', now)

Now Scrapy does not stop. It seems to be waiting for something:

2019-09-12 10:07:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 1 items/min)
2019-09-12 10:08:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)
2019-09-12 10:09:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)
2019-09-12 10:10:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)
2019-09-12 10:11:59 [scrapy.extensions.logstats] INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)
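(Editorial note, not from the question: the log shows the crawl frozen at 116 pages with a rate of 0 pages/min for several minutes in a row, the signature of requests that never complete or callbacks that never yield. As an illustration only, a stall like this can be detected mechanically from consecutive logstats lines; the hypothetical helper below assumes the exact logstats line format shown above.)

```python
import re

# Hypothetical helper: report a stall when `threshold` consecutive logstats
# lines show a page rate of 0 pages/min. The regex follows the logstats
# format in the excerpt above.
RATE_RE = re.compile(r'Crawled (\d+) pages \(at (\d+) pages/min\)')

def is_stalled(log_lines, threshold=3):
    zero_streak = 0
    for line in log_lines:
        m = RATE_RE.search(line)
        if not m:
            continue  # not a logstats line
        zero_streak = zero_streak + 1 if m.group(2) == '0' else 0
        if zero_streak >= threshold:
            return True
    return False

lines = [
    'INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 1 items/min)',
    'INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)',
    'INFO: Crawled 116 pages (at 0 pages/min), scraped 29 items (at 0 items/min)',
]
```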

0 Answers:

There are no answers yet.