Function in a BaseSpider class to yield requests

Asked: 2017-10-31 10:22:07

Tags: python scrapy

I am trying to create a function that handles tasks duplicated across several of my spiders. It involves yielding requests, which seems to break it. This question is a follow-up to this question.

import scrapy
import json
import re

class BaseSpider(scrapy.Spider):

    start_urls = {}

    def test(self, response, cb, xpath):
        """Yield a request for every URL matched by xpath, handled by callback cb."""
        self.logger.info('Success')
        for url in response.xpath(xpath).extract():
            req = scrapy.Request(response.urljoin(url), callback=cb)
            req.meta['category'] = response.meta.get('category')
            yield req

With `yield req` in the code, the "Success" logger suddenly stops working, and the callback never seems to be called. When `yield req` is commented out, the logger does print "Success". I don't think the problem lies in the spider itself, but here is its code anyway:

# -*- coding: utf-8 -*-
import scrapy
from crawling.spiders import BaseSpider

class testContactsSpider(BaseSpider):
    """ Test spider """
    name = "test"
    start_urls = {}
    start_urls['test'] = 'http://www.thewatchobserver.fr/petites-annonces-montres#.WfMaIxO0Pm3'

    def parse(self,response):
        self.logger.info('Base page: %s', response.url)
        self.test(response, self.parse_page, '//h3/a/@href')

    def parse_page(self, response):
        self.logger.info('Page: %s', response.url)

1 answer:

Answer 0 (score: 1)

I think you need something like this:

The test method yields its results, which is why it returns a generator. Try the code below, and read Understanding Generators in Python.

    def parse(self, response):
        self.logger.info('Base page: %s', response.url)
        for req in self.test(response, self.parse_page, '//h3/a/@href'):
            yield req
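On Python 3 the explicit loop can also be written with `yield from` delegation (PEP 380). A minimal standalone sketch of the same pattern, where `make_items` is a hypothetical stand-in for the BaseSpider.test helper (the names here are illustrative, not Scrapy API):

```python
def make_items(values):
    """A generator helper, standing in for BaseSpider.test."""
    for v in values:
        yield v * 2

def parse_loop(values):
    # Explicit loop, as in the answer above: re-yield each item.
    for item in make_items(values):
        yield item

def parse_delegate(values):
    # Equivalent on Python 3, using "yield from" delegation (PEP 380).
    yield from make_items(values)

# Both callers consume the inner generator and produce the same items.
assert list(parse_loop([1, 2, 3])) == [2, 4, 6]
assert list(parse_delegate([1, 2, 3])) == [2, 4, 6]
```

Either form works; the key point is that something must iterate the generator returned by the helper.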


This example shows what happens when a generator is and is not consumed:

def test():
  print('Inside generator!')
  for i in range(5):
    yield i

print('============')
g = test()  # save the generator as a variable; nothing runs yet
test()      # calling the function alone also runs nothing
print('============')
print(next(g))  # advance the "g" generator; now its body finally runs
print(next(g))
print('============')
print(next(test()))  # next() on a newly created generator
print(next(test()))  # a second new generator starts over from the beginning
print('============')
for i in test():  # iterate over each element the generator yields
  print(i)
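The laziness shown above can be checked directly: merely calling a generator function never executes its body, only iterating it does. A minimal check, with a side-effect list standing in for the print calls:

```python
calls = []

def gen():
    calls.append('ran')  # side effect, standing in for print()
    yield 1

gen()                    # creates a generator object; the body has NOT run
assert calls == []

assert list(gen()) == [1]  # iterating the generator finally runs the body
assert calls == ['ran']
```

This is exactly why the spider's callback never fires: `self.test(...)` only builds a generator, and nothing ever iterates it.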

Here, by contrast, nothing ever asks the generator for its next element, which is why its body is never executed:

self.test(response, self.parse_page, '//h3/a/@href')