I am trying to create a function that handles tasks that repeat across multiple spiders. It involves yielding requests, which seems to break it. This question is a follow-up to this question.
import scrapy
import json
import re

class BaseSpider(scrapy.Spider):
    start_urls = {}

    def test(self, response, cb, xpath):
        self.logger.info('Success')
        for url in response.xpath(xpath).extract():
            req = scrapy.Request(response.urljoin(url), callback=cb)
            req.meta['category'] = response.meta.get('category')
            yield req
When yield req is present in the code, the "Success" message is suddenly no longer logged and the callback does not appear to be called. When yield req is commented out, the logger does print "Success". Although I don't think the problem is in the spider itself, here is the spider's code:
# -*- coding: utf-8 -*-
import scrapy
from crawling.spiders import BaseSpider

class testContactsSpider(BaseSpider):
    """ Test spider """
    name = "test"

    start_urls = {}
    start_urls['test'] = 'http://www.thewatchobserver.fr/petites-annonces-montres#.WfMaIxO0Pm3'

    def parse(self, response):
        self.logger.info('Base page: %s', response.url)
        self.test(response, self.parse_page, '//h3/a/@href')

    def parse_page(self, response):
        self.logger.info('Page: %s', response.url)
Answer 0 (score: 1)
I think you need something like this. The test method yields its results, which is why it returns a generator. Try the code below, and read Understanding Generators in Python:
def parse(self, response):
    self.logger.info('Base page: %s', response.url)
    for req in self.test(response, self.parse_page, '//h3/a/@href'):
        yield req
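On Python 3.3+, the explicit loop above can also be written as yield from self.test(...). A minimal plain-Python sketch of that delegation (the names numbers and delegating are illustrative, not from the question):

```python
def numbers():
    # a tiny generator standing in for BaseSpider.test
    yield 1
    yield 2

def delegating():
    # equivalent to: for n in numbers(): yield n
    yield from numbers()

print(list(delegating()))  # [1, 2]
```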
Here is a simple example of how a generator behaves:
def test():
    print('Inside generator!')
    for i in range(5):
        yield i

print('============')
g = test()  # save as variable
test()      # just calling the function does nothing
print('============')
print(next(g))  # next() of the "g" generator
print(next(g))
print('============')
print(next(test()))  # next() of a newly created generator
print(next(test()))
print('============')
for i in test():  # loop over each element the generator yields
    print(i)
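To make the laziness explicit: the generator body does not run until the first next() call, and it resumes after the yield on each subsequent call. A minimal restatement of the snippet above:

```python
def test():
    print('Inside generator!')
    for i in range(5):
        yield i

g = test()       # no output yet: the body has not started
first = next(g)  # prints 'Inside generator!' and returns 0
second = next(g) # resumes after the yield and returns 1
print(first, second)  # 0 1
```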
In your original code you never ask the generator for its next element, so its body never runs; a generator object is merely created and discarded when you call:

self.test(response, self.parse_page, '//h3/a/@href')
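The same failure can be reproduced without Scrapy. In this sketch, helper stands in for BaseSpider.test and the names are illustrative only:

```python
def helper():
    print('Success')  # stands in for self.logger.info('Success')
    yield 'request'   # stands in for yield req

def parse_wrong():
    helper()  # generator object created and discarded; body never runs

def parse_right():
    for item in helper():  # iterating drives the generator
        yield item

parse_wrong()                # prints nothing
items = list(parse_right())  # prints 'Success'
print(items)                 # ['request']
```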