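I am trying to write a generic crawler for a single web page that is called with the following arguments:
The URL and allowed-domains arguments seem to work, but I cannot get the XPath argument to work.
I am guessing I need to declare a variable to hold it properly, since the other two arguments are assigned to existing class attributes.
Here is my spider: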
import scrapy
from Spotlite.items import SpotliteItem


class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, start_url=None, allowed_domains=None, xpath_string=None, *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['%s' % start_url]
        self.allowed_domains = ['%s' % allowed_domains]
        xpath_string = ['%s' % xpath_string]

    def parse(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = SpotliteItem()
        item['url'] = response.url
        item['price'] = response.xpath(xpath_string).extract()
        return item
I get the following error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/ubuntu/spotlite/spotlite/spiders/generic.py", line 23, in parse
    item['price'] = response.xpath(xpath_string).extract()
NameError: global name 'xpath_string' is not defined
Any help would be greatly appreciated!
Thanks,
Michael
Answer 0 (score: 1)
Make xpath_string an instance variable:
import scrapy
from Spotlite.items import SpotliteItem


class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, start_url=None, allowed_domains=None, xpath_string=None, *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['%s' % start_url]
        self.allowed_domains = ['%s' % allowed_domains]
        self.xpath_string = xpath_string

    def parse(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = SpotliteItem()
        item['url'] = response.url
        item['price'] = response.xpath(self.xpath_string).extract()
        return item
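
To see why the original version raises NameError, here is a minimal sketch in plain Python with no Scrapy involved. The class names LocalOnly and OnInstance and the helper try_parse are made up for illustration; only the attribute handling mirrors the spider above. A name assigned inside __init__ without the self. prefix is local to that method, so parse later looks for a global of the same name and fails.

```python
class LocalOnly(object):
    """Hypothetical stand-in for the spider in the question."""

    def __init__(self, xpath_string=None):
        # Rebinds only the local name; nothing is stored on the instance.
        xpath_string = ['%s' % xpath_string]

    def parse(self):
        # No local or global called xpath_string exists here.
        return xpath_string


class OnInstance(object):
    """Hypothetical stand-in for the corrected spider."""

    def __init__(self, xpath_string=None):
        # Stored on the instance, so every method can reach it via self.
        self.xpath_string = xpath_string

    def parse(self):
        return self.xpath_string


def try_parse(obj):
    # Report the NameError as text instead of letting it propagate.
    try:
        return obj.parse()
    except NameError as exc:
        return 'NameError: %s' % exc


local_result = try_parse(LocalOnly('//span/text()'))
instance_result = try_parse(OnInstance('//span/text()'))
print(local_result)
print(instance_result)
```

Note that the corrected spider also stores the plain string rather than wrapping it in a list, which is the form it then hands to response.xpath.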
Answer 1 (score: 0)
Adding the variable to the initial class declaration solved the problem:
import scrapy
from spotlite.items import SpotliteItem


class GenericSpider(scrapy.Spider):
    name = "generic"
    xpath_string = ""

    def __init__(self, start_url, allowed_domains, xpath_string, *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['%s' % start_url]
        self.allowed_domains = ['%s' % allowed_domains]
        self.xpath_string = xpath_string

    def parse(self, response):
        self.logger.info('URL is %s', response.url)
        self.logger.info('xPath is %s', self.xpath_string)
        item = SpotliteItem()
        item['url'] = response.url
        item['price'] = response.xpath(self.xpath_string).extract()
        return item
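
As a side note on the class-level xpath_string = "" line: as far as plain Python attribute lookup goes, the assignment to self.xpath_string in __init__ is what parse ends up reading, and the class attribute only acts as a fallback default. The sketch below uses two hypothetical classes (WithClassDefault and WithoutClassDefault, not part of the spider) to compare the two layouts outside Scrapy.

```python
class WithClassDefault(object):
    xpath_string = ""  # class-level default, shared by all instances

    def __init__(self, xpath_string):
        # The instance attribute shadows the class attribute of the same name.
        self.xpath_string = xpath_string

    def parse(self):
        return self.xpath_string


class WithoutClassDefault(object):
    def __init__(self, xpath_string):
        self.xpath_string = xpath_string

    def parse(self):
        return self.xpath_string


with_default = WithClassDefault('//span/text()').parse()
without_default = WithoutClassDefault('//span/text()').parse()
class_level_value = WithClassDefault.xpath_string

print(with_default == without_default)
print(repr(class_level_value))
```

Both layouts return the value passed to the constructor, and the class attribute itself stays unchanged. Keeping the default is harmless and can be handy if the attribute might be read before __init__ assigns it.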