Question

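I am trying to write a generic crawler for a single web page that is invoked with the following arguments:

Allowed domain
URL to scrape
xPath used to extract the price from the page

The URL and allowed-domain arguments seem to work, but I cannot get the xPath argument to work.

I guess I need to declare a variable to hold it properly, since the other two arguments are assigned to existing class attributes.

Here is my spider: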

import scrapy
from Spotlite.items import SpotliteItem

class GenericSpider(scrapy.Spider):
   name = "generic"

   def __init__(self, start_url=None, allowed_domains=None, xpath_string=None, *args, **kwargs):
      super(GenericSpider, self).__init__(*args, **kwargs)
      self.start_urls = ['%s' % start_url]
      self.allowed_domains = ['%s' % allowed_domains]
      xpath_string = ['%s' % xpath_string]

   def parse(self, response):
      self.logger.info('Hi, this is an item page! %s', response.url)
      item = SpotliteItem()
      item['url'] = response.url
      item['price'] = response.xpath(xpath_string).extract()
      return item

I get the following error:

Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/home/ubuntu/spotlite/spotlite/spiders/generic.py", line 23, in parse
     item['price'] = response.xpath(xpath_string).extract()

NameError: global name 'xpath_string' is not defined

Any help would be greatly appreciated!

Thanks,

Michael

Answer 1

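Make xpath_string an instance variable. In your version it is only a local name inside __init__, so parse cannot see it. Also store it as a plain string rather than wrapping it in a list, because response.xpath expects a string: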

import scrapy
from Spotlite.items import SpotliteItem

class GenericSpider(scrapy.Spider):
   name = "generic"

   def __init__(self, start_url=None, allowed_domains=None, xpath_string=None, *args, **kwargs):
      super(GenericSpider, self).__init__(*args, **kwargs)
      self.start_urls = ['%s' % start_url]
      self.allowed_domains = ['%s' % allowed_domains]
      self.xpath_string = xpath_string

   def parse(self, response):
      self.logger.info('Hi, this is an item page! %s', response.url)
      item = SpotliteItem()
      item['url'] = response.url
      item['price'] = response.xpath(self.xpath_string).extract()
      return item
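
To see why the original version fails, here is a minimal sketch in plain Python with no Scrapy involved. The class names LocalOnly and StoredOnSelf and the helper try_parse are made up for illustration, and the XPath value is just a placeholder. The first class only binds the argument to a local name, the way the question's __init__ does; the second stores it on self, as in the answer above.

```python
class LocalOnly(object):
    """Hypothetical stand-in for the question's spider."""

    def __init__(self, xpath_string=None):
        # Local name only: it is discarded as soon as __init__ returns.
        xpath_string = ['%s' % xpath_string]

    def parse(self):
        # No local or global called xpath_string exists here -> NameError.
        return xpath_string


class StoredOnSelf(object):
    """Hypothetical stand-in for the fixed spider."""

    def __init__(self, xpath_string=None):
        # Stored on the instance, so every method can reach it via self.
        self.xpath_string = xpath_string

    def parse(self):
        return self.xpath_string


def try_parse(obj):
    """Call obj.parse() and report a NameError as text instead of raising."""
    try:
        return obj.parse()
    except NameError as exc:
        return "NameError: %s" % exc


placeholder_xpath = "//span/text()"  # placeholder value, not from the question
print(try_parse(LocalOnly(placeholder_xpath)))
print(try_parse(StoredOnSelf(placeholder_xpath)))
```

The first call reports a NameError like the one in the traceback, while the second returns the string that was passed in. When running the real spider, arguments of this kind are normally supplied on the command line with scrapy crawl generic -a name=value, one -a per argument.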

Answer 2

Adding the variable to the initial class declaration solved the problem.

import scrapy
from spotlite.items import SpotliteItem


class GenericSpider(scrapy.Spider):
   name = "generic"
   xpath_string = ""

   def __init__(self, start_url, allowed_domains, xpath_string, *args, **kwargs):
       super(GenericSpider, self).__init__(*args, **kwargs)
       self.start_urls = ['%s' % start_url]
       self.allowed_domains = ['%s' % allowed_domains]
       self.xpath_string = xpath_string

   def parse(self, response):
       self.logger.info('URL is %s', response.url)
       self.logger.info('xPath is %s', self.xpath_string)
       item = SpotliteItem()
       item['url'] = response.url
       item['price'] = response.xpath(self.xpath_string).extract()
       return item
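
A note on the class-level xpath_string = "" line: as far as I can tell it only provides a default, and the assignment to self.xpath_string in __init__ is what makes the value visible in parse. The sketch below is plain Python with a made-up class name (Generic) and a placeholder XPath, meant only to show how the instance attribute shadows the class default.

```python
class Generic(object):
    """Hypothetical class mirroring the attribute layout of the spider above."""

    xpath_string = ""  # class-level default, shared by all instances

    def __init__(self, xpath_string):
        # The instance attribute shadows the class-level default.
        self.xpath_string = xpath_string

    def parse(self):
        return self.xpath_string


spider = Generic("//span[@class='price']/text()")  # placeholder XPath
print(spider.parse())        # the value passed to __init__
print(Generic.xpath_string)  # the class default is left untouched
```

So the class attribute is harmless but optional here; removing it should leave the behavior unchanged as long as __init__ always sets self.xpath_string.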

Passing an xPath as an argument to Scrapy
