Using a Scrapy crawl spider with arguments in the spider's __init__

Asked: 2017-04-15 18:40:17

Tags: python-2.7 scrapy web-crawler scrapy-spider scrapyd

I'm trying to use a Scrapy crawl spider to pull down some real estate data, but it keeps giving me this error:

Traceback (most recent call last):
  File "//anaconda/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 96, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
TypeError: __init__() takes exactly 3 arguments (1 given)

Here is the code that defines the crawler:

import re

import scrapy
import scrapy.conf
import scrapy.crawler
import scrapy.linkextractors.sgml
import scrapy.selector
import scrapy.spiders

# RealestateItem is assumed to be defined elsewhere (e.g. items.py)

class RealestateSpider(scrapy.spiders.CrawlSpider):

    ###Real estate web crawler
    name = 'buyrentsold'
    allowed_domains = ['realestate.com.au']

    def __init__(self, command, search):
        search = re.sub(r'\s+', '+', re.sub(',+', '%2c', search)).lower()
        url = '/{0}/in-{{0}}{{{{0}}}}/list-{{{{1}}}}'.format(command)
        start_url = 'http://www.{0}{1}'
        start_url = start_url.format(
                self.allowed_domains[0], url.format(search)
        )
        self.start_urls = [start_url.format('', 1)]
        extractor = scrapy.linkextractors.sgml.SgmlLinkExtractor(
                allow=url.format(re.escape(search)).format('.*', '')
        )
        rule = scrapy.spiders.Rule(
                extractor, callback='parse_items', follow=True
        )
        self.rules = [rule]
        super(RealestateSpider, self).__init__()

    def parse_items(self, response):
        ###Parse a page of real estate listings
        hxs = scrapy.selector.HtmlXPathSelector(response)
        for i in hxs.select('//div[contains(@class, "listingInfo")]'):
            item = RealestateItem()
            path = 'div[contains(@class, "propertyStats")]//text()'
            item['price'] = i.select(path).extract()
            vcard = i.select('div[contains(@class, "vcard")]//a')
            item['address'] = vcard.select('text()').extract()
            url = vcard.select('@href').extract()
            if len(url) == 1:
                item['url'] = 'http://www.{0}{1}'.format(
                        self.allowed_domains[0], url[0]
                )
            features = i.select('dl')
            for field in ('bed', 'bath', 'car'):
                path = '(@class, "rui-icon-{0}")'.format(field)
                path = 'dt[contains{0}]'.format(path)
                path = '{0}/following-sibling::dd[1]'.format(path)
                path = '{0}/text()'.format(path)
                item[field] = features.select(path).extract() or 0
            yield item

Here is the code that was running when the error occurred:

crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings)
sp = RealestateSpider(command, search)
crawler.crawl(sp)
crawler.start()

Can anyone help me with this? Thanks!

2 Answers:

Answer 0 (score: 1):

The crawler.crawl() method needs a spider class as its argument, but the code here hands it a spider object (an instance).

There are a few ways to deal with this, but the most straightforward is to subclass the spider:

class MySpider(scrapy.spiders.Spider):
    command = None
    search = None

    def __init__(self):
        # do something with self.command and self.search
        super(MySpider, self).__init__()

Then:

crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings)
class MySpider(RealestateSpider):
    command = 'foo'
    search = 'bar'
crawler.crawl(MySpider)
crawler.start()
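
For what it's worth, crawler.crawl() also forwards any extra arguments it receives to the spider's constructor, so another option (a sketch, assuming command and search are defined as in the question) is to pass the original class along with its arguments:

crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings)
# Extra arguments to crawl() are passed through to RealestateSpider.__init__
crawler.crawl(RealestateSpider, command, search)
crawler.start()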

Answer 1 (score: 0):

I ran into this exact problem, and the solution above was more involved than I needed. I got around it by passing the arguments as class attributes instead; see the sketch below.

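Something along these lines (a minimal sketch built on the question's spider; the subclass name and attribute values here are placeholders):

class BuyMelbourneSpider(RealestateSpider):
    # Arguments supplied as class attributes instead of __init__ parameters
    command = 'buy'
    search = 'melbourne'

    def __init__(self):
        # Forward the class attributes to the original two-argument constructor
        super(BuyMelbourneSpider, self).__init__(self.command, self.search)

crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings)
crawler.crawl(BuyMelbourneSpider)  # the class now instantiates without arguments
crawler.start()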

Hopefully this is a potential way to solve your problem as well.