Passing extra values to a Scrapy spider via start_url without using meta

Asked: 2018-04-02 20:01:46

Tags: python web-scraping scrapy

I scraped items with one spider, and now I am writing a second spider that uses a search engine to fill in some of the missing data. I would like the first spider's items to be updated row by row.

However, I cannot figure out how to pass the current row, or the start_url, from the __init__ method to the parse callback.

I know I could pass request.url in meta to the child request and then parse meta to extract the company name, but that feels awkward.
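
Roughly, the variant I am trying to avoid looks like the sketch below; parse_company and the query-string round-trip are only illustrative, not my real code:

from urllib.parse import parse_qs, urlparse

import scrapy


class MetaSketchSpider(scrapy.Spider):
    name = 'meta_sketch'

    def parse(self, response):
        for href in response.css('.result__url::attr(href)').extract():
            # stash the search URL so the child request can get back to it
            yield response.follow(href, callback=self.parse_company,
                                  meta={'search_url': response.url})

    def parse_company(self, response):
        # reverse-engineer the query back out of the search URL -- awkward
        query = parse_qs(urlparse(response.meta['search_url']).query)
        company = query.get('q', [''])[0]
        self.logger.debug('recovered query: %s', company)

Here is my current spider: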

import csv

import scrapy


class DuckDuckGoComSpider(scrapy.Spider):
    name = 'duckduckgo.com'

    def __init__(self, csv_file_path, *args, **kwargs):
        super(DuckDuckGoComSpider, self).__init__(*args, **kwargs)
        self.csv_file_path = csv_file_path
        self.start_urls = []  # instance list, not the shared class attribute
        with open(csv_file_path, newline='') as csvfile:
            for row in csv.DictReader(csvfile):
                self.start_urls.append(
                    f'https://duckduckgo.com/html/?q="website" {row["name"]} {row["location"]}')

    def parse(self, response):
        results = response.css('.result__url::attr(href)')
        if results:
            # slice instead of range(6): indexing past the end would raise
            # IndexError when a page has fewer than six results
            for result in results[:6]:
                yield response.follow(result, callback=self.parse_item)
        else:
            self.logger.debug('No more products')

    def parse_item(self, response):
        # DDGItemLoader is my custom ItemLoader, defined elsewhere
        il = DDGItemLoader(response=response)
        il.add_value('url', response.url)
        il.add_css('title', 'meta[property="og:title"]::attr(content)')
        il.add_css('description',
                   'meta[property="og:description"]::attr(content)')

        item = il.load_item()
        yield item
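
For reference, I launch the spider by passing the CSV path as a spider argument (Scrapy forwards -a arguments to __init__); companies.csv is just a placeholder name:

scrapy crawl duckduckgo.com -a csv_file_path=companies.csv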

1 Answer:

Answer 0 (score: 0)

There are a couple of ways to pass values to the parse method, as casper noted:

  1. Compose the requests in start_requests() and pass the data you need in meta.
  2. Create a class-level data structure that can be used to look up the data you need; it can be updated from the spider or from a custom pipeline (see the sketch after the code below).

Using meta looks like this:

    import csv

    import scrapy


    class DuckDuckGoComBatchSpider(scrapy.Spider):
        name = 'duckduckgo_batch.com'

        def __init__(self, csv_file_path, *args, **kwargs):
            super(DuckDuckGoComBatchSpider, self).__init__(*args, **kwargs)
            self.csv_file_path = csv_file_path

        def start_requests(self):
            pages = []
            with open(self.csv_file_path, newline='') as csvfile:
                reader = csv.DictReader(csvfile)
                self.fieldnames = reader.fieldnames
                for row in reader:
                    url = f'https://duckduckgo.com/html/?q="website" {row["name"]} {row["location"]}'
                    # copy every CSV column into meta so it travels with the request
                    meta = {field: row[field] for field in reader.fieldnames}
                    pages.append(scrapy.Request(url, callback=self.parse, meta=meta))
            return pages
    
        def parse(self, response):
            results = response.css('.result__url::attr(href)')
            if results:
                # follow only the top result, forwarding the incoming meta
                yield response.follow(results[0], callback=self.parse_item,
                                      meta=response.meta)
            else:
                self.logger.debug('No more products')
    
        def parse_item(self, response):
            il = DDGItemLoader(response=response)
            il.add_value('website', response.url)
            il.add_css('website_title', 'meta[property="og:title"]::attr(content)')
            il.add_css('website_description',
                       'meta[property="og:description"]::attr(content)')
            il.add_value('name', response.meta["name"])

            item = il.load_item()
            # copy the original CSV columns back onto the item; each of these
            # keys must also be declared as a field on the item class
            for key in response.meta:
                if key in self.fieldnames:
                    item[key] = response.meta[key]
            yield item
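
For completeness, here is a minimal sketch of option 2, keeping the CSV rows in a class-level dict instead of threading them through meta. The DuckDuckGoComDictSpider class and the rows_by_url name are illustrative, not part of the original code, and the lookup assumes the request is not redirected:

    import csv

    import scrapy


    class DuckDuckGoComDictSpider(scrapy.Spider):
        name = 'duckduckgo_dict.com'

        def __init__(self, csv_file_path, *args, **kwargs):
            super(DuckDuckGoComDictSpider, self).__init__(*args, **kwargs)
            self.csv_file_path = csv_file_path
            self.rows_by_url = {}  # canonical request URL -> original CSV row

        def start_requests(self):
            with open(self.csv_file_path, newline='') as csvfile:
                for row in csv.DictReader(csvfile):
                    url = f'https://duckduckgo.com/html/?q="website" {row["name"]} {row["location"]}'
                    request = scrapy.Request(url, callback=self.parse)
                    # key by request.url, which Scrapy has already escaped,
                    # so the lookup in parse() matches response.url
                    self.rows_by_url[request.url] = row
                    yield request

        def parse(self, response):
            # recover the row from the shared dict instead of meta
            row = self.rows_by_url.get(response.url, {})
            self.logger.debug('CSV row for %s: %r', response.url, row)

This keeps the requests themselves lean, at the cost of holding every row in memory for the lifetime of the spider; a custom pipeline could update the same dict as items come through.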