Scrapy: scraping second-level URLs

Date: 2017-02-16 06:42:36

Tags: python web-scraping scrapy

In the code below, the parse function executes about 32 times (the loop finds 32 hrefs), and each child link should be followed and scraped (32 separate URLs handled by the parse_next function). But the parse_next function either runs only once or is never called, and the output CSV file is empty. Can anyone help me figure out where I am going wrong?

import scrapy
import logging

logger = logging.getLogger('mycustomlogger')

from ScrapyTestProject.items import ScrapytestprojectItem
class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.in']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
def parse(self, response):
    logger.info("Parse function called on %s", response.url)
    for divs in response.css('div.viewport div.workspace div.float-box'):
        item = {'producturl': divs.css('a::attr(href)').extract_first(),
                'imageurl': divs.css('a img::attr(src)').extract_first(),
                'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
        next_page = response.urljoin(item['producturl'])
        #logger.info("This is an information %s", next_page)
        yield scrapy.Request(next_page, callback=self.parse_next, meta={'item': item})
        #yield item

def parse_next(self, response):
    item = response.meta['item']
    logger.info("Parse function called on2 %s", response.url)
    item['headline'] = response.css('div#content a.headline::text').extract()
    return item
    #response.css('div#product-variants a::attr(href)').extract()

1 Answer:

Answer 0 (score: 0)

OK, so a few things went wrong:

  • Indentation: parse and parse_next are defined at module level instead of inside the QuotesSpider class, so Scrapy never sees them as spider methods
  • The start_urls list is opened with [ but never closed with ]
  • allowed_domains uses the .in domain extension, while the site you want to crawl is on .com, so the offsite filter drops every follow-up request

Working code below:

import scrapy
import logging

class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/'
    ]
    def parse(self, response):
        # logger.info("Parse function called on %s", response.url)
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {'producturl': divs.css('a::attr(href)').extract_first(),
                    'imageurl': divs.css('a img::attr(src)').extract_first(),
                    'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            next_page = response.urljoin(item['producturl'])
            #logger.info("This is an information %s", next_page)
            yield scrapy.Request(next_page, callback=self.parse_next, meta={'item': item})
            #yield item

    def parse_next(self, response):
        item = response.meta['item']
        # logger.info("Parse function called on2 %s", response.url)
        item['headline'] = response.css('div#content a.headline::text').extract()
        return item
        #response.css('div#product-variants a::attr(href)').extract()

Note: I removed some logging / item pipeline code, since those were not defined on my machine.
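For reference, the question's import of ScrapytestprojectItem suggests an items module along these lines. This is only a minimal sketch: the real ScrapyTestProject/items.py is not shown in the question, and the field names below are assumptions inferred from the dict keys used in parse() and parse_next():

import scrapy

# ScrapyTestProject/items.py -- hypothetical reconstruction; the actual
# module is not shown in the question. Field names are assumed from the
# keys the spider sets: producturl, imageurl, description, headline.
class ScrapytestprojectItem(scrapy.Item):
    producturl = scrapy.Field()
    imageurl = scrapy.Field()
    description = scrapy.Field()
    headline = scrapy.Field()

With the spider saved in the project, you can run the crawl and write the scraped items to CSV with scrapy crawl nestedurl -o output.csv (the output filename here is just an example).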