Scrapy: custom callback not working

Date: 2016-07-24 18:04:10

Tags: python callback scrapy web-crawler

I'm at a loss as to why my spider isn't working! I am by no means a programmer, so please be kind! Haha

Background: I am trying to use Scrapy to scrape information about books found on Indigo.

The problem: My code never enters any of my custom callbacks... it only seems to work when I use `parse` as the callback.

If I change the callback in the `rules` section from `parse_books` to `parse`, the method that lists all the links works fine and prints out all the links I'm interested in. However, the callback inside that method (pointing to `parse_books`) is never called! Strangely, if I rename `parse` to some other method name (e.g. `testmethod`) and then rename `parse_books` to `parse`, the method that scrapes the information into items works just fine!

What I'm trying to achieve: All I want is to enter a page, say "Bestsellers", navigate to the corresponding item-level page for every item, and scrape all the book-related information. I seem to have both parts working independently :/

The code!

import scrapy
import json
import urllib
from scrapy.http import Request
from urllib import urlencode
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import urlparse



from TEST20160709.items import IndigoItem
from TEST20160709.items import SecondaryItem



item = IndigoItem()
scrapedItem = SecondaryItem()

class IndigoSpider(CrawlSpider):

    protocol='https://'
    name = "site"
    allowed_domains = [
        "chapters.indigo.ca/en-ca/Books",
        "chapters.indigo.ca/en-ca/Store/Availability/"
    ]

    start_urls = [
         'https://www.chapters.indigo.ca/en-ca/books/bestsellers/',
    ]


    rules = (
        Rule(LinkExtractor(), follow=True),
        Rule(LinkExtractor(), callback="parse_books", follow=True),
    )



    def getInventory(self, bookID):
        params = {
            'pid': bookID,
            'catalog': 'books'
        }
        yield Request(
            url="https://www.chapters.indigo.ca/en-ca/Store/Availability/?" + urlencode(params),
            dont_filter=True,
            callback=self.parseInventory
        )



    def parseInventory(self,response):
        dataInventory = json.loads(response.body)

        for entry in dataInventory['Data']:
            scrapedItem['storeID'] = entry['ID']
            scrapedItem['storeType'] = entry['StoreType']
            scrapedItem['storeName'] = entry['Name']
            scrapedItem['storeAddress'] = entry['Address']
            scrapedItem['storeCity'] = entry['City']
            scrapedItem['storePostalCode'] = entry['PostalCode']
            scrapedItem['storeProvince'] = entry['Province']
            scrapedItem['storePhone'] = entry['Phone']
            scrapedItem['storeQuantity'] = entry['QTY']
            scrapedItem['storeQuantityMessage'] = entry['QTYMsg']
            scrapedItem['storeHours'] = entry['StoreHours']
            scrapedItem['storeStockAvailibility'] = entry['HasRetailStock']
            scrapedItem['storeExclusivity'] = entry['InStoreExlusive']

            yield scrapedItem



    def parse (self, response):
        #GET ALL PAGE LINKS
        all_page_links = response.xpath('//ul/li/a/@href').extract()
        for relative_link in all_page_links:
            absolute_link = urlparse.urljoin(self.protocol+"www.chapters.indigo.ca",relative_link.strip())
            absolute_link = absolute_link.split("?ref=",1)[0]
            request = scrapy.Request(absolute_link, callback=self.parse_books)
            print "FULL link: "+absolute_link

            yield Request(absolute_link, callback=self.parse_books)





    def parse_books (self, response):

        for sel in response.xpath('//form[@id="aspnetForm"]/main[@id="main"]'):
            #XML/HTTP/CSS ITEMS
            item['title']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h1[@id="product-title"][@class][@data-auto-id]/text()').extract())
            item['authors']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h2[@class="major-contributor"]/a[contains(@class, "byLink")][@href]/text()').extract())
            item['productSpecs']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/p[@class="product-specs"]/text()').extract())
            item['instoreAvailability']= map(unicode.strip, sel.xpath('//span[@class="stockAvailable-mesg negative"][@data-auto-id]/text()').extract())
            item['onlinePrice']= map(unicode.strip, sel.xpath('//span[@id][@class="nonmemberprice__specialprice"]/text()').extract())
            item['listPrice']= map(unicode.strip, sel.xpath('//del/text()').extract())

            aboutBookTemp = map(unicode.strip, sel.xpath('//div[@class="read-more"]/p/text()').extract())
            item['aboutBook']= [aboutBookTemp]

            #Retrieve ISBN Identifier and extract numeric data
            ISBN_parse = map(unicode.strip, sel.xpath('(//div[@class="isbn-info"]/p[2])[1]/text()').extract())
            item['ISBN13']= [elem[11:] for elem in ISBN_parse]
            bookIdentifier = str(item['ISBN13'])
            bookIdentifier = re.sub("[^0-9]", "", bookIdentifier)


            print "THIS IS THE IDENTIFIER:" + bookIdentifier

            if bookIdentifier:
                yield self.getInventory(str(bookIdentifier))

            yield item

1 Answer:

Answer 0 (score: 1)

One of the first problems I noticed is that your `allowed_domains` class attribute is broken. It should contain domains (hence the name), not URL paths.

In your case, the correct value would be:

allowed_domains = [
    "chapters.indigo.ca",  # subdomain.domain.top_level_domain
]
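For intuition, the offsite check compares each request's hostname against the entries in `allowed_domains`, so an entry containing a URL path can never match and every request gets filtered. A rough, framework-free sketch of that check (my simplified approximation, not Scrapy's actual implementation):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as in the question

def is_offsite(url, allowed_domains):
    """Approximation of the offsite filter: the URL's hostname must equal
    an allowed domain or be a subdomain of one."""
    host = urlparse(url).netloc
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

# An entry with a path segment can never match a hostname:
print(is_offsite("https://www.chapters.indigo.ca/en-ca/books/bestsellers/",
                 ["chapters.indigo.ca/en-ca/Books"]))   # True  -> filtered
print(is_offsite("https://www.chapters.indigo.ca/en-ca/books/bestsellers/",
                 ["chapters.indigo.ca"]))               # False -> allowed
```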

If you check your spider log, you will see:

DEBUG: Filtered offsite request to 'www.chapters.indigo.ca'

This should not happen.
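Beyond `allowed_domains`, two other things in the posted spider can silently break the custom callbacks. First, the Scrapy documentation explicitly warns against overriding `parse` (or using it as a rule callback) in a `CrawlSpider`, because `CrawlSpider` uses `parse` internally to implement its rule logic; that matches the symptom where renaming the methods changes the behavior. Second, `yield self.getInventory(...)` yields a generator object rather than the `Request` inside it, so the inner requests are never scheduled. A framework-free sketch of that second pitfall, using hypothetical stand-in functions instead of real Scrapy objects:

```python
import types

def get_inventory(book_id):
    # Stand-in for the spider's getInventory(): a generator that, in the
    # real spider, yields a scrapy Request.
    yield "Request(availability-url for %s)" % book_id

def parse_books_wrong():
    # Yields the generator object itself -- Scrapy would not follow it.
    yield get_inventory("9780000000000")

def parse_books_right():
    # Re-yield each inner request so it actually gets scheduled.
    for request in get_inventory("9780000000000"):
        yield request

print(isinstance(next(parse_books_wrong()), types.GeneratorType))  # True
print(next(parse_books_right()))  # the stand-in "request" itself
```

In Python 2 (as in the question) the explicit loop above is the fix; on Python 3.3+ `yield from self.getInventory(bookIdentifier)` does the same thing.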