Scrapy SitemapSpider not working properly

Date: 2017-09-02 13:34:46

Tags: python selenium-webdriver scrapy scrapy-spider

I'm trying to scrape the website of a well-known UK retailer and am getting an AttributeError:

File "nl_env/lib/python3.6/site-packages/scrapy/spiders/sitemap.py", line 52, in _parse_sitemap
    for r, c in self._cbs:

AttributeError: 'NlSMCrawlerSpider' object has no attribute '_cbs'

I may not have fully understood how SitemapSpider works — see my code below:

from datetime import date
import time

from scrapy.spiders import SitemapSpider
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

from ..items import NlScrapeItem  # adjust to wherever NlScrapeItem is defined


class NlSMCrawlerSpider(SitemapSpider):
    name = 'nl_smcrawler'
    allowed_domains = ['newlook.com']
    sitemap_urls = ['http://www.newlook.com/uk/sitemap/maps/sitemap_uk_product_en_1.xml']
    sitemap_follow = ['/uk/womens/clothing/']

    # sitemap_rules = [
    #     ('/uk/womens/clothing/', 'parse_product'),
    # ]

    def __init__(self):
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800, 600)
        time.sleep(2)

    def parse_product(self, response):
        driver = self.driver
        driver.get(response.url)
        time.sleep(1)

        # Collect products
        itemDetails = driver.find_elements_by_class_name('product-details-page content')

        # Pull features
        desc = itemDetails[0].find_element_by_class_name('product-description__name').text
        href = driver.current_url

        # Generate a product identifier
        identifier = href.split('/p/')[1].split('?comp')[0]
        identifier = int(identifier)

        # datetime
        dt = date.today()
        dt = dt.isoformat()

        # Price symbol removal and float conversion
        try:
            priceString = itemDetails[0].find_element_by_class_name('price product-description__price').text
        except NoSuchElementException:
            priceString = itemDetails[0].find_element_by_class_name('price--previous-price product-description__price--previous-price ng-scope').text
        priceInt = priceString.split('£')[1]
        originalPrice = float(priceInt)

        # discountedPrice logic
        try:
            discountedPriceString = itemDetails[0].find_element_by_class_name('price price--marked-down product-description__price').text
            discountedPriceInt = discountedPriceString.split('£')[1]
            discountedPrice = float(discountedPriceInt)
        except NoSuchElementException:
            discountedPrice = 'N/A'

        # Append product fields to an NlScrapeItem
        item = NlScrapeItem()
        item['identifier'] = identifier
        item['href'] = href
        item['description'] = desc
        item['originalPrice'] = originalPrice
        item['discountedPrice'] = discountedPrice
        item['firstSighted'] = dt
        item['lastSighted'] = dt

        yield item

Also, don't hesitate to ask for any further details; see the link to the sitemap as well as the link to the actual file in the Scrapy package, to save you from errors (link - github). Any help is much appreciated.

Edit: one thought — looking at the 2nd link (from the Scrapy package), I can see that _cbs is initialised in the def __init__(self, *a, **kw): function. Is the fact that I have my own __init__ logic throwing it off?
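That hypothesis can be checked with a minimal, Scrapy-free sketch (the class names here are invented for illustration): a subclass that defines its own __init__ without calling super() never runs the base-class initialiser, so any attributes set there are missing.

```python
class Base:
    def __init__(self):
        # The base-class initialiser sets an attribute,
        # as SitemapSpider.__init__ sets _cbs.
        self._cbs = []


class Broken(Base):
    def __init__(self):
        # Overrides Base.__init__ without calling super().__init__(),
        # so self._cbs is never created.
        self.driver = 'stub'


class Fixed(Base):
    def __init__(self):
        super().__init__()  # run Base.__init__ first
        self.driver = 'stub'


print(hasattr(Broken(), '_cbs'))  # False
print(hasattr(Fixed(), '_cbs'))   # True
```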

1 Answer:

Answer 0 (score: 1)

There are two problems with your scraper. One is the __init__ method:

def __init__(self):
    self.driver = webdriver.Safari()
    self.driver.set_window_size(800, 600)
    time.sleep(2)

You have defined a new __init__ that overrides the base class's __init__. Since your __init__ never calls it, _cbs is never initialised. You can easily fix this by changing your __init__ method as follows:

def __init__(self, *a, **kw):
    super(NlSMCrawlerSpider, self).__init__(*a, **kw)

    self.driver = webdriver.Safari()
    self.driver.set_window_size(800, 600)
    time.sleep(2)

Next, SitemapSpider will by default always send responses to the parse method, and you haven't defined one. So I've added a simple parse that prints the URL:

def parse(self, response):
    print(response.url)
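Alternatively, instead of defining parse, the commented-out sitemap_rules from the question could be re-enabled so that matching product URLs are routed to parse_product. Each rule pairs a URL regex with a callback name; a stdlib-only sketch of that matching (the example product URL is hypothetical):

```python
import re

# Each sitemap rule pairs a URL regex with a callback name;
# SitemapSpider routes responses whose URL matches to that callback.
sitemap_rules = [
    ('/uk/womens/clothing/', 'parse_product'),
]

pattern, callback = sitemap_rules[0]
rx = re.compile(pattern)

# Hypothetical product URL for illustration.
url = 'http://www.newlook.com/uk/womens/clothing/dresses/p/12345'
print(bool(rx.search(url)), callback)  # True parse_product
```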