I'm trying to scrape the website of a well-known UK retailer and am getting the following AttributeError:

  File "nl_env/lib/python3.6/site-packages/scrapy/spiders/sitemap.py", line 52, in _parse_sitemap
    for r, c in self._cbs:
AttributeError: 'NlSMCrawlerSpider' object has no attribute '_cbs'

I may not have fully understood how SitemapSpider works; see my code below:
import time
from datetime import date

from scrapy.spiders import SitemapSpider
from selenium import webdriver

from nl_scraper.items import NlScrapeItem  # item class from my project (package name shown here for illustration)


class NlSMCrawlerSpider(SitemapSpider):
    name = 'nl_smcrawler'
    allowed_domains = ['newlook.com']
    sitemap_urls = ['http://www.newlook.com/uk/sitemap/maps/sitemap_uk_product_en_1.xml']
    sitemap_follow = ['/uk/womens/clothing/']

    # sitemap_rules = [
    #     ('/uk/womens/clothing/', 'parse_product'),
    # ]

    def __init__(self):
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800, 600)
        time.sleep(2)

    def parse_product(self, response):
        driver = self.driver
        driver.get(response.url)
        time.sleep(1)

        # Collect products
        itemDetails = driver.find_elements_by_class_name('product-details-page content')

        # Pull features
        desc = itemDetails[0].find_element_by_class_name('product-description__name').text
        href = driver.current_url

        # Generate a product identifier
        identifier = href.split('/p/')[1].split('?comp')[0]
        identifier = int(identifier)

        # datetime
        dt = date.today()
        dt = dt.isoformat()

        # Price symbol removal and float conversion
        try:
            priceString = itemDetails[0].find_element_by_class_name('price product-description__price').text
        except:
            priceString = itemDetails[0].find_element_by_class_name('price--previous-price product-description__price--previous-price ng-scope').text
        priceInt = priceString.split('£')[1]
        originalPrice = float(priceInt)

        # discountedPrice logic
        try:
            discountedPriceString = itemDetails[0].find_element_by_class_name('price price--marked-down product-description__price').text
            discountedPriceInt = discountedPriceString.split('£')[1]
            discountedPrice = float(discountedPriceInt)
        except:
            discountedPrice = 'N/A'

        # NlScrapeItem
        item = NlScrapeItem()

        # Append product fields to NlScrapeItem
        item['identifier'] = identifier
        item['href'] = href
        item['description'] = desc
        item['originalPrice'] = originalPrice
        item['discountedPrice'] = discountedPrice
        item['firstSighted'] = dt
        item['lastSighted'] = dt
        yield item
Also, please don't hesitate to ask for any further details; see the link to the sitemap as well as the link to the actual file in the Scrapy package so we avoid any confusion (link - github). Many thanks for your help.
Edit: one thought

Looking at the 2nd link (to the Scrapy package), I can see that _cbs is initialised in the def __init__(self, *a, **kw): function. Is the fact that I have my own __init__ logic throwing it off?
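For context, the relevant part of that file looks roughly like this (a paraphrase reconstructed from the Scrapy source, not a verbatim copy; regex() is a small module-level helper there that compiles string patterns):

import re

from scrapy.spiders import Spider


def regex(x):
    # module-level helper in scrapy/spiders/sitemap.py
    if isinstance(x, str):
        return re.compile(x)
    return x


class SitemapSpider(Spider):
    sitemap_urls = ()
    sitemap_rules = [('', 'parse')]
    sitemap_follow = ['']

    def __init__(self, *a, **kw):
        super(SitemapSpider, self).__init__(*a, **kw)
        self._cbs = []
        for r, c in self.sitemap_rules:
            if isinstance(c, str):
                c = getattr(self, c)          # resolve callback names to bound methods
            self._cbs.append((regex(r), c))
        self._follow = [regex(x) for x in self.sitemap_follow]

If that is accurate, a subclass that defines its own __init__ without calling super() would skip all of this and never get _cbs.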
Answer 0 (score: 1)
There are two problems with your scraper. The first is the __init__ method:
def __init__(self):
    self.driver = webdriver.Safari()
    self.driver.set_window_size(800, 600)
    time.sleep(2)
Here you have defined a new __init__ and overridden the base class's __init__. Since your __init__ never calls it, _cbs is never initialised. You can easily fix this by changing your __init__ method as follows:
def __init__(self, *a, **kw):
    super(NlSMCrawlerSpider, self).__init__(*a, **kw)
    self.driver = webdriver.Safari()
    self.driver.set_window_size(800, 600)
    time.sleep(2)
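On Python 3 the shorter super().__init__(*a, **kw) spelling works the same way; the important part is that the base initializer runs so that _cbs and _follow get built from sitemap_rules and sitemap_follow.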
Next, with your sitemap_rules commented out the SitemapSpider will always send responses to the parse method (the default rule is ('', 'parse')), and you haven't defined one. So I've added a simple parse method that just prints the URL:
def parse(self, response):
    print(response.url)
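Putting the two fixes together, a minimal corrected skeleton could look like the sketch below. It goes a little beyond the answer above: it re-enables sitemap_rules so product URLs are routed to parse_product (with a catch-all rule to parse), and it adds a closed() hook, a standard Scrapy spider method, to shut the Selenium driver down when the spider finishes.

import time

from scrapy.spiders import SitemapSpider
from selenium import webdriver


class NlSMCrawlerSpider(SitemapSpider):
    name = 'nl_smcrawler'
    allowed_domains = ['newlook.com']
    sitemap_urls = ['http://www.newlook.com/uk/sitemap/maps/sitemap_uk_product_en_1.xml']
    sitemap_follow = ['/uk/womens/clothing/']
    sitemap_rules = [
        ('/uk/womens/clothing/', 'parse_product'),  # product pages
        ('', 'parse'),                              # catch-all; the first matching rule wins
    ]

    def __init__(self, *a, **kw):
        # Let SitemapSpider build _cbs and _follow before adding Selenium state.
        super(NlSMCrawlerSpider, self).__init__(*a, **kw)
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800, 600)
        time.sleep(2)

    def parse(self, response):
        # Simple fallback from the answer: just log the URL.
        print(response.url)

    def parse_product(self, response):
        # The Selenium extraction and item yield from the question go here.
        ...

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; release the browser.
        self.driver.quit()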