I am trying to scrape an e-commerce page with Scrapy, and my code looks like this:
import scrapy
from scrapy import Request


class HugobossSpider(scrapy.Spider):
    name = 'hugoboss'
    allowed_domains = ['hugoboss.com/de/herren-schuhe/?sz=60&start=0']
    start_urls = ['https://hugoboss.com/de/herren-schuhe/?sz=60&start=0']

    def parse(self, response):
        # The main method of the spider. It scrapes the URL(s) specified in the
        # 'start_urls' attribute above. The content of the scraped URL is passed on
        # as the 'response' object.
        nextpageurl = response.xpath("//a[@title='Weiter']/@href")
        for item in self.scrape(response):
            yield item
        if nextpageurl:
            path = nextpageurl.extract_first()
            nextpage = response.urljoin(path)
            print("Found url: {}".format(nextpage))
            yield Request(nextpage, callback=self.parse)

    def parse(self, response):
        # Extract the content using XPath and CSS selectors
        url = response.xpath('//div/@data-mouseoverimage').extract()
        product_title = response.xpath('//*[@class="product-tile__productInfoWrapper product-tile__productInfoWrapper--is-small font__subline"]/text()').extract()
        price = response.css('.product-tile__offer .price-sales::text').getall()
        # Go through the extracted content row by row
        for item in zip(url, product_title, price):
            # Create a dictionary to store the scraped info
            item = {
                'URL': item[0],
                'Product Name': item[1].replace("\n", '').replace("von", ""),
                'Price': item[2]
            }
            # Yield the scraped info to Scrapy
            yield item
The problem is that the code extracts the information from the current page but not from all the pages. Can anyone help?
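Answer 0: (score: 1)
You have defined the method def parse() twice. In a Python class body the second definition replaces the first, so only the extraction code runs and the pagination logic is never reached.
Rename the second one (perhaps to def extract()), make sure the call in the first method uses the same name (the code currently calls self.scrape(response)), and try again.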
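To see the effect without running a crawl, here is a minimal sketch in plain Python with no Scrapy dependency. The PAGES dictionary, the class names and the item strings are made up for illustration, and the recursive call only stands in for what Scrapy does when you yield a Request with callback=self.parse.

```python
# Hypothetical stand-in data: each "page" lists its items and the next page (or None).
PAGES = {
    "page1": {"items": ["shoe A", "shoe B"], "next": "page2"},
    "page2": {"items": ["shoe C"], "next": None},
}


class BrokenSpider:
    """Mimics the question: two methods share the name `parse`."""

    def parse(self, page):
        # Intended pagination logic; replaced by the definition below.
        yield from self.scrape(page)
        next_page = PAGES[page]["next"]
        if next_page is not None:
            yield from self.parse(next_page)

    def parse(self, page):  # rebinds `parse`, so the method above is lost
        # Item extraction only; the next page is never followed.
        yield from PAGES[page]["items"]


class FixedSpider:
    """Same logic, but the extraction method has its own name."""

    def parse(self, page):
        yield from self.scrape(page)
        next_page = PAGES[page]["next"]
        if next_page is not None:
            # Stand-in for `yield Request(nextpage, callback=self.parse)`.
            yield from self.parse(next_page)

    def scrape(self, page):
        yield from PAGES[page]["items"]


broken_items = list(BrokenSpider().parse("page1"))
fixed_items = list(FixedSpider().parse("page1"))
print(broken_items)  # ['shoe A', 'shoe B']
print(fixed_items)   # ['shoe A', 'shoe B', 'shoe C']
```

In the real spider the same rename should be enough for parse to be called on each following page. As a separate point that may also matter here: allowed_domains normally holds bare domain names such as 'hugoboss.com', and an entry that includes a path and query string may cause follow-up requests to be filtered as offsite.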