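I am scraping the following website: https://www2.woolworthsonline.com.au. My first step is to get the list of product categories, which I can do.
I then need to get the sub-categories. The sub-categories appear to be generated dynamically. I have set what I believe are the correct headers and body for the request, plus a small test callback to see whether the request works. My code never enters the callback, so there is presumably something wrong with the request.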
from urllib.parse import urlencode

from scrapy import Request

def parse(self, response):
    # Get links to the categories
    categories_links = response.xpath('//a[@class="navigation-link"]').re(r'href="(\S+)"')
    for link in categories_links:
        # Generate and yield a request for the sub-categories
        get_request_headers = dict()
        get_request_headers['Accept'] = 'application/xml, text/xml, */*; q=0.01'
        get_request_headers['Accept-Encoding'] = 'gzip, deflate, sdch'
        get_request_headers['Accept-Language'] = 'en-US,en;q=0.8'
        get_request_headers['Connection'] = 'keep-alive'
        get_request_body = urlencode(
            {'_mode': 'ajax',
             '_ajaxsource': 'navigation-panel',
             '_referrer': link,
             '_bannerViews': '6064',
             '_': '1429428492880'}
        )
        url_link = 'https://www2.woolworthsonline.com.au' + link
        yield Request(url=url_link, callback=self.subcategories, headers=get_request_headers, method='POST', body=get_request_body, meta={'category_link': link})

def subcategories(self, response):
    # Small test callback: only reports which URL came back
    print("sub-categories test: %s" % response.url)
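One way to sanity-check the request before involving Scrapy is to build the body on its own and look at it. The following is a minimal sketch, assuming Python 3 (where urlencode lives in urllib.parse). The helper name build_ajax_body and the example path '/some/category' are hypothetical; the parameter names and values are the ones from the code above.

```python
from urllib.parse import urlencode, parse_qs


def build_ajax_body(referrer):
    """Hypothetical helper: form-encode the AJAX parameters from the question."""
    params = {
        '_mode': 'ajax',
        '_ajaxsource': 'navigation-panel',
        '_referrer': referrer,
        '_bannerViews': '6064',
        '_': '1429428492880',
    }
    return urlencode(params)


# '/some/category' is a made-up relative link, not a real path on the site
body = build_ajax_body('/some/category')
print(body)

# Decoding the body again shows what the server would see as form fields
print(parse_qs(body)['_referrer'])
```

Printing the encoded string makes it easy to compare against the request the browser sends for the same category.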
Answer 0 (score: 1)
Try this:
from scrapy import Request

# The attribute and methods below belong to the spider class
BASE_URL = 'https://www2.woolworthsonline.com.au'

def parse(self, response):
    # Collect the relative category links
    categories = response.xpath(
        '//a[@class="navigation-link"]/@href').extract()
    for link in categories:
        yield Request(url=self.BASE_URL + link, callback=self.parse_subcategory)

def parse_subcategory(self, response):
    # Collect the sub-category links under the selected category
    sub_categories = response.xpath(
        '//li[@class="navigation-node navigation-category selected"]/ul/li/span/a/@href').extract()
    for link in sub_categories:
        yield Request(url=self.BASE_URL + link, callback=self.parse_products)

def parse_products(self, response):
    # here you will get the list of products based on the subcategory
    # extract the product details here
    pass
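In the parse function we yield a request for each category-url (we have to supply BASE_URL together with the extracted category-url, because these URLs are given as relative URLs rather than absolute ones). In the callback function (here parse_subcategory) we receive the response for each category-url.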
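The relative-versus-absolute point can be checked outside Scrapy as well. This is a minimal sketch, assuming Python 3 and using the standard library's urljoin in place of string concatenation (a swapped-in technique, not what the code above does). The helper name to_absolute and the href values are made-up examples.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www2.woolworthsonline.com.au'


def to_absolute(href):
    """Resolve an href against BASE_URL; absolute URLs pass through unchanged."""
    return urljoin(BASE_URL, href)


# A relative link, as extracted from an @href attribute (hypothetical value)
print(to_absolute('/some/category'))

# An already-absolute link is returned as it is
print(to_absolute('https://example.com/other'))
```

Inside a Scrapy callback, response.urljoin(href) performs the same kind of resolution against the URL of the response, which avoids hard-coding the base.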