How to scrape URLs extracted from a URL in Scrapy?

Date: 2016-11-22 12:24:59

Tags: python web-scraping scrapy scrapy-spider

Main URL = [https://www.amazon.in/s/ref=nb_sb_ss_i_1_8?url=search-alias%3Dcomputers&field-keywords=lenovo+laptop&sprefix=lenovo+m%2Cundefined%2C2740&crid=3L1Q2LMCKALCT]

URL extracted from the main URL = [http://www.amazon.in/Lenovo-Ideapad-15-6-inch-Integrated-Graphics/dp/B01EN6RA7W?ie=UTF8&keywords=lenovo%20laptop&qid=1479811190&ref_=sr_1_1&s=computers&sr=1-1]

import scrapy
from product.items import ProductItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class amazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.in"]
    start_urls = ["main url here"]

    def parse(self, response):
        item = ProductItem()
        for content in response.xpath("sample xpath"):
            url = content.xpath("a/@href").extract()
            # url is extracted from my main url
            request = scrapy.Request(str(url[0]), callback=self.page2_parse)
            item['product_Rating'] = request
        yield item

    def page2_parse(self, response):
        # here I didn't get the response for the second URL's content
        for content in response.xpath("sample xpath"):
            yield content.xpath("sample xpath").extract()

The second method is never executed here. Please help me.

1 Answer:

Answer 0 (score: 0)

I finally got it working as follows; the code below scrapes values from the URLs extracted from the main URL.

from scrapy import Request  # Request must be imported (or use scrapy.Request)

# both methods go inside the spider class
def parse(self, response):
    item = ProductItem()
    # collect every product detail-page link from the listing page
    url_list = response.xpath("//div[@class='listing']/div/a/@href").extract()
    item['product_DetailUrl'] = url_list
    for url in url_list:
        # pass the partially filled item to the second callback via meta
        request = Request(str(url), callback=self.page2_parse)
        request.meta['item'] = item
        yield request

def page2_parse(self, response):
    # pick up the item that was started in parse()
    item = response.meta['item']
    item['product_ColorAvailability'] = response.xpath(
        "//div[@id='templateOption']//ul/li//img/@color").extract()
    yield item
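
For completeness, the ProductItem referenced in both snippets is not shown in the post. Below is a minimal sketch of what product/items.py might look like, assuming only the field names that appear above (no other fields are taken from the original post):

import scrapy

class ProductItem(scrapy.Item):
    # field names assumed from the snippets above
    product_Rating = scrapy.Field()
    product_DetailUrl = scrapy.Field()
    product_ColorAvailability = scrapy.Field()

Because the partially filled item is passed along through request.meta, page2_parse adds its field to the same item and yields it once, so each detail URL produces a single combined item.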