How to scrape data from multiple pages into the same row

Date: 2017-10-11 20:37:19

Tags: json web-scraping

How can I scrape data from multiple pages into the same row?

main page > next page (scrape title) > sub page (scrape img)
In my case:

all product > product page n°1 (scrape title) > sub page product n°1(scrape img)

            > product page n°2 (scrape title) > sub page product n°2 (scrape img)

My JSON result (wrong):

  • Row 1: title_1 = ........ (product n°1)
  • Row 2: img_hd_1 = ........ (product n°1)
  • Row 3: title_1 = ........ (product n°2)
  • Row 4: img_hd_1 = ........ (product n°2)

Desired result:

  • Row 1: title_1 = ........, img_hd_1 = ........ (product n°1)
  • Row 2: title_1 = ........, img_hd_1 = ........ (product n°2)

The scraped values are correct, but the structure is not. How can I scrape data from multiple pages into the same row?

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotesbij'
    allowed_domains = ['test.com']
    start_urls = ['http://test.com']

    # page 1: collect all product URLs
    def parse(self, response):
        urls = response.css('div.item > div.info > h3 > a::attr(href)').extract()
        for url in urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_details_product)

    # page 2: scrape the title from each product page
    def parse_details_product(self, response):
        yield {
            'title': response.css('div.detail-wrap > h1::text').extract(),
        }
        # page 2: follow the photo URL (on the same page as the title)
        url_img = response.css('div.ui-image-viewer-thumb-wrap > a::attr(href)')[0].extract()
        yield scrapy.Request(url=response.urljoin(url_img), callback=self.parse_reviews)

    # page 3: scrape the photo
    def parse_reviews(self, response):
        yield {
            'img_hd_1': response.css('a > img::attr(src)').extract(),
        }
# The result is good but not the structure. How to scrape data from multiple pages into the same row?
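The split rows come from yielding two separate items per product: one with the title in `parse_details_product` and one with the image in `parse_reviews`. The usual Scrapy fix is to build the item across callbacks instead: scrape the title, attach the half-built item to the follow-up request via `request.meta` (or `cb_kwargs` on newer Scrapy versions), and yield the combined item only in the final callback. Below is a minimal stdlib-only sketch of that carry-through pattern; plain functions and dicts stand in for Scrapy responses and requests, and all page data here is hypothetical:

```python
# Sketch of the "carry the partial item through the callback chain" pattern.
# In real Scrapy the hand-off would be:
#   yield scrapy.Request(url_img, callback=self.parse_reviews,
#                        meta={'item': item})
# and parse_reviews would read it back as response.meta['item'].

def parse_details_product(page, meta=None):
    """Page 2: start the item with the title, then forward it via meta."""
    item = {'title': page['title']}
    # Do NOT yield the half-built item here; pass it to the next callback.
    return parse_reviews(page['sub_page'], meta={'item': item})

def parse_reviews(page, meta):
    """Page 3: complete the item and emit it once, with both fields."""
    item = meta['item']
    item['img_hd_1'] = page['img_src']
    return item

# Hypothetical crawled pages (product page -> sub page with the photo).
pages = [
    {'title': 'product 1', 'sub_page': {'img_src': 'http://test.com/1.jpg'}},
    {'title': 'product 2', 'sub_page': {'img_src': 'http://test.com/2.jpg'}},
]

rows = [parse_details_product(p) for p in pages]
# Each row now holds both fields, one row per product:
# [{'title': 'product 1', 'img_hd_1': 'http://test.com/1.jpg'},
#  {'title': 'product 2', 'img_hd_1': 'http://test.com/2.jpg'}]
```

With this structure each product yields exactly one item, so the exported JSON has `title` and `img_hd_1` on the same row.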

Thanks.

0 Answers:

No answers