How to scrape data from multiple pages into the same row

Date: 2017-10-11 20:37:19

Tags: json web-scraping

How can I scrape data from multiple pages into the same row?

main page > next page (scrape title) > sub page (scrape img)
In my case:

all product > product page n°1 (scrape title) > sub page product n°1(scrape img)

            > product page n°2 (scrape title) > sub page product n°2 (scrape img)

My JSON result (wrong):

  • Row 1: title_1 = ........ (product n°1)
  • Row 2: img_hd_1 = ........ (product n°1)
  • Row 3: title_1 = ........ (product n°2)
  • Row 4: img_hd_1 = ........ (product n°2)

Desired result:

  • Row 1: title_1 = ........, img_hd_1 = ........ (product n°1)
  • Row 2: title_1 = ........, img_hd_1 = ........ (product n°2)

The scraped values are correct, but the structure is not. How can I scrape data from multiple pages into the same row?

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotesbij'
    allowed_domains = ['test.com']
    start_urls = ['http://test.com']

    # page 1: collect all product URLs
    def parse(self, response):
        urls = response.css('div.item > div.info > h3 > a::attr(href)').extract()
        for url in urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_details_product)

    # page 2: scrape the title from each product page
    def parse_details_product(self, response):
        yield {
            'title': response.css('div.detail-wrap > h1::text').extract(),
        }
        # page 2: follow the photo URL (on the same page as the title)
        url_img = response.css('div.ui-image-viewer-thumb-wrap > a::attr(href)')[0].extract()
        yield scrapy.Request(url=response.urljoin(url_img), callback=self.parse_reviews)

    # page 3: scrape the photo
    def parse_reviews(self, response):
        yield {
            'img_hd_1': response.css('a > img::attr(src)').extract(),
        }
# The result is good but not the structure. How to scrape data from multiple pages into the same row?
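The split rows come from yielding two separate items per product: one with the title in `parse_details_product` and one with the image in `parse_reviews`. The usual Scrapy fix is to build the item across callbacks instead: scrape the title, attach the half-built item to the follow-up request via `request.meta` (or `cb_kwargs` on newer Scrapy versions), and yield the combined item only in the final callback. Below is a minimal stdlib-only sketch of that carry-through pattern; plain functions and dicts stand in for Scrapy responses and requests, and all page data here is hypothetical:

```python
# Sketch of the "carry the partial item through the callback chain" pattern.
# In real Scrapy the hand-off would be:
#   yield scrapy.Request(url_img, callback=self.parse_reviews,
#                        meta={'item': item})
# and parse_reviews would read it back as response.meta['item'].

def parse_details_product(page, meta=None):
    """Page 2: start the item with the title, then forward it via meta."""
    item = {'title': page['title']}
    # Do NOT yield the half-built item here; pass it to the next callback.
    return parse_reviews(page['sub_page'], meta={'item': item})

def parse_reviews(page, meta):
    """Page 3: complete the item and emit it once, with both fields."""
    item = meta['item']
    item['img_hd_1'] = page['img_src']
    return item

# Hypothetical crawled pages (product page -> sub page with the photo).
pages = [
    {'title': 'product 1', 'sub_page': {'img_src': 'http://test.com/1.jpg'}},
    {'title': 'product 2', 'sub_page': {'img_src': 'http://test.com/2.jpg'}},
]

rows = [parse_details_product(p) for p in pages]
# Each row now holds both fields, one row per product:
# [{'title': 'product 1', 'img_hd_1': 'http://test.com/1.jpg'},
#  {'title': 'product 2', 'img_hd_1': 'http://test.com/2.jpg'}]
```

With this structure each product yields exactly one item, so the exported JSON has `title` and `img_hd_1` on the same row.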

Thanks.

0 Answers:

No answers