Scraping multiple sections of the same website

Time: 2020-06-03 18:15:20

Tags: python python-3.x web-scraping scrapy web-crawler

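I am trying to scrape data from several sections of a website. The problem is that when I paginate, the spider does not fetch the pages in sequential order; it visits the page numbers in what looks like a random order.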

    import scrapy
    from webpreview import OpenGraph

    class IndiaTodaySpider(scrapy.Spider):
        name = 'indiatoday_story'

        def start_requests(self):
            # One URL template per section; {} is filled with the page number.
            urls = [
                'https://www.indiatoday.in/movies/celebrities?page={}',
                'https://www.indiatoday.in/movies/bollywood?page={}',
                'https://www.indiatoday.in/movies/hollywood?page={}',
                'https://www.indiatoday.in/movies/regional-cinema?page={}',
                # 'https://www.indiatoday.in/movies/standpoint?page={}',
                # 'https://www.indiatoday.in/movies/gossip?page={}',
            ]
            for url in urls:
                # Request pages 0 to 3 of each section.
                for i in range(4):
                    yield scrapy.Request(url=url.format(i), callback=self.parse, dont_filter=True)

        def parse(self, response):
            # Relative links to the stories listed on this page.
            all_hrefs = response.xpath('/html/body/div[1]/main/div/section/div[3]/div[1]/div[*]/div[2]/h2/a/@href').getall()

            for href in all_hrefs:
                story_url = 'https://www.indiatoday.in' + href
                print(story_url)
                # Read the Open Graph metadata of the story page.
                og = OpenGraph(story_url, ["og:title", "og:description", "og:image", "og:url"])
                print(og.title)
                print(og.description)
                print(og.image)
                print(og.url)

                yield {
                    "page_title": og.title,
                    "description": og.description,
                    "image_url": og.image,
                    "post_url": og.url
                }

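This is the log output for the crawled pages. Instead of the order 0, 1, 2, 3, I get something like 1, 0, 2, 3 or 1, 0, 3, 2. Why does this happen? Can someone help me?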

    2020-06-03 23:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/celebrities?page=0> (referer: None)
    2020-06-03 23:32:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/celebrities?page=1> (referer: None)
    2020-06-03 23:32:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/celebrities?page=2> (referer: None)
    2020-06-03 23:32:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/celebrities?page=3> (referer: None)
    2020-06-03 23:32:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/bollywood?page=1> (referer: None)
    2020-06-03 23:32:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/bollywood?page=0> (referer: None)
    2020-06-03 23:32:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/bollywood?page=2> (referer: None)
    2020-06-03 23:32:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/bollywood?page=3> (referer: None)
    2020-06-03 23:32:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/hollywood?page=1> (referer: None)
    2020-06-03 23:32:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/hollywood?page=0> (referer: None)
    2020-06-03 23:32:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/hollywood?page=3> (referer: None)
    2020-06-03 23:32:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/hollywood?page=2> (referer: None)
    2020-06-03 23:32:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/regional-cinema?page=0> (referer: None)
    2020-06-03 23:32:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/regional-cinema?page=1> (referer: None)
    2020-06-03 23:33:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/regional-cinema?page=2> (referer: None)
    2020-06-03 23:33:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/regional-cinema?page=3> (referer: None)
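
A likely explanation, which nothing in this post confirms: Scrapy downloads requests asynchronously, so several requests can be in flight at once and the responses are handled in whatever order they finish, not in the order they were yielded. The sketch below imitates that effect with the standard library only. It is not Scrapy code; `fake_fetch`, `crawl_concurrently`, `crawl_sequentially` and the latencies in `FAKE_LATENCY` are hypothetical names and made-up numbers chosen for illustration.

```python
import asyncio

# Made-up response times in seconds (an assumption, not measured data):
# page 1 happens to answer faster than page 0.
FAKE_LATENCY = {0: 0.10, 1: 0.02, 2: 0.18, 3: 0.26}


async def fake_fetch(page: int, finished: list) -> None:
    """Stand-in for a download: wait, then record the page as finished."""
    await asyncio.sleep(FAKE_LATENCY[page])
    finished.append(page)


async def crawl_concurrently(pages) -> list:
    """Start every download at once; completion order follows latency."""
    finished = []
    await asyncio.gather(*(fake_fetch(p, finished) for p in pages))
    return finished


async def crawl_sequentially(pages) -> list:
    """Start the next download only after the previous one has finished."""
    finished = []
    for p in pages:
        await fake_fetch(p, finished)
    return finished


concurrent_order = asyncio.run(crawl_concurrently(range(4)))
sequential_order = asyncio.run(crawl_sequentially(range(4)))
print("concurrent:", concurrent_order)
print("sequential:", sequential_order)
```

With these made-up latencies the concurrent run finishes in the order 1, 0, 2, 3, while the sequential run keeps 0, 1, 2, 3. In a Scrapy spider, the rough counterpart of the sequential variant would be to yield the request for the next page from the callback that handled the current page, at the cost of a slower crawl. If only the order of the scraped items matters, another option is to carry the page number along with each item and sort afterwards.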

0 answers:

No answers