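I am trying to scrape data from several sections of a website. The problem is that when I paginate, the spider does not visit the page numbers in serial order; it goes to them in a seemingly random order.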
import scrapy
import re
from webpreview import OpenGraph
import json
from newspaper import Article


class IndiaTodaySpider(scrapy.Spider):
    name = 'indiatoday_story'

    def start_requests(self):
        urls = [
            'https://www.indiatoday.in/movies/celebrities?page={}',
            'https://www.indiatoday.in/movies/bollywood?page={}',
            'https://www.indiatoday.in/movies/hollywood?page={}',
            'https://www.indiatoday.in/movies/regional-cinema?page={}',
            # 'https://www.indiatoday.in/movies/standpoint?page={}',
            # 'https://www.indiatoday.in/movies/gossip?page={}',
        ]
        for url in urls:
            # Request pages 0 to 3 of each section.
            for i in range(0, 4):
                x = url.format(i)
                yield scrapy.Request(url=x, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Relative links to the stories listed on this page.
        all_hrefs = response.xpath('/html/body/div[1]/main/div/section/div[3]/div[1]/div[*]/div[2]/h2/a/@href').getall()
        for i in all_hrefs:
            print('---------------------------------------------')
            print('*********************************************')
            print('https://www.indiatoday.in' + i)
            print('*********************************************')
            print('---------------------------------------------')
            # Fetch the Open Graph metadata of the story page.
            og = OpenGraph('https://www.indiatoday.in' + i, ["og:title", "og:description", "og:image", "og:url"])
            print('---------------------------------------------')
            print('*********************************************')
            print(og.title)
            print(og.description)
            print(og.image)
            print(og.url)
            print('*********************************************')
            print('---------------------------------------------')
            yield {
                "page_title": og.title,
                "description": og.description,
                "image_url": og.image,
                "post_url": og.url
            }
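This is the log output for the crawled pages. Instead of being crawled in the order 0, 1, 2, 3, the pages come back as 1, 0, 2, 3 or 1, 0, 3, 2. Why does this happen? Can anyone help me?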
2020-06-03 23:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/celebrities?page=0> (referer: None)
2020-06-03 23:32:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/celebrities?page=1> (referer: None)
2020-06-03 23:32:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/celebrities?page=2> (referer: None)
2020-06-03 23:32:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/celebrities?page=3> (referer: None)
2020-06-03 23:32:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/bollywood?page=1> (referer: None)
2020-06-03 23:32:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/bollywood?page=0> (referer: None)
2020-06-03 23:32:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/bollywood?page=2> (referer: None)
2020-06-03 23:32:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/bollywood?page=3> (referer: None)
2020-06-03 23:32:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/hollywood?page=1> (referer: None)
2020-06-03 23:32:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/hollywood?page=0> (referer: None)
2020-06-03 23:32:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/hollywood?page=3> (referer: None)
2020-06-03 23:32:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/hollywood?page=2> (referer: None)
2020-06-03 23:32:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/regional-cinema?page=0> (referer: None)
2020-06-03 23:32:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/regional-cinema?page=1> (referer: None)
2020-06-03 23:33:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/regional-cinema?page=2> (referer: None)
2020-06-03 23:33:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indiatoday.in/movies/regional-cinema?page=3> (referer: None)
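To illustrate what I suspect is going on, here is a minimal sketch in plain asyncio (not Scrapy itself). It assumes that requests issued together complete in whatever order the responses arrive, which would explain the log above. The `LATENCY` values and the function names are made up for illustration; they are not measurements from the site.

```python
import asyncio

# Hypothetical per-page response times in seconds (invented values),
# chosen so that page 1 answers faster than page 0.
LATENCY = {0: 0.10, 1: 0.02, 2: 0.18, 3: 0.26}


async def fetch(page, finished):
    """Stand-in for a network request: wait, then record the completion."""
    await asyncio.sleep(LATENCY[page])
    finished.append(page)


async def crawl_concurrently(pages):
    """Issue every request at once and record the order in which they finish."""
    finished = []
    await asyncio.gather(*(fetch(p, finished) for p in pages))
    return finished


async def crawl_sequentially(pages):
    """Issue the next request only after the previous one has finished."""
    finished = []
    for p in pages:
        await fetch(p, finished)
    return finished


concurrent_order = asyncio.run(crawl_concurrently(range(4)))
sequential_order = asyncio.run(crawl_sequentially(range(4)))
print("concurrent:", concurrent_order)
print("sequential:", sequential_order)
```

If this guess is right, the sequential variant would correspond in Scrapy to yielding the request for the next page from the callback of the current page, rather than yielding all of them from `start_requests`. I have not verified that against my spider yet.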