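Recently I have been working on a project that scrapes a website. The steps my spider follows are:
1. Go to the main problems page and extract all available links (a, b, c, d).
2. After getting the links, follow each one in turn to its status page, which contains the data.
3. Before requesting the status page, add a query to the URL to get the required data. In total, I have to visit the status page twice, each time with a different query, for example https://example.com?language=14 and https://example.com?language=15.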
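To make the query step concrete, here is a minimal sketch of how the status-page URLs could be built. The base URL is the placeholder from the example above, and `build_status_urls` is a hypothetical helper name used only for illustration, not something from my spider.

```python
from urllib.parse import urlencode


def build_status_urls(base_url, language_ids):
    """Return one status-page URL per language id (hypothetical helper)."""
    # urlencode takes care of escaping the query value.
    return [f"{base_url}?{urlencode({'language': lang})}" for lang in language_ids]


urls = build_status_urls("https://example.com", ["14", "15"])
print(urls)
# ['https://example.com?language=14', 'https://example.com?language=15']
```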
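After all the work is done, the data must be returned in an organized way, for example:
{'data': '1', 'status_page_data': {'14': [1, 2, 3], '15': [4, 5, 6]}}
Here [1, 2, 3] is the data extracted from the status pages, one page at a time, for the query '14', and likewise [4, 5, 6] for '15'.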
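The target shape can be sketched as follows. This only illustrates how results from several pages of the same query should end up in one list; `merge_page` is a hypothetical helper, and the numbers are the placeholder values from the example, not real scraped data.

```python
def merge_page(item, language_id, rows):
    """Append rows scraped from one status page under its language id."""
    # setdefault creates the list the first time a language is seen,
    # so rows from later pages of the same query are appended to it.
    item["status_page_data"].setdefault(language_id, []).extend(rows)
    return item


item = {"data": "1", "status_page_data": {}}
merge_page(item, "14", [1, 2])      # first page for language=14
merge_page(item, "14", [3])         # next page for language=14
merge_page(item, "15", [4, 5, 6])   # only page for language=15
print(item)
# {'data': '1', 'status_page_data': {'14': [1, 2, 3], '15': [4, 5, 6]}}
```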
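That is the result I expect. But in my case, I get data like the following:
{'data': '1', 'status_page_data': {'14': [], '15': []}}
{'data': '1', 'status_page_data': {'14': [], '15': [4, 5, 6]}}
and this is repeated every time. One thing I am sure of is that I am doing something wrong with yield, but I am not sure what it is.
I want my data to be returned only once, after all the available data has been scraped.
Sample code: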
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'codespider'
    # URL to start scraping
    start_urls = ['URL']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lang = {'114': ' GO', '20': ' D'}
        self.temp = ''

    def getAnswer(self, response):
        data = response.meta['data']
        a = response.css('tbody')
        # This for loop appends the data if found, otherwise it passes.
        # self.temp is the parameter that was passed in the URL.
        for i in a.css('tr'):
            try:
                data['answer'][self.temp].append(i.css('td::text').extract()[0])
            except (KeyError, IndexError):
                pass
        # If a next page is found then go on scraping, else move out from there
        if '/sites/all/themes/abessive/images/page-next-active.gif' in response.text:
            ans = 'STATUS_URL_PAGE_2'
            yield scrapy.Request(url=ans, callback=self.getAnswer, meta={'data': data})
        # After everything is done yield all the data
        yield data

    def parse(self, response):
        # First extract all the links from the page
        for link in response.css('tr.problemrow'):
            # Items created
            data = {}
            data['name'] = link.css('div.problemname b::text').extract()[0]
            data['code'] = link.css('td a::text').extract()[2]
            data['successfully_submission'] = link.css('td div::text').extract()[2]
            data['accuracy'] = link.css('td a::text').extract()[3]
            data['answer'] = {}
            # With the link extracted, follow its status page to extract the answer.
            # self.lang.keys() supplies the parameters passed in the URL to get the
            # desired output, which is later scraped and appended to data['answer'].
            for i in self.lang.keys():
                ans = 'STATUS_PAGE_URL'
                self.temp = i
                yield scrapy.Request(url=ans, callback=self.getAnswer, meta={'data': data})