How to output data in an organized way with Scrapy

Date: 2017-05-25 12:03:11

Tags: python-3.x scrapy

Recently I have been working on a project that scrapes a website. The steps my spider follows are:

  1. Go to the main problems page and extract all the available links (a, b, c, d).

  2. After collecting the links, it follows each one in turn to its status page, which contains the data.

  3. Before landing on the status page, it adds a query to the URL to request the desired data. Overall, I have to visit each status page twice with different queries, for example: https://example.com?language=14 and https://example.com?language=15.

  4. The status page sometimes spans multiple pages (?page=1, ?page=2, and so on). So if there are more results on a status page, the spider has to go to the next page and scrape that data too.
  5. Once all of that is done, the data has to be returned in an organized way, for example:

    {'data': '1', 'status_page_data': {'14': [1, 2, 3], '15': [4, 5, 6]}}

    Here [1, 2, 3] is the data extracted from the status pages, page by page, for query '14'.

    That is what I want. But in my case I get the data like this:

    {'data': '1', 'status_page_data': {'14': [], '15': []}}

    {'data': '1', 'status_page_data': {'14': [], '15': [4, 5, 6]}}

    and this repeats every time. One thing I am sure of is that I am doing something wrong with yield, but I can't tell what.
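A minimal toy sketch (plain Python, not Scrapy) of what produces that output: every callback yields the same shared dict, so one partially-filled snapshot is emitted per callback instead of a single finished item. The languages and values here are made up to mirror the output above:

```python
import copy

# Each fake "callback" fills in its language's slice of a shared dict
# and then yields it, much like the spider's getAnswer does.
def callbacks():
    data = {'data': '1', 'status_page_data': {}}
    for lang, values in [('14', []), ('15', [4, 5, 6])]:
        data['status_page_data'][lang] = values
        # deepcopy stands in for the snapshot an exporter records at yield time
        yield copy.deepcopy(data)

items = list(callbacks())
# Two items come out, and only the last one is complete.
```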

    I want each item to be returned only once, after all of its available data has been scraped.

    Example code:

    import scrapy

    class LoginSpider(scrapy.Spider):

        name = 'codespider'

        # URL to start scraping
        start_urls = ['URL']

        def __init__(self):
            super().__init__()
            self.lang = {'114': ' GO', '20': ' D'}
            self.temp = ''

        def getAnswer(self, response):
            data = response.meta['data']
            a = response.css('tbody')

            # Append each row's cell data if found, otherwise skip the row.
            # self.temp is the language parameter that was passed.
            for i in a.css('tr'):
                try:
                    data['answer'][self.temp].append(i.css('td::text').extract()[0])
                except (KeyError, IndexError):
                    pass

            # If a next page is found, go on scraping; otherwise move on
            if '/sites/all/themes/abessive/images/page-next-active.gif' in response.text:
                ans = 'STATUS_URL_PAGE_2'
                yield scrapy.Request(url=ans, callback=self.getAnswer, meta={'data': data})

            # After everything is done, yield all the data
            yield data

        def parse(self, response):
            # First extract all the links from the page
            for link in response.css('tr.problemrow'):
                # Item created
                data = {}
                data['name'] = link.css('div.problemname b::text').extract()[0]
                data['code'] = link.css('td a::text').extract()[2]
                data['successfully_submission'] = link.css('td div::text').extract()[2]
                data['accuracy'] = link.css('td a::text').extract()[3]
                data['answer'] = {}

                # With each link extracted, follow its status page to get the answer.
                # self.lang.keys() supplies the URL parameters for the desired output,
                # which is later scraped and appended to data['answer'].
                for i in self.lang.keys():
                    ans = 'STATUS_PAGE_URL'
                    self.temp = i
                    yield scrapy.Request(url=ans, callback=self.getAnswer, meta={'data': data})
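One way to get each item yielded exactly once is to stop sharing state through `self.temp`, carry the current language in each request's `meta`, and chain the requests: next page → next language → finally yield the item. The sketch below models that flow in plain Python so it can run standalone: generator functions stand in for Scrapy callbacks, plain dicts stand in for `scrapy.Request`, and a made-up `FAKE_PAGES` table stands in for `response.css()` results. It illustrates the structure, not a drop-in replacement:

```python
FAKE_PAGES = {
    '14': [[1, 2], [3]],      # language '14' has two result pages (hypothetical data)
    '15': [[4, 5, 6]],        # language '15' has one result page (hypothetical data)
}

def get_answer(meta):
    data, code = meta['data'], meta['code']
    page, pending = meta['page'], meta['pending']
    data['answer'][code].extend(FAKE_PAGES[code][page])

    if page + 1 < len(FAKE_PAGES[code]):
        # More result pages for this language: follow the "next page" link.
        yield {'callback': get_answer, 'meta': {**meta, 'page': page + 1}}
    elif pending:
        # This language is finished; chain to the next one.
        yield {'callback': get_answer,
               'meta': {'data': data, 'code': pending[0],
                        'page': 0, 'pending': pending[1:]}}
    else:
        # Last page of the last language: the item is complete -
        # this is the ONLY place the item is yielded.
        yield data

def crawl():
    # Stands in for Scrapy's engine, which keeps following yielded requests.
    codes = ['14', '15']
    data = {'data': '1', 'answer': {c: [] for c in codes}}
    queue = [{'callback': get_answer,
              'meta': {'data': data, 'code': codes[0],
                       'page': 0, 'pending': codes[1:]}}]
    items = []
    while queue:
        request = queue.pop(0)
        for result in request['callback'](request['meta']):
            if isinstance(result, dict) and 'callback' in result:
                queue.append(result)    # a follow-up "request"
            else:
                items.append(result)    # a finished item
    return items
```

In the real spider, each yielded dict would become a `scrapy.Request(url=..., callback=self.getAnswer, meta=...)` with the language and the remaining languages carried in `meta`, and only the final `else` branch yields `data`.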
    

0 Answers:

No answers