Async loop mixes up meta items

Time: 2019-03-29 21:24:11

Tags: web-scraping scrapy web-crawler

I am trying to scrape the following website:

https://institucional.xpi.com.br/sobre-a-xp/encontre-um-escritorio/

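The page has a dropdown to select a state and, for that state, a second dropdown to select one of the available cities.

After submitting, I get a list of all the offices in that city (address, email, phone number).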

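With this code I am not getting all the results, and I am also getting duplicate city names. It looks like the meta items from one loop are getting mixed together. I tried debugging, and this is what happens:

I start the first parse function and step into the loop over the states. When I reach the yield line I have the first state ("AC"), and I expect execution to go into the parseStates function, but instead the loop starts again.

The problem is that it does not finish the whole loop either: it loops over the first five states and then jumps into the parseStates function.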

# Assumed imports (not shown in the original post): ET is taken to be the
# standard-library ElementTree; LocationItem and write() are my own helpers.
import xml.etree.ElementTree as ET
import scrapy

def parse(self, response):
    statesList = ["AC","AL","AM","BA","CE","DF","ES","GO","MA","MG","MS","MT","PA","PB","PE","PR","RJ","RN","RO","RS","SC","SE","SP"]

    for state in statesList:
        linkState = 'https://institucional.xpi.com.br/api/Escritorios/FilialListarCidadesV2?vSiglaEstado=' + state
        location = LocationItem()
        location['state'] = state

        # The callback is not called here: Scrapy schedules the request and
        # runs parseStates later, once the response has arrived.
        yield scrapy.Request(url=linkState, callback=self.parseStates, meta={'item': location})

def parseStates(self, response):
    location = response.meta['item']

    root = ET.fromstring(response.body)
    cityList = [city.text for city in root.iter('{http://schemas.datacontract.org/2004/07/XP.Portal.Entities}Nome')]

    for city in cityList:
        # The same location object is updated and attached to every request in this loop.
        location['city'] = city
        state = location['state']

        linkCity = 'https://institucional.xpi.com.br/api/Escritorios/FilialListarPorEstadoCidadeV2?vSiglaEstado=' + state + '&vNomeCidade=' + city.replace(' ', '%20')
        yield scrapy.Request(url=linkCity, callback=self.parseCities, meta={'item': location})

def parseCities(self, response):
    location = response.meta['item']
    state = location['state']
    city = location['city']

    root = ET.fromstring(response.body)

    # One list per field; entries are matched up by position below.
    mailList = [elem.text for elem in root.iter('{http://schemas.datacontract.org/2004/07/XP.Portal.Entities}EmailPadronizadoSocioResponsavel')]
    companyList = [elem.text for elem in root.iter('{http://schemas.datacontract.org/2004/07/XP.Portal.Entities}RazaoSocial')]
    contactList = [elem.text for elem in root.iter('{http://schemas.datacontract.org/2004/07/XP.Portal.Entities}SocioResponsavel')]
    telList = [elem.text for elem in root.iter('{http://schemas.datacontract.org/2004/07/XP.Portal.Entities}Telefone')]

    for i in range(len(mailList)):
        write(state, city, companyList[i], contactList[i], mailList[i], telList[i])
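
One possible explanation for the duplicated city names, offered as a sketch and not a confirmed diagnosis: parseStates attaches the same `location` object to every city request, and the callbacks only run later, after the loop has already overwritten `location['city']`. The snippet below reproduces that effect in plain Python without Scrapy. The `run` function is a toy stand-in for a scheduler (not Scrapy's actual engine), the dicts stand in for `LocationItem`, and the city names are placeholder values, not data taken from the API.

```python
from collections import deque


def parse_states_shared(state, cities):
    """Mimics the loop in parseStates: one mutable dict reused for every city."""
    location = {"state": state}
    for city in cities:
        location["city"] = city  # overwrites the same object each time
        yield {"url": city, "meta": {"item": location}}


def parse_states_copied(state, cities):
    """Same loop, but every request carries its own fresh dict."""
    for city in cities:
        yield {"url": city, "meta": {"item": {"state": state, "city": city}}}


def run(request_generator):
    """Toy scheduler: queue all requests first, run the 'callbacks' afterwards."""
    queue = deque(request_generator)  # drains the generator before any callback
    seen = []
    while queue:
        request = queue.popleft()
        # This is what a callback would read from response.meta['item'].
        seen.append(request["meta"]["item"]["city"])
    return seen


cities = ["Rio Branco", "Cruzeiro do Sul", "Xapuri"]  # placeholder values
shared = run(parse_states_shared("AC", cities))
copied = run(parse_states_copied("AC", cities))
print(shared)  # ['Xapuri', 'Xapuri', 'Xapuri']
print(copied)  # ['Rio Branco', 'Cruzeiro do Sul', 'Xapuri']
```

If this is indeed what happens in the spider, building a new item per city inside the loop (for example `LocationItem(state=state, city=city)`) and passing that in `meta` should keep each request's data separate. The same deferred execution would also explain the debugger behaviour: yielding a request only hands it to the scheduler, so the loop keeps going and the callback fires whenever a response comes back.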

0 answers:

No answers