Web scraper returns multiple errors

Time: 2019-08-08 15:18:17

Tags: python web-scraping

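I'm writing a web scraper for an insurance website that retrieves the model, brand, sub-brand, and description and writes them out as CSV. When I run my code it sometimes works, and other times I get several errors ("list indices must be integers", "Expecting value: line 1 column 1", "JSON decoder not working").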

I tried inserting print statements to see where the problem is, but I still haven't solved it.

import requests
import time
import json


session = requests.Session()
request_marcas = session.get('https://www.citibanamexchubb.com/api/chubbnet/auto/brands-subbrands')
data = request_marcas.json()
fileCSV = open("webscraper_test.csv", "a")
fileCSV.write('Modelo' + ';' + 'ID_Marca' + ";" + 'ID_Submarca' + ";" + "ID_Tipo" + ";" + "Marca" +";"+ "Tipo"+ 'Descripcion' + "\n")

for i in range(2019, 2020):
        for marca in data['MARCA']:
            for submarca in marca['SUBMARCAS']:
                modelos = []
                modelos.append('https://www.citibanamexchubb.com/api/chubbnet/auto/models/' + marca['ID'] + '/' + submarca['ID'] + '/' + str(i))
                for link in modelos:
                    json_link = []
                    request_link = session.get(link).json()
                    json_link.append(request_link)
                    #print(request_link)
                    for desc_id in request_link['TIPO']:
                        #print(desc_id['ID'])
                        desc_detail = []
                        desc_detail.append(session.get('https://www.citibanamexchubb.com/api/chubbnet/auto/descriptions/' + desc_id['ID'] + '/2018').json())
                        #print(desc_detail)
                        try:
                            for desc in desc_detail['DESCRIPCION']:
                                print(desc['DESC'])
                        except Exception as e:
                            None

1 answer:

Answer 0 (score: 2)

So, there's some weird variation in the auto/models endpoint you're scraping. For example, https://www.citibanamexchubb.com/api/chubbnet/auto/models/7/8/2019 returns the following:

{
  "TIPO": {
    "ID": "381390223",
    "DESC": "MINI COOPER"
  }
}

whereas https://www.citibanamexchubb.com/api/chubbnet/auto/models/1/1/2019 returns the following:

{
  "TIPO": [
    {
      "ID": "364026215",
      "DESC": "MDX"
    },
    {
      "ID": "364026216",
      "DESC": "RDX"
    },
    {
      "ID": "364031544",
      "DESC": "ILX"
    },
    {
      "ID": "364031613",
      "DESC": "TLX"
    },
    {
      "ID": "364031674",
      "DESC": "NSX"
    }
  ]
}

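So in the first one, "TIPO" is a dictionary, while in the second, "TIPO" is a list. Iterating over a dictionary yields its keys, so in the first case desc_id is a string and desc_id['ID'] raises a TypeError. I've modified your script so that it runs without throwing any errors. I'm sure it's not exactly what you want, but at least it handles the difference between the two types: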

import requests
import time
import json


session = requests.Session()
request_marcas = session.get('https://www.citibanamexchubb.com/api/chubbnet/auto/brands-subbrands')
data = request_marcas.json()
fileCSV = open("webscraper_test.csv", "a")
fileCSV.write('Modelo' + ';' + 'ID_Marca' + ";" + 'ID_Submarca' + ";" + "ID_Tipo" + ";" + "Marca" + ";" + "Tipo" + ";" + 'Descripcion' + "\n")

for i in range(2019, 2020):
        for marca in data['MARCA']:
            for submarca in marca['SUBMARCAS']:
                modelos = []
                modelos.append('https://www.citibanamexchubb.com/api/chubbnet/auto/models/' + marca['ID'] + '/' + submarca['ID'] + '/' + str(i))
                for link in modelos:
                    json_link = []
                    request_link = session.get(link).json()
                    json_link.append(request_link)
                    #print(request_link)

                    # here's where I've made some changes:
                    desc_detail = []
                    if isinstance(request_link['TIPO'], dict):
                        desc_detail.append(session.get(
                            'https://www.citibanamexchubb.com/api/chubbnet/auto/descriptions/' + request_link['TIPO'][
                                'ID'] + '/2018').json())
                        print(request_link['TIPO']['DESC'])
                    elif isinstance(request_link['TIPO'], list):
                        for item in request_link['TIPO']:
                            desc_detail.append(session.get('https://www.citibanamexchubb.com/api/chubbnet/auto/descriptions/' + item['ID'] + '/2018').json())
                            print(item['DESC'])
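
The dict-versus-list check can also be folded into a small helper, so the loop body only has to be written once. This is a minimal sketch, not part of the original script: as_list and parse_json are hypothetical helper names, and the sample payloads simply mirror the two responses shown above. parse_json is included on the assumption that the "Expecting value: line 1 column 1" error from the question comes from a response body that is empty or not JSON, which is the message the json module produces in that case.

```python
import json


def as_list(value):
    """Return value as a list: None -> [], a single dict -> [dict], a list -> a copy."""
    if value is None:
        return []
    if isinstance(value, dict):
        return [value]
    return list(value)


def parse_json(text):
    """Decode a response body, returning None instead of raising on empty or non-JSON text."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Sample payloads mirroring the two shapes returned by the auto/models endpoint.
single = parse_json('{"TIPO": {"ID": "381390223", "DESC": "MINI COOPER"}}')
several = parse_json('{"TIPO": [{"ID": "364026215", "DESC": "MDX"}, {"ID": "364026216", "DESC": "RDX"}]}')
broken = parse_json('')  # an empty body, assumed here to stand in for a failed request

for payload in (single, several, broken):
    if payload is None:
        continue  # skip bodies that could not be decoded
    for tipo in as_list(payload.get('TIPO')):
        print(tipo['ID'], tipo['DESC'])
```

In the scraper itself, parse_json(session.get(link).text) could take the place of session.get(link).json(), and a single loop over as_list(request_link['TIPO']) could replace the two isinstance branches.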

Hope that helps!