Using Python

Date: 2019-11-07 12:33:46

Tags: python html web-scraping beautifulsoup

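This code fetches image URLs from a website, but for some pages that have no img data I get "list index out of range". How can I get around this? I have already used a lot of try/except blocks; is there another way besides try/except?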

url = https://www.redbook.com.au/cars/details/2016-isuzu-d-max-ls-u-high-ride-auto-4x2-my155/SPOT-ITM-445820/

For pages that have no image, I get this error:

list index out of range

For example, with this URL:

https://www.redbook.com.au/cars/details/2019-audi-s3-auto-quattro-my19/SPOT-ITM-522293/

How can I skip such cases?

Code:

# -*- coding: utf-8 -*-
import json
import re

import lxml.html as lh
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from lxml import html

cars = []  # global list for storing each car_data object

with open('url.txt') as f:
    # read the file into a list of lines, without trailing newlines
    urls = f.read().splitlines()

for url in urls:
    car_data = {}  # per-URL record
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    soup = bs(page.content, 'html.parser')

    # raises IndexError when the page has no matching <img> element
    img_url = tree.xpath('//ul/li/a/img/@src')[0]
    img_url = str(img_url)
    img_url = img_url + '0'
    car_data['image_url'] = img_url
    script = soup.find('script', text=re.compile('CsnInsights.metaData'))
    jsonData = json.loads(
        script.text.split('CsnInsights.metaData = ')[-1].rsplit(';', 1)[0]
    )



1 Answer:

Answer 0: (score: 2)

You can apply the EAFP principle and handle IndexError, which is the built-in exception raised in this case:

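try:
    img_url = str(tree.xpath('//ul/li/a/img/@src')[0]) + '0'
except IndexError:
    img_url = ''

Note that I used an empty string as the image URL value when it is not available (it could not be extracted from the HTML), but depending on your case you could choose a different value, for example None, or skip processing of that item entirely (e.g. with continue inside the loop).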
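Since the question also asks for an alternative to try/except, here is a minimal sketch of the "look before you leap" variant: check whether the XPath result list is empty before indexing it. The helper name first_image_url and its default parameter are hypothetical, introduced only for illustration; the '0' suffix is taken from the question's code.

```python
def first_image_url(srcs, default=None):
    """Return the first src value with the question's '0' suffix appended.

    `srcs` is expected to be the list returned by a call such as
    tree.xpath('//ul/li/a/img/@src'). When the list is empty (the page
    has no matching <img>), `default` is returned instead of raising
    IndexError. The function name and `default` are hypothetical.
    """
    if not srcs:  # empty list: nothing to index
        return default
    return str(srcs[0]) + '0'


# Possible use inside the question's loop (assumes `tree` and `car_data`
# exist as in the question):
#     img_url = first_image_url(tree.xpath('//ul/li/a/img/@src'))
#     if img_url is None:
#         continue  # skip pages without an image
#     car_data['image_url'] = img_url
```

Whether this reads better than try/except is largely a matter of taste; both avoid the crash, and the explicit check makes the "no image" case visible at the call site.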