Problem scraping HTML with bs4

Time: 2020-06-13 09:31:22

Tags: python html beautifulsoup

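I am trying to scrape the HTML below with the following Python bs4 script, but I keep getting the error listed further down. I am not sure what is causing it. If anyone could help me figure out how to get it working, that would be great!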

<span id="prodInfoPriceVat" class="prodInfoPriceVat" data-price-vat="24.73">£24.73</span>

Python BS4 script:

prices = {

    "GLDAG_MAPLE":        {"url":    "https://www.gold.co.uk/silver-coins/candian-silver-maple-coins/1oz-canadian-maple-silver-coin-2020/",
                           "trader": "Gold.co.uk",
                           "metal":  "Silver",
                           "type":   "Maple"},
    "BBPAG_MAPLE":        {"url": "https://www.bullionbypost.co.uk/silver-coins/canadian-maple-1oz-silver-coin/2019-1oz-canadian-maple-silver-coin/",
                           "trader": "Bullion By Post",
                           "metal":  "Silver",
                           "type":   "Maple"},
    "ATKAG_BRITANNIA":    {"url": "https://atkinsonsbullion.com/silver/silver-coins/1oz-silver-coins/2020-uk-britannia-1oz-silver-coin",
                           "trader": "Atkinsons Bullion",
                           "metal":  "Silver",
                           "type":   "Britannia"},
}

response = requests.get(
    'https://www.bullionbypost.co.uk/silver-price/silver-price-per-gram/')
soup = BeautifulSoup(response.text, 'html.parser')
AG_GRAM_SPOT = soup.find(
    'span', {'name': 'current_price_field'}).get_text()

# Convert to float
AG_GRAM_SPOT = float(re.sub(r"[^0-9\.]", "", AG_GRAM_SPOT))
# No need for another lookup
AG_OUNCE_SPOT = AG_GRAM_SPOT * 31.1035

for coin in prices:
    response = requests.get(prices[coin]["url"])
    soup = BeautifulSoup(response.text, 'html.parser')

    try:
        text_price = soup.find(
            'td', {'id': 'price-inc-vat-per-unit-1'}).get_text()         # BullionByPost
    except:
        text_price = soup.find(
            'td', {'id': 'total-price-inc-vat-1'}).get_text()            # Gold.co.uk
    else:
        text_price = soup.find(
            'span', {'class': 'prodInfoPriceVat'}).get_text()         # Issues here! Line 70

    # Grab the number
    prices[coin]["price"] = float(re.sub(r"[^0-9\.]", "", text_price))

I keep getting this error. How can I fix it?

Traceback (most recent call last):
  File "scraper.py", line 70, in <module>
    text_price = soup.find(
AttributeError: 'NoneType' object has no attribute 'get_text'

How can I get this working?

1 answer:

Answer 0: (score: 1)

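There is no need to use exceptions here. Just use if..else and test whether the element that was found is None. In the question's code, the else clause of the try statement runs whenever the try block succeeds, so on a page where the first lookup matches, the prodInfoPriceVat lookup most likely returns None, and calling get_text() on it raises the AttributeError.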
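To make the None behaviour concrete, here is a minimal sketch that runs against the span quoted in the question instead of a live page. The variable names (html, missing, found, price_text) are illustrative and not part of the original script.

```python
from bs4 import BeautifulSoup

# The span from the question, used here as a stand-in for a real page.
html = ('<span id="prodInfoPriceVat" class="prodInfoPriceVat" '
        'data-price-vat="24.73">£24.73</span>')
soup = BeautifulSoup(html, "html.parser")

# This id does not exist in the snippet, so find() returns None
# rather than raising an exception.
missing = soup.find("td", {"id": "price-inc-vat-per-unit-1"})

# Calling a method on None is what produces the AttributeError.
try:
    missing.get_text()
except AttributeError as exc:
    error_message = str(exc)

# Testing for None before calling get_text() avoids the error.
found = soup.find("span", {"class": "prodInfoPriceVat"})
price_text = found.get_text(strip=True) if found is not None else None
```

Because find() signals "no match" through its return value, a plain None check is enough and the try/except is not needed.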

For example, applied to the script from the question:

import re
import requests
from bs4 import BeautifulSoup

prices = {

    "GLDAG_MAPLE":        {"url":    "https://www.gold.co.uk/silver-coins/candian-silver-maple-coins/1oz-canadian-maple-silver-coin-2020/",
                           "trader": "Gold.co.uk",
                           "metal":  "Silver",
                           "type":   "Maple"},
    "BBPAG_MAPLE":        {"url": "https://www.bullionbypost.co.uk/silver-coins/canadian-maple-1oz-silver-coin/2019-1oz-canadian-maple-silver-coin/",
                           "trader": "Bullion By Post",
                           "metal":  "Silver",
                           "type":   "Maple"},
    "ATKAG_BRITANNIA":    {"url": "https://atkinsonsbullion.com/silver/silver-coins/1oz-silver-coins/2020-uk-britannia-1oz-silver-coin",
                           "trader": "Atkinsons Bullion",
                           "metal":  "Silver",
                           "type":   "Britannia"},
}

response = requests.get(
    'https://www.bullionbypost.co.uk/silver-price/silver-price-per-gram/')
soup = BeautifulSoup(response.text, 'html.parser')
AG_GRAM_SPOT = soup.find(
    'span', {'name': 'current_price_field'}).get_text()

# Convert to float
AG_GRAM_SPOT = float(re.sub(r"[^0-9\.]", "", AG_GRAM_SPOT))
# No need for another lookup
AG_OUNCE_SPOT = AG_GRAM_SPOT * 31.1035

for coin in prices:
    print('url=', prices[coin]["url"])
    response = requests.get(prices[coin]["url"])
    soup = BeautifulSoup(response.text, 'html.parser')

    text_price = soup.find(
        'td', {'id': 'price-inc-vat-per-unit-1'})        # BullionByPost

    if not text_price:
        text_price = soup.find(
            'td', {'id': 'total-price-inc-vat-1'})       # Gold.co.uk

    if not text_price:
        text_price = soup.find(
            'span', {'class': 'prodInfoPriceVat'})       # atkinsonsbullion.com

    if not text_price:
        print('Error, unable to find price for url=', prices[coin]["url"])
        prices[coin]["price"] = float('nan')
        continue

    text_price = text_price.get_text(strip=True)

    # Grab the number
    prices[coin]["price"] = float(re.sub(r"[^0-9\.]", "", text_price))
    print('price=', prices[coin]["price"])

Prints:

url= https://www.gold.co.uk/silver-coins/candian-silver-maple-coins/1oz-canadian-maple-silver-coin-2020/
price= 31.32
url= https://www.bullionbypost.co.uk/silver-coins/canadian-maple-1oz-silver-coin/2019-1oz-canadian-maple-silver-coin/
price= 26.88
url= https://atkinsonsbullion.com/silver/silver-coins/1oz-silver-coins/2020-uk-britannia-1oz-silver-coin
price= 24.73
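The chain of "if not text_price" checks above could also be written as a loop over a list of selectors, which may be easier to extend when more traders are added. The following is only a sketch under assumptions: the names PRICE_SELECTORS and extract_price are hypothetical, it uses BeautifulSoup's select_one() with CSS selectors in place of find(), and it parses an HTML string so that it can be tried without network access.

```python
import math
import re

from bs4 import BeautifulSoup

# CSS equivalents of the three lookups in the answer, tried in order.
PRICE_SELECTORS = [
    "td#price-inc-vat-per-unit-1",   # BullionByPost
    "td#total-price-inc-vat-1",      # Gold.co.uk
    "span.prodInfoPriceVat",         # atkinsonsbullion.com
]


def extract_price(html, selectors=PRICE_SELECTORS):
    """Return the first matching price as a float, or NaN if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        element = soup.select_one(selector)
        if element is None:
            continue  # selector not present on this page, try the next one
        # Keep only digits and the decimal point, as in the original script.
        digits = re.sub(r"[^0-9.]", "", element.get_text(strip=True))
        if digits:
            return float(digits)
    return float("nan")


# Usage with the span quoted in the question.
sample = ('<span id="prodInfoPriceVat" class="prodInfoPriceVat" '
          'data-price-vat="24.73">£24.73</span>')
sample_price = extract_price(sample)
no_price = extract_price("<p>no price here</p>")
```

In the scraping loop, extract_price(response.text) would replace the three find() calls. Returning NaN mirrors the float('nan') fallback in the answer, so callers can check for a missing price with math.isnan().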