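How do I handle pages with missing elements when web scraping? See the code below

Time: 2021-03-19 09:48:14

Tags: python python-3.x web-scraping beautifulsoup

I wrote this code to scrape data from Flipkart's mobiles category. The problem I am facing is an attribute error ("AttributeError: 'NoneType' object has no attribute 'text'") whenever an element is missing. How can I modify this code so that it works? If an element is missing, I need to fill in the data as "Not available". Please see the code below. I am a beginner at programming, and any help would be appreciated.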

'''

import requests

from bs4 import BeautifulSoup

import csv

import re

base_url = "https://www.flipkart.com/search?q=mobiles&page="

def get_urls():
    with open("fliplart-data.csv", "a") as csv_file:

        writer = csv.writer(csv_file)

        writer.writerow(
            ['Product_name', 'Price', 'Rating', 'Product-url'])

        for page in range(1, 510):

            page = base_url + str(page)

            response = requests.get(page).text

            soup = BeautifulSoup(response, 'lxml')

            for product_urls in soup.find_all('a', href=True, attrs={'class': '_1fQZEK'}):
                name = product_urls.find('div', attrs={'class': '_4rR01T'}).text

                price = product_urls.find('div', attrs={'class': '_30jeq3 _1_WHN1'}).text
                price = re.split("\₹", price)
                price = price[-1]

                rating = product_urls.find('div', attrs={'class': '_3LWZlK'}).text

                item_url = soup.find('a', class_="_1fQZEK", target="_blank")['href']

                item_url = " https://www.flipkart.com" + item_url

                item_url = re.split("\&", item_url)

                item_url = item_url[0]

                print(f'Product name is {name}')

                print(f'Product price is {price}')

                print(f'Product rating is {rating}')

                print(f'Product url is {item_url}')

                writer.writerow(
                    [name, price, rating, item_url])

get_urls()

'''

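1 Answer:

Answer 0 (score: 0)

It looks like you want to wrap each lookup in try/except exception handling. When an element is missing, find() returns None, and reading .text from it raises the AttributeError; the except block can then set the string to "Not Available".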

import requests

from bs4 import BeautifulSoup

import csv

import re

base_url = "https://www.flipkart.com/search?q=mobiles&page="

def get_urls(): 
    csv_file = open("fliplart-data.csv", "a")
    writer = csv.writer(csv_file)

    writer.writerow(
        ['Product_name', 'Price', 'Rating', 'Product-url'])

    for page in range(1, 510):

        page = base_url + str(page)

        response = requests.get(page).text

        soup = BeautifulSoup(response, 'lxml')

        for product_urls in soup.find_all('a', href=True, attrs={'class': '_1fQZEK'}):
            
            # name
            try:
                name = product_urls.find('div', attrs={'class': '_4rR01T'}).text
            except AttributeError:  # find() returned None: the element is missing
                name = "Not Available"

            # price
            try:
                price = product_urls.find('div', attrs={'class': '_30jeq3 _1_WHN1'}).text
                price = re.split(r"₹", price)
                price = price[-1]
            except AttributeError:
                price = "Not Available"

            # rating
            try:
                rating = product_urls.find('div', attrs={'class': '_3LWZlK'}).text
            except AttributeError:
                rating = "Not Available"

            # item_url
            try:
                item_url = soup.find('a', class_="_1fQZEK", target="_blank")['href']
                item_url = " https://www.flipkart.com" + item_url
                item_url = re.split(r"&", item_url)
                item_url = item_url[0]
            except (TypeError, KeyError):  # no matching link, or a link without href
                item_url = "Not Available"

            print(f'Product name is {name}')
            print(f'Product price is {price}')
            print(f'Product rating is {rating}')
            print(f'Product url is {item_url}')


            writer.writerow(
                [name, price, rating, item_url])
    # Close the file so that all buffered rows are written to disk
    csv_file.close()

get_urls()

Output

Product name is intaek 5616
Product price is 789
Product rating is Not Available
Product url is  https://www.flipkart.com/kxd-m1/p/itm89bbc238d6356?pid=MOBFUXKG3DYVZRQV
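
As an alternative to repeating a try/except block for every field, the missing-element check could be folded into one small helper. This is only a sketch: the name text_or_default and its parameters are made up for illustration, and it assumes the parent object behaves like a BeautifulSoup tag whose find() returns None when nothing matches.

```python
def text_or_default(parent, tag, class_name, default="Not Available"):
    """Return the text of the first matching child, or `default` if none exists."""
    # find() returns None when nothing matches, so check before reading .text
    element = parent.find(tag, attrs={"class": class_name})
    if element is None:
        return default
    return element.text


# Hypothetical usage inside the loop from the answer's code:
# name = text_or_default(product_urls, "div", "_4rR01T")
# rating = text_or_default(product_urls, "div", "_3LWZlK")
```

Checking for None explicitly keeps the fallback in one place and avoids hiding unrelated errors behind a broad except clause.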

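Judging by your scraped results, the data does not match the URL printed with it: the name is "intaek 5616" while the URL points to "kxd-m1". The most likely cause is that item_url comes from soup.find(...), which returns the first matching link on the whole page, rather than from the current product_urls element. This may also be part of the problem you are having.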
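
One way to keep each row's URL in step with its name is to build it from the link currently being processed. The sketch below assumes the mismatch comes from soup.find(...) returning the first matching link on the page; the names SITE_ROOT and clean_product_url are hypothetical, and the "&lid=example" suffix in the usage comment is made-up sample data.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.flipkart.com"  # site root used in the code above


def clean_product_url(href):
    """Build an absolute product URL and drop everything from the first '&' on."""
    absolute = urljoin(SITE_ROOT, href)  # joins the root with a relative href
    return absolute.split("&")[0]


# Hypothetical usage inside the loop: read href from the current product link.
# find_all(..., href=True) only yields links that have an href attribute.
# item_url = clean_product_url(product_urls["href"])
#
# clean_product_url("/kxd-m1/p/itm89bbc238d6356?pid=MOBFUXKG3DYVZRQV&lid=example")
# returns "https://www.flipkart.com/kxd-m1/p/itm89bbc238d6356?pid=MOBFUXKG3DYVZRQV"
```

Using urljoin also avoids the stray leading space that the string concatenation in the original code puts in front of the URL.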