Web scraping with BeautifulSoup

Date: 2020-09-26 03:06:29

Tags: python web-scraping beautifulsoup python-beautifultable

I'm new to this, and I'm trying to collect the links to all products across all sub-pages (1-8) of this page: https://www.sodimac.cl/sodimac-cl/category/scat359268/Esmaltes-al-agua

I loop over each page, but for some reason page 7 only brings back 20 products, and page 8 brings back none.

This function gets me the URL of every product on a given page:

import requests
from bs4 import BeautifulSoup

def get_all_product_url(base_url):
    # Fetch the listing page and parse it
    page = requests.get(base_url, stream=True)
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    url_list = []
    # find_all never raises; it simply returns an empty list when nothing matches,
    # so no try/except is needed here
    products = soup.find_all('div', {'class': 'jsx-3418419141 product-thumbnail'})
    for product in products:
        url = product.find('a').get('href')
        # Prefix the domain only when the href is relative
        if 'https://www.sodimac.cl' in url:
            url_list.append(url)
        else:
            url_list.append('https://www.sodimac.cl' + url)
    # Return all product URLs without duplicates
    return list(set(url_list))
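The href-normalization step inside that loop can be checked offline; a minimal sketch, using a made-up inline HTML fragment (a stand-in for the real page markup) with one relative and one absolute href:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the product-thumbnail markup, so the
# normalization logic can be verified without hitting the site
html = """
<div class="jsx-3418419141 product-thumbnail"><a href="/product/123"></a></div>
<div class="jsx-3418419141 product-thumbnail"><a href="https://www.sodimac.cl/product/456"></a></div>
"""

soup = BeautifulSoup(html, 'html.parser')
urls = []
for div in soup.find_all('div', {'class': 'jsx-3418419141 product-thumbnail'}):
    href = div.find('a').get('href')
    # Prefix the domain only when the href is relative
    urls.append(href if href.startswith('https://www.sodimac.cl')
                else 'https://www.sodimac.cl' + href)

print(sorted(set(urls)))
# → ['https://www.sodimac.cl/product/123', 'https://www.sodimac.cl/product/456']
```

Note that passing the full class string to `find_all` works because BeautifulSoup also matches the exact value of the `class` attribute.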

When I run it on page 8, I get an empty list:

base_url = "https://www.sodimac.cl/sodimac-cl/category/scat359268/Esmaltes-al-agua?currentpage=8"
url_list = get_all_product_url(base_url)
url_list
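To see how many product cards the server actually returns for a given page, the raw HTML can be counted directly; a small helper sketch (the `sample` string here is a stand-in for `page.content` from `requests.get(...)`):

```python
from bs4 import BeautifulSoup

def count_products(html):
    # Count the product-thumbnail divs present in the raw HTML
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.find_all('div', {'class': 'jsx-3418419141 product-thumbnail'}))

# Stand-in HTML; in practice pass page.content from requests.get(...)
sample = '<div class="jsx-3418419141 product-thumbnail"><a href="/p/1"></a></div>' * 3
print(count_products(sample))  # → 3
```

If this count is 0 for page 8 while the browser shows products, the listing may be rendered by JavaScript after load, which plain `requests` would not see.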

If you run it on page 1, you get 28 entries:

base_url = "https://www.sodimac.cl/sodimac-cl/category/scat359268/Esmaltes-al-agua?currentpage=1"
url_list = get_all_product_url(base_url)
url_list
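The two per-page snippets above can be folded into one loop; a sketch, assuming the `currentpage` query parameter shown in the question and the `get_all_product_url` function defined there:

```python
def page_urls(base, pages):
    # Build the listing URL for each page, 1..pages
    return [f"{base}?currentpage={n}" for n in range(1, pages + 1)]

base = "https://www.sodimac.cl/sodimac-cl/category/scat359268/Esmaltes-al-agua"
print(page_urls(base, 2))
# → ['...?currentpage=1', '...?currentpage=2'] (full URLs)

# With the function from the question, all product URLs could then be
# collected and de-duplicated across pages:
#
# all_urls = set()
# for url in page_urls(base, 8):
#     all_urls.update(get_all_product_url(url))
```

De-duplicating across pages (rather than per page) also makes it easy to see whether page 7's 20 products are genuinely new or repeats of earlier pages.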

Any help would be greatly appreciated.

Thanks

0 Answers:

No answers yet