Python web scraping - looping through all categories and subcategories

Date: 2017-11-30 06:42:52

Tags: python beautifulsoup

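I'm trying to retrieve all categories and subcategories from a retail website. Once I'm inside a category, I can use BeautifulSoup to extract every product in it. However, I'm struggling with the loop over the categories. I'm using https://www.uniqlo.com/us/en/women as a test site.

How can I loop through every category and subcategory listed on the left side of the site? The problem is that you have to click a category before the site shows all of its subcategories. I want to extract all products in each category/subcategory into a CSV file. This is what I have so far: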

import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myurl = 'https://www.uniqlo.com/us/en/women/'
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
filename = "products.csv"
product_list = []

# Find all li with class: grid-tile
containers = page_soup.findAll("li", {"class": lambda L: L and L.startswith('grid-tile')})

for container in containers:
    product_container = container.findAll("div", {"class": "product-swatches"})
    if not product_container:
        continue  # skip tiles that have no swatch list
    product_names = product_container[0].findAll("li")

    product_mod_name = ''
    for item in product_names:
        try:
            # Keep only the text before the first comma of the image's alt text
            product_name = item.a.img.get("alt")
            product_mod_name = product_name.split(',')[0].lstrip()
            print(product_mod_name)
        except AttributeError:
            # Swatch without a link, image, or alt text
            product_name = ''

    # One row per product tile
    product = [product_mod_name]
    print(product)
    product_list.append(product)

# Open the file once, in write mode, and let the with block close it
with open(filename, 'w', newline='') as file:
    writer = csv.writer(file)
    for row in product_list:
        writer.writerow(row)

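1 answer:

Answer 0 (score: 0)

You can try this script. It walks through the different product categories and subcategories and parses each product's title and price. Several products share the same name and differ only in color, so don't treat them as duplicates. I've written the script very compactly, so expand it to whatever level you're comfortable with: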

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.uniqlo.com/us/en/women')
soup = BeautifulSoup(res.text, "lxml")

# Level 1: category links on the landing page
for items in soup.select("#category-level-1 .refinement-link"):
    page = requests.get(items['href'])
    broth = BeautifulSoup(page.text, "lxml")

    # Level 2: subcategory links on that category's page
    for links in broth.select("#category-level-2 .refinement-link"):
        req = requests.get(links['href'])
        sauce = BeautifulSoup(req.text, "lxml")

        # One tile per product; join every price span found in the tile
        for data in sauce.select(".product-tile-info"):
            title = data.select(".name-link")[0].text
            price = ' '.join([item.text for item in data.select(".product-pricing span")])
            print(title.strip(), price.strip())
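
A side note on the links: the loops pass each href straight to requests.get, which works only if the href is an absolute URL. If the site ever emits relative links (an assumption, not something confirmed here), requests raises a MissingSchema error. The sketch below resolves each href against the page URL with urllib.parse.urljoin and skips links already seen. The name absolute_links is a hypothetical helper, not part of the script.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop repeats, keeping first-seen order.

    Hypothetical helper for illustration only.
    """
    seen = set()
    links = []
    for href in hrefs:
        # urljoin leaves an already-absolute href unchanged
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

In the outer loop this might be used as: for url in absolute_links(res.url, [a['href'] for a in soup.select("#category-level-1 .refinement-link")]), followed by requests.get(url).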

The script prints results like these:

WOMEN CASHMERE CREW NECK SWEATER $79.90
Women Extra Fine Merino Crew Neck Sweater $29.90 $19.90
WOMEN KAWS X PEANUTS LONG-SLEEVE HOODED SWEATSHIRT $19.90
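
Since the question asks for a CSV file rather than printed output, here is a minimal sketch of that last step. It assumes the titles and prices have been collected into a list of (title, price) pairs. The function name write_products and the column names are made up for illustration.

```python
import csv


def write_products(path, rows):
    """Write (title, price) pairs to a CSV file with a header row.

    Hypothetical helper; the column names are assumptions.
    """
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "price"])
        writer.writerows(rows)


if __name__ == "__main__":
    # Sample rows copied from the printed results above
    rows = [
        ("WOMEN CASHMERE CREW NECK SWEATER", "$79.90"),
        ("Women Extra Fine Merino Crew Neck Sweater", "$29.90 $19.90"),
    ]
    write_products("products.csv", rows)
```

To feed it from the script, the print call in the innermost loop could become rows.append((title.strip(), price.strip())), with rows = [] defined before the loops and write_products called once after them.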