I'm new to programming and am trying to build my first small web scraper in Python.
Goal: scrape a product listing page - grab the brand name, article name, original price and new price - save them in a CSV file.
Status: I managed to get the brand names, article names and original prices and put them into lists in the correct order (e.g. 10 products). Since every item has a brand name, description and price, my code writes them into the csv in the correct order.
Code:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = 'https://www.zalando.de/rucksaecke-herren/'
#open the connection, grab the page, save it in page_html and close the connection
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()
#Datatype, html parser
page_soup = soup(page_html, "html.parser")
#grabbing information
brand_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn"})
articale_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn"})
original_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_originalPrice-2Oy4G"})
new_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_promotionalPrice-3GRE7"})
#opening a csv file and printing its header
filename = "XXX.csv"
file = open(filename, "w")
headers = "BRAND, ARTICALE NAME, OLD PRICE, NEW PRICE\n"
file.write(headers)
#How many brands on page?
products_on_page = len(brand_Names)
#Looping through all brands, articles, prices and writing the text into the CSV
for i in range(products_on_page):
    brand = brand_Names[i].text
    articale_Name = articale_Names[i].text
    price = original_Prices[i].text
    new_Price = new_Prices[i].text
    file.write(brand + "," + articale_Name + "," + price.replace(",",".") + "," + new_Price.replace(",",".") + "\n")
#closing CSV
file.close()
Problem: I'm struggling to get the discounted prices into my csv in the right places. Not every item has a discount, and I currently see two problems with my code:
I use .findAll to find the information on the website - since fewer products have a discount, my new_Prices contains fewer prices (e.g. 3 prices for 10 products). If I managed to add them to the list, I assume they would show up in the first 3 rows. How can I make sure the new_Prices are matched to the correct products?
I get an "IndexError: list index out of range", which I assume is because I loop through 10 products, but for new_Prices I reach the end sooner than with my other lists? Does that make sense, and is my assumption correct?
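Here is a minimal sketch of what I think is happening (hypothetical lists, not my real scraped data):

# 10 products, but only 3 of them have a promotional price
original_Prices = ["10,00"] * 10
new_Prices = ["8,00"] * 3
for i in range(len(original_Prices)):
    print(original_Prices[i])
    print(new_Prices[i])  # raises IndexError once i reaches 3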
I'd really appreciate any help.
Thanks,
Thorsten
Answer 0 (score: 0)
Since some items don't have a 'div.z-nvg-cognac_promotionalPrice-3GRE7' tag, you can't reliably use list indices. Instead, you can select all the container tags ('div.z-nvg-cognac_infoContainer-MvytX') and use find to select the tag you want on each item.
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import csv
my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
client = urlopen(my_url)
page_html = client.read().decode(errors='ignore')
page_soup = soup(page_html, "html.parser")
headers = ["BRAND", "ARTICALE NAME", "OLD PRICE", "NEW PRICE"]
filename = "test.csv"
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    # select one container tag per product, then search within each container
    items = page_soup.find_all(class_='z-nvg-cognac_infoContainer-MvytX')
    for item in items:
        brand_names = item.find(class_="z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn").text
        articale_names = item.find(class_="z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn").text
        original_prices = item.find(class_="z-nvg-cognac_originalPrice-2Oy4G").text
        new_prices = item.find(class_="z-nvg-cognac_promotionalPrice-3GRE7")
        # not every product has a promotional price; leave None if the tag is missing
        if new_prices is not None:
            new_prices = new_prices.text
        writer.writerow([brand_names, articale_names, original_prices, new_prices])
If you want to get more than 24 items per page, you'll have to use a client that runs js, e.g. selenium.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import csv
my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
driver = webdriver.Firefox()
driver.get(my_url)
page_html = driver.page_source
driver.quit()
page_soup = soup(page_html, "html.parser")
...
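Since the product grid is rendered by js, page_source can be read before the items exist; a minimal sketch with an explicit wait, assuming the container class 'z-nvg-cognac_infoContainer-MvytX' from above is the one to wait for:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
driver = webdriver.Firefox()
driver.get(my_url)
# wait up to 10 seconds for at least one product container to be present
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'z-nvg-cognac_infoContainer-MvytX'))
)
page_html = driver.page_source
driver.quit()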
Footnotes:
The naming convention for functions and variables is lowercase with underscores.
For reading or writing csv files, it's best to use the csv lib.
When handling files, you can use the with statement.
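A minimal sketch combining the three footnotes (hypothetical file and column names):

import csv

# lowercase_with_underscores names; the with statement closes the file automatically
rows = [["brand a", "backpack", "49,95"], ["brand b", "bag", "19,95"]]
with open("products.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)  # the csv lib handles quoting and delimiters
    writer.writerow(["BRAND", "ARTICLE", "PRICE"])
    writer.writerows(rows)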