Question

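I want the script to scrape all items from every page and append them to a CSV file, but there are two problems:

1) When I run the script, it only visits a single page (the last one, 64). It does not crawl from page 1 to page 64.

2) When the script writes data to the CSV file, it does not append new rows; it rewrites the whole CSV file.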

import csv
# YouTube Video: https://www.youtube.com/watch?v=zjo9yFHoUl8
from selenium import webdriver

MAX_PAGE_NUM = 67
MAX_PAGE_DIG = 1

driver = webdriver.Chrome('/Users/reezalaq/PycharmProjects/untitled2/venv/driver/chromedriver')

with open('result.csv', 'w') as f:
    f.write("Product Name, Sale Price, Discount, Old Price \n")

for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)

url = "https://www.blibli.com/jual/batik-pria?s=batik+pria&c=BA-1000013&i=" + page_num


driver.get(url)


buyers = driver.find_elements_by_xpath("//div[@class='product-title']")
prices = driver.find_elements_by_xpath("//span[@class='new-price-text']")
discount = driver.find_elements_by_xpath("//div[@class='discount']")
oldprice = driver.find_elements_by_xpath("//span[@class='old-price-text']")


num_page_items = len(buyers)
with open('result.csv', 'a') as f:
    for c in range(num_page_items):
        f.write(buyers[c].text + ' , ' + prices[c].text + ' , ' + discount[c].text + ' , ' + oldprice[c].text + '\n')


driver.close()

Answer 1

To add new rows to a file, you must open it with the "a" mode instead of "w".

with open('result.csv', 'a') as f:
    f.write("Product Name, Sale Price, Discount, Old Price \n")

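Definition of the "w" mode:

Opens a file for writing only. Overwrites the file if it exists. If the file does not exist, creates a new file for writing.

Definition of the "a" mode:

Opens a file for appending. The file pointer is at the end of the file if the file exists; that is, the file is in append mode. If the file does not exist, it creates a new file for writing.

Definition of the "ab" mode:

Opens a file for appending in binary format. The file pointer is at the end of the file if the file exists; that is, the file is in append mode. If the file does not exist, it creates a new file for writing.

So, to add new rows, you must use a mode that contains "a" (the append option).

The definitions are taken from this answer.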
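The difference between the two modes can be checked with a minimal sketch like the one below. It writes to a temporary file (an assumption made here so the demo does not touch a real result.csv), and the row contents are placeholders.

```python
import os
import tempfile

# Temporary location so the demo never overwrites a real result.csv.
path = os.path.join(tempfile.mkdtemp(), "result.csv")

# "w" creates the file (or truncates it) and writes from the start.
with open(path, "w") as f:
    f.write("header\n")

# "a" keeps the existing content and writes at the end of the file.
with open(path, "a") as f:
    f.write("row 1\n")
with open(path, "a") as f:
    f.write("row 2\n")

with open(path) as f:
    after_append = f.read().splitlines()

# Opening with "w" again discards everything written so far.
with open(path, "w") as f:
    f.write("header\n")

with open(path) as f:
    after_rewrite = f.read().splitlines()

print(after_append)
print(after_rewrite)
```

Note that the script in the question already opens the file with "a" when writing rows; the "w" call there runs only once, for the header.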

Answer 2

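The main problem is indentation: only the page_num line is inside the for loop, so everything after it runs once, after the loop has finished, using the last page number.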
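The effect of the indentation can be reproduced without a browser. The sketch below only collects the URLs that would be visited; the function names and the reduced MAX_PAGE_NUM are illustrative assumptions, not part of the original script.

```python
BASE_URL = "https://www.blibli.com/jual/batik-pria?s=batik+pria&c=BA-1000013&i="
MAX_PAGE_NUM = 3  # kept small for the demonstration


def urls_with_body_outside_loop():
    visited = []
    for i in range(1, MAX_PAGE_NUM + 1):
        page_num = str(i)
    # Not indented under the loop: runs once, with the last page_num.
    visited.append(BASE_URL + page_num)
    return visited


def urls_with_body_inside_loop():
    visited = []
    for i in range(1, MAX_PAGE_NUM + 1):
        page_num = str(i)
        # Indented under the loop: runs once per page.
        visited.append(BASE_URL + page_num)
    return visited


print(len(urls_with_body_outside_loop()))
print(len(urls_with_body_inside_loop()))
```

In the real script, driver.get(url), the element lookups, and the row writing all need to sit at the inner indentation level.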

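Another problem I see is that you collect all the titles together, all the old prices together, and so on.

Because of this, it is hard to tell which price belongs to which item, for example when an item is missing some data.

To fix this, I collect all the items on a page into a variable named products.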
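The misalignment is easier to see on a small example. The sketch below uses a hypothetical, simplified fragment of markup and the standard library's xml.etree.ElementTree (chosen only so that it runs without Selenium or BeautifulSoup); the helper text_or_default is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second product has no discount.
PAGE = """
<div>
  <a class="single-product" href="/p/1">
    <div class="product-title">Batik A</div>
    <div class="discount">50%</div>
  </a>
  <a class="single-product" href="/p/2">
    <div class="product-title">Batik B</div>
  </a>
  <a class="single-product" href="/p/3">
    <div class="product-title">Batik C</div>
    <div class="discount">20%</div>
  </a>
</div>
""".strip()

root = ET.fromstring(PAGE)

# Page-wide lists: three titles but only two discounts, so pairing them
# by index would attach the 20% discount to Batik B.
titles = [e.text for e in root.findall(".//div[@class='product-title']")]
discounts = [e.text for e in root.findall(".//div[@class='discount']")]


def text_or_default(parent, path, default="Not available"):
    """Return the text of the first match under parent, or a default."""
    node = parent.find(path)
    return node.text.strip() if node is not None and node.text else default


# Per-product lookup: each field is searched inside its own product element.
rows = [
    (
        text_or_default(product, "div[@class='product-title']"),
        text_or_default(product, "div[@class='discount']"),
    )
    for product in root.findall(".//a[@class='single-product']")
]

print(titles, discounts)
print(rows)
```

Searching inside each product element keeps every row complete, with a placeholder where a field is absent.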

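Regarding the "append" versus "write" mode for the CSV: in my implementation I first check whether the result.csv file exists.

Then there are two cases:

result.csv does not exist: I create it and write the header.
result.csv already exists: the header is already in place, so I can append new rows while looping.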
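The two cases above can be sketched as a single helper. This version is an assumption-laden variant rather than the code used in this answer: it relies on the standard csv module (which quotes fields containing commas), writes to a temporary path, and uses made-up product rows; append_rows and HEADER are illustrative names.

```python
import csv
import os
import tempfile

HEADER = ["Product Name", "Sale Price", "Discount", "Old Price", "Link"]


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new."""
    is_new_file = not os.path.isfile(path)
    # newline="" lets the csv module manage line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(HEADER)
        writer.writerows(rows)


# Demo with placeholder data in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "result.csv")
append_rows(demo_path, [["Batik A, long sleeve", "Rp 100", "50%", "Rp 200", "/p/1"]])
append_rows(demo_path, [["Batik B", "Rp 150", "Not available", "Not available", "/p/2"]])

with open(demo_path, newline="", encoding="utf-8") as f:
    saved = list(csv.reader(f))

print(saved)
```

Calling the helper twice yields one header followed by both rows, and the comma inside the first product name stays within a single field.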

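To extract the data easily, I used BeautifulSoup (easy to install with pip).

There are still some challenges ahead because the data on this web page is inconsistent, but the following example should be enough to get you going.

Note that the "break" in the code stops the scraping after page 1.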

import csv
# YouTube Video: https://www.youtube.com/watch?v=zjo9yFHoUl8
from selenium import webdriver
from bs4 import BeautifulSoup
import os.path

MAX_PAGE_NUM = 67
MAX_PAGE_DIG = 1

driver = webdriver.Chrome('/Users/reezalaq/PycharmProjects/untitled2/venv/driver/chromedriver')
#driver = webdriver.Chrome()

def write_csv_header():
    with open('result.csv', 'w') as f:
        # Five columns, matching the five values written by write_csv_row.
        f.write("Product Name, Sale Price, Discount, Old Price, Link \n")

def write_csv_row(product_title, product_new_price, product_discount, product_old_price, product_link):
    with open('result.csv', 'a') as f:
        f.write(product_title + ' , ' + product_new_price + ' , ' + product_discount + ' , ' + product_old_price + ' , ' + product_link + '\n')

# Write the header only when result.csv does not exist yet.
if not os.path.isfile('result.csv'):
    write_csv_header()

for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    url = "https://www.blibli.com/jual/batik-pria?s=batik+pria&c=BA-1000013&i=" + page_num
    driver.get(url)
    source = driver.page_source
    soup = BeautifulSoup(source, 'html.parser')
    products = soup.findAll("a", {"class": "single-product"})
    for product in products:
        # find() returns None when a tag is missing, so .text raises AttributeError.
        try:
            product_title = product.find("div", {"class": "product-title"}).text.strip()
        except AttributeError:
            product_title = "Not available"
        try:
            product_new_price = product.find("span", {"class": "new-price-text"}).text.strip()
        except AttributeError:
            product_new_price = "Not available"
        try:
            product_old_price = product.find("span", {"class": "old-price-text"}).text.strip()
        except AttributeError:
            product_old_price = "Not available"
        try:
            product_discount = product.find("div", {"class": "discount"}).text.strip()
        except AttributeError:
            product_discount = "Not available"
        # A missing href attribute raises KeyError.
        try:
            product_link = product['href']
        except KeyError:
            product_link = "Not available"
        write_csv_row(product_title, product_new_price, product_discount, product_old_price, product_link)
    break # this stops the parsing at the 1st page. I think it is a good idea to check data and fix all discrepancies before proceeding

driver.close()

Selenium Python - Unable to crawl from one page to another
