My task is to web-scrape every product from a site called Officeworks, using Selenium and a few other packages in Python.
The problem is that I have no real grounding in Python and don't know how to scrape products across multiple categories. I only know how to pull items from a single category that I specify, and I want to get every item in the whole store.
My code is shown below.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
import selenium.webdriver.support.ui as ui
import selenium.webdriver.support.expected_conditions as EC
import os
import time
import pandas as pd
from bs4 import BeautifulSoup
import requests
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
os.chdir("C:/Users/cameron.deng/Desktop/Scripts/python scraping")
cwd = os.getcwd()
main_dir = os.path.abspath(os.path.join(cwd, os.pardir))
print('Main Directory:', main_dir)
chromedriver = os.path.abspath(main_dir) + "/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
browser = webdriver.Chrome(chrome_options=options, executable_path=chromedriver)
browser.get("https://www.officeworks.com.au/shop/officeworks/c/education/book/literacy-books")
# Scroll to the bottom repeatedly until the page height stops growing,
# so that any lazily loaded products are rendered before parsing.
lenOfPage = browser.execute_script(
    "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
match = False
while not match:
    lastCount = lenOfPage
    time.sleep(10)
    lenOfPage = browser.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    if lastCount == lenOfPage:
        match = True
# Walk through the paginated product list, re-parsing the page source with
# BeautifulSoup on every page and appending each product's details to a DataFrame.
res = pd.DataFrame()
while True:
    time.sleep(10)
    page = browser.page_source
    soup = BeautifulSoup(page, 'html.parser')
    for sec in soup.findAll('div', attrs={'id': 'productList'}):
        for div in sec.findAll('div', attrs={'class': 'ow-product-tile product'}):
            pId = div["id"]
            pCatId = div["data-catentryid"]
            pTitle = div.findAll("img")[0]["alt"]
            temp = pd.DataFrame({'itemId': [pId], 'itemCatId': [pCatId], 'itemTitle': [pTitle]})
            res = res.append(temp)
    # Move on to the next page of results; stop once there is no "next" link left.
    try:
        browser.find_element_by_css_selector(
            '#paginationViewFullWidth > nav > ul > li.ow-pagination__next > a').click()
    except Exception:
        break
The result is the product information shown in a DataFrame.
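For scraping every category instead of just one, the rough idea I have (only a sketch, not tested against the live site) is to wrap the per-category logic above in a function and loop it over a list of category URLs. The category_urls list below is a hypothetical placeholder that I would have to fill in from the shop's navigation menu or sitemap, and scrape_category is just a name I made up for that wrapper:

def scrape_category(browser, url):
    # Scrape every product tile from one category page, following the pagination
    # the same way as above, and return the rows as a DataFrame.
    browser.get(url)
    # (the scroll-to-bottom loop from above could be repeated here if a category lazy-loads)
    rows = []
    while True:
        time.sleep(10)
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        for sec in soup.findAll('div', attrs={'id': 'productList'}):
            for div in sec.findAll('div', attrs={'class': 'ow-product-tile product'}):
                rows.append({'itemId': div["id"],
                             'itemCatId': div["data-catentryid"],
                             'itemTitle': div.findAll("img")[0]["alt"]})
        try:
            browser.find_element_by_css_selector(
                '#paginationViewFullWidth > nav > ul > li.ow-pagination__next > a').click()
        except Exception:
            break  # no "next" link left, so this was the last page
    return pd.DataFrame(rows)

# Hypothetical list of category URLs; the real ones would have to come from the site.
category_urls = [
    "https://www.officeworks.com.au/shop/officeworks/c/education/book/literacy-books",
    # ... other category URLs ...
]

all_items = pd.concat([scrape_category(browser, url) for url in category_urls],
                      ignore_index=True)

I'm not sure whether this is the right way to collect the category URLs in the first place, so any pointers on that part would also help.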