How do I web-scrape all products from a website with multiple categories and subcategories?

Asked: 2019-01-29 03:19:15

Tags: python selenium selenium-chromedriver

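My task is to web-scrape every item from the Officeworks website using Selenium and a few other Python packages.

The problem is that I have no real grounding in Python and don't know how to scrape items across multiple categories. I only know how to extract items from a single category (one that I specify), but I want to get every item in the whole store.

My code is shown below.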

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
import selenium.webdriver.support.ui as ui
import selenium.webdriver.support.expected_conditions as EC
import os
import time
import pandas as pd
from bs4 import BeautifulSoup
import requests

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')

os.chdir("C:/Users/cameron.deng/Desktop/Scripts/python scraping")
cwd = os.getcwd()
main_dir = os.path.abspath(os.path.join(cwd, os.pardir))
print('Main Directory:', main_dir)

chromedriver = os.path.abspath(main_dir) + "/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

# Selenium 3 call style; Selenium 4 passes the driver path through a Service object instead.
browser = webdriver.Chrome(options=options, executable_path=chromedriver)
browser.get("https://www.officeworks.com.au/shop/officeworks/c/education/book/literacy-books")



# Scroll to the bottom repeatedly until the page height stops growing,
# so that lazily loaded products are present in the page source.
scroll_js = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"
lenOfPage = browser.execute_script(scroll_js)
while True:
    lastCount = lenOfPage
    time.sleep(10)
    lenOfPage = browser.execute_script(scroll_js)
    if lastCount == lenOfPage:
        break

page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')



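rows = []
next_selector = '#paginationViewFullWidth > nav > ul > li.ow-pagination__next > a'
while True:
    time.sleep(10)
    # Re-parse on every iteration; parsing only once would re-read the first page forever.
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    for sec in soup.find_all('div', attrs={'id': 'productList'}):
        for div in sec.find_all('div', attrs={'class': 'ow-product-tile product'}):
            rows.append({'itemId': div["id"],
                         'itemCatId': div["data-catentryid"],
                         'itemTitle': div.find_all("img")[0]["alt"]})
    # find_elements returns an empty list when there is no "next" link, i.e. on the last page.
    next_links = browser.find_elements(By.CSS_SELECTOR, next_selector)
    if not next_links:
        break
    next_links[0].click()

# Build the DataFrame once, after the loop, so rows from earlier pages are not discarded.
res = pd.DataFrame(rows)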

The result is a DataFrame containing the item information.
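
One possible direction for going from a single category to the whole store, as a rough sketch rather than a tested solution: collect the category links from a page that lists them, then run the same per-category scrape on each link. The sketch below uses only the standard library so it runs without a browser. The `/shop/officeworks/c/` marker is an assumption inferred from the one category URL above, `scrape_category` is a hypothetical callable standing in for the scroll/parse/paginate loop, and the sample HTML is invented for the demo.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.officeworks.com.au"
# Assumption: category pages share this path prefix, as in the question's URL.
CATEGORY_MARKER = "/shop/officeworks/c/"


class CategoryLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose path contains CATEGORY_MARKER."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and CATEGORY_MARKER in href:
            url = urljoin(BASE_URL, href)
            if url not in self.links:  # keep first-seen order, drop duplicates
                self.links.append(url)


def find_category_links(html):
    """Return absolute category URLs found in an HTML string."""
    parser = CategoryLinkParser()
    parser.feed(html)
    return parser.links


def scrape_all(start_html, scrape_category):
    """Call scrape_category(url) for each category link found in start_html.

    scrape_category is hypothetical: it should return a list of row dicts
    for one category (for example, the rows gathered by the pagination loop).
    """
    rows = []
    for url in find_category_links(start_html):
        for row in scrape_category(url):
            rows.append({**row, "categoryUrl": url})
    return rows


if __name__ == "__main__":
    # Invented sample markup; a real run would pass browser.page_source instead.
    sample = """
    <a href="/shop/officeworks/c/education/book/literacy-books">Literacy</a>
    <a href="/shop/officeworks/c/education/book/literacy-books">Duplicate</a>
    <a href="/help">Help</a>
    """
    fake_scraper = lambda url: [{"itemId": "demo"}]  # stand-in for the Selenium scrape
    for row in scrape_all(sample, fake_scraper):
        print(row)
```

In a real run, `start_html` would be `browser.page_source` from whichever page lists the categories, and `scrape_category` would call `browser.get(url)` before running the existing loop. The sketch finds links only one level deep; subcategory pages could be passed through `find_category_links` again if the site nests them.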

0 answers:

No answers