Scraping a website with Beautiful Soup, but it won't scrape every <li> tag

Time: 2020-08-29 20:29:35

Tags: python html selenium web-scraping beautifulsoup

I'm a complete beginner and I've run into a problem with web scraping. I am able to scrape the picture, the title and the price, and indexing at [0] works fine. But whenever I try to run a loop, or hard-code an index greater than 0, it says the index is out of range, and no other <li> tag gets scraped. Is there another way around this? Also, I incorporated Selenium so that the whole page gets loaded. Any help would be greatly appreciated.

from selenium import webdriver
from bs4 import BeautifulSoup
import time


PATH = "C:\Program Files (x86)\chromedriver.exe"

driver = webdriver.Chrome(PATH)

driver.get("https://ca.octobersveryown.com/collections/all")

# scroll to the bottom repeatedly so all lazily-loaded products end up in the page
scrolls = 22
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(0.2)
    if scrolls < 0:
        break

html = driver.page_source

soup = BeautifulSoup(html, 'html.parser')

bodies = soup.find(id='content')

clothing = bodies.find_all('ul', class_='grid--full product-grid-items')

# remove the visually-hidden helper spans so get_text() returns clean text
for span_tag in soup.find_all(class_='visually-hidden'):
    span_tag.replace_with('')

print(clothing[0].find('img')['src'])
print(clothing[0].find(class_='product-title').get_text())
print(clothing[0].find(class_='grid-price-money').get_text())

time.sleep(8)

driver.quit()
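For what it's worth, the "index out of range" error most likely happens because find_all('ul', class_='grid--full product-grid-items') matches only a single <ul> on the page, so clothing has exactly one element; the individual products are the <li> children of that one list. A minimal sketch of iterating over them instead, assuming the page keeps this single-<ul> layout:

for item in clothing[0].find_all('li'):
    img = item.find('img')
    title = item.find(class_='product-title')
    price = item.find(class_='grid-price-money')
    if img and title and price:  # skip any <li> that is not a product tile
        print(img['src'])
        print(title.get_text())
        print(price.get_text())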

1 Answer:

Answer 0 (score: 0)

If you want to use only BeautifulSoup without Selenium, you can simulate the Ajax requests that the page makes. For example:

import requests
from bs4 import BeautifulSoup


# this pagination endpoint returns an HTML fragment containing only the product <li> items
url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'

page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    # the top-level elements of the fragment are the product <li> items
    li = soup.find_all('li', recursive=False)
    if not li:  # an empty fragment means we are past the last page
        break

    for l in li:
        print(l.select_one('p a').get_text(strip=True))
        print('https:' + l.img['src'])  # src is protocol-relative, so prepend the scheme
        print(l.select_one('.grid-price').get_text(strip=True, separator=' '))
        print('-' * 80)

    page += 1

Prints:

LIGHTWEIGHT RAIN SHELL
https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-dark-red-1_large.jpg?v=1598583974
£178.00
--------------------------------------------------------------------------------
LIGHTWEIGHT RAIN SHELL
https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-black-1_large.jpg?v=1598583976
£178.00
--------------------------------------------------------------------------------
ALL COUNTRY HOODIE
https://cdn.shopify.com/s/files/1/1605/0171/products/all-country-hoodie-white-1_large.jpg?v=1598583978
£148.00
--------------------------------------------------------------------------------

...and so on.
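As a side note, the loop above can be made a little more robust by failing fast on HTTP errors and pausing briefly between requests; a minimal sketch, assuming the same endpoint and markup as above:

import time
import requests
from bs4 import BeautifulSoup

url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'

page = 1
while True:
    resp = requests.get(url.format(page=page))
    resp.raise_for_status()  # abort on an HTTP error instead of parsing an error page
    soup = BeautifulSoup(resp.content, 'html.parser')

    li = soup.find_all('li', recursive=False)
    if not li:  # an empty fragment marks the last page
        break

    # ...same per-item extraction as above...

    page += 1
    time.sleep(0.5)  # small delay to be polite to the server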

EDIT (saving to CSV):

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'

page = 1
all_data = []
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    li = soup.find_all('li', recursive=False)
    if not li:
        break

    for l in li:
        # collect each product as a dict so the list converts cleanly to a DataFrame
        d = {'name': l.select_one('p a').get_text(strip=True),
             'link': 'https:' + l.img['src'],
             'price': l.select_one('.grid-price').get_text(strip=True, separator=' ')}
        all_data.append(d)
        print(d)
        print('-' * 80)

    page += 1

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)

Prints:

                                  name  ...            price
0               LIGHTWEIGHT RAIN SHELL  ...          £178.00
1               LIGHTWEIGHT RAIN SHELL  ...          £178.00
2                   ALL COUNTRY HOODIE  ...          £148.00
3                   ALL COUNTRY HOODIE  ...          £148.00
4                   ALL COUNTRY HOODIE  ...          £148.00
..                                 ...  ...              ...
271  OVO ESSENTIALS LONGSLEEVE T-SHIRT  ...           £58.00
272                OVO ESSENTIALS POLO  ...           £68.00
273             OVO ESSENTIALS T-SHIRT  ...           £48.00
274                 OVO ESSENTIALS CAP  ...           £38.00
275           POM POM COTTON TWILL CAP  ...  £32.00 SOLD OUT

[276 rows x 3 columns]

And saves data.csv (screenshot from LibreOffice):

[screenshot of data.csv opened in LibreOffice]
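One small note: by default pandas also writes the DataFrame's integer index as the first column of the CSV. If that column is not wanted, pass index=False:

df.to_csv('data.csv', index=False)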