I'm a complete beginner and I've run into a problem with web scraping. I'm able to scrape the image, title, and price, and I can successfully get the element at index [0]. But whenever I try to run a loop, or hard-code an index greater than 0, it says the index is out of range, and none of the other <li>
tags get scraped. Is there another way around this? Also, I incorporated Selenium so that the whole page loads. Any help would be greatly appreciated.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://ca.octobersveryown.com/collections/all")

scrolls = 22
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(0.2)
    if scrolls < 0:
        break

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

bodies = soup.find(id='content')
clothing = bodies.find_all('ul', class_='grid--full product-grid-items')

for span_tag in soup.findAll(class_='visually-hidden'):
    span_tag.replace_with('')

print(clothing[0].find('img')['src'])
print(clothing[0].find(class_='product-title').get_text())
print(clothing[0].find(class_='grid-price-money').get_text())

time.sleep(8)
driver.quit()
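A likely cause of the IndexError: find_all('ul', class_='grid--full product-grid-items') returns the <ul> grid container(s), and the page typically has only one such container, so clothing has length 1 and clothing[1] is out of range. The individual products are the <li> children inside that one <ul>. A minimal self-contained sketch (with hypothetical markup mimicking the grid) that illustrates this and iterates the items instead of indexing the container:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only: one <ul> grid
# container holding many <li> products, as on the collection page.
html = """
<div id="content">
  <ul class="grid--full product-grid-items">
    <li><p class="product-title">ITEM ONE</p></li>
    <li><p class="product-title">ITEM TWO</p></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Matches the <ul> container(s), not the products.
uls = soup.find(id='content').find_all('ul', class_='grid--full product-grid-items')
print(len(uls))  # 1 -- so uls[1] would raise IndexError

# Iterate the <li> items inside the grid instead of indexing the container.
items = uls[0].find_all('li')
for item in items:
    print(item.find(class_='product-title').get_text())
```

With this change, each product is visited once per loop iteration, and no index ever exceeds the number of <li> tags actually present.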
Answer 0 (score: 0)
If you only want to use BeautifulSoup
without Selenium, you can simulate the Ajax requests the page makes. For example:
import requests
from bs4 import BeautifulSoup

url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'

page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    li = soup.find_all('li', recursive=False)
    if not li:
        break

    for l in li:
        print(l.select_one('p a').get_text(strip=True))
        print('https:' + l.img['src'])
        print(l.select_one('.grid-price').get_text(strip=True, separator=' '))
        print('-' * 80)

    page += 1
Prints:
LIGHTWEIGHT RAIN SHELL
https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-dark-red-1_large.jpg?v=1598583974
£178.00
--------------------------------------------------------------------------------
LIGHTWEIGHT RAIN SHELL
https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-black-1_large.jpg?v=1598583976
£178.00
--------------------------------------------------------------------------------
ALL COUNTRY HOODIE
https://cdn.shopify.com/s/files/1/1605/0171/products/all-country-hoodie-white-1_large.jpg?v=1598583978
£148.00
--------------------------------------------------------------------------------
...and so on.
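A note on the recursive=False argument used above: the pagination-ajax view returns a bare HTML fragment whose <li> elements sit at the top level of the document, and recursive=False restricts find_all to direct children only, so any <li> nested deeper (e.g. inside a product's own markup) is skipped. A small sketch with hypothetical fragment markup:

```python
from bs4 import BeautifulSoup

# Hypothetical Ajax-style fragment: bare <li> elements at the top
# level, one of which contains a nested inner <li>.
fragment = '<li>outer one<ul><li>inner</li></ul></li><li>outer two</li>'
soup = BeautifulSoup(fragment, 'html.parser')

# recursive=False matches only direct children of the document root,
# so the nested <li> is not returned.
top = soup.find_all('li', recursive=False)
all_li = soup.find_all('li')
print(len(top), len(all_li))  # 2 3
```

Dropping recursive=False on the real fragment would also match nested list items, producing duplicate or spurious rows.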
EDIT (save as CSV):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'

page = 1
all_data = []
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    li = soup.find_all('li', recursive=False)
    if not li:
        break

    for l in li:
        d = {'name': l.select_one('p a').get_text(strip=True),
             'link': 'https:' + l.img['src'],
             'price': l.select_one('.grid-price').get_text(strip=True, separator=' ')}
        all_data.append(d)
        print(d)
        print('-' * 80)

    page += 1

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Prints:
name ... price
0 LIGHTWEIGHT RAIN SHELL ... £178.00
1 LIGHTWEIGHT RAIN SHELL ... £178.00
2 ALL COUNTRY HOODIE ... £148.00
3 ALL COUNTRY HOODIE ... £148.00
4 ALL COUNTRY HOODIE ... £148.00
.. ... ... ...
271 OVO ESSENTIALS LONGSLEEVE T-SHIRT ... £58.00
272 OVO ESSENTIALS POLO ... £68.00
273 OVO ESSENTIALS T-SHIRT ... £48.00
274 OVO ESSENTIALS CAP ... £38.00
275 POM POM COTTON TWILL CAP ... £32.00 SOLD OUT
[276 rows x 3 columns]
And saves data.csv
(screenshot from LibreOffice):