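I'm trying to scrape product names, prices, and images from Shopee, but I can't seem to extract the images. Is it because of the HTML? I can't find the image class in dataImg.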
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
driver =webdriver.Chrome('chromedriver')
products=[]
prices=[]
images=[]
driver.get('https://shopee.co.id/search?keyword=laptop')
content=driver.page_source
soup=BeautifulSoup(content)
soup
for link in soup.find_all('div',class_="_3EfFTx"):
    print('test')
    print(link)
for link in soup.find_all('div',class_="_3EfFTx"):
    #print(link)
    dataImg=link.find('img',class_="_1T9dHf V1Fpl5")
    print(dataImg)
    name=link.find('div',class_="_1Sxpvs")
    #print(name.get_text())
    price=link.find('div',class_="QmqjGn")
    #print(price.get_text())
    if dataImg is not None:
        products.append(name.get_text())
        prices.append(price.get_text())
        images.append(dataImg['src'])
df=pd.DataFrame({'Product Name':products,'Price':prices,'Images':images})
df
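Answer 0 (score: 0)
The site loads its images with JavaScript. To work around that, drive the page with Selenium and add a short delay before reading it. The code below prints the src of each image: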
from selenium import webdriver
from time import sleep
products=[]
prices=[]
images=[]
driver = webdriver.Chrome(r'F:\Sonstiges\chromedriver\chromedriver.exe')
driver.get('https://shopee.co.id/search?keyword=laptop')
sleep(8)
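# Note: newer Selenium 4 releases removed find_elements_by_class_name;
# there, use driver.find_elements(By.CLASS_NAME, '_1T9dHf') instead.
imgs = driver.find_elements_by_class_name('_1T9dHf')
for img in imgs:
    img_url = img.get_attribute("src")
    # Skip images whose src has not been filled in yet
    if img_url:
        print(img_url)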
driver.quit()
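To download the images themselves, request each of the retrieved URLs (the original post linked to instructions for this). If you are using Beautiful Soup only because it runs in the background, Selenium can also run headless, that is, in the background (the original post linked to a solution for this as well).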
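As a rough illustration of the download step, here is a minimal sketch that uses only the standard library. The helper names `filename_from_url` and `download_image`, the `.jpg` fallback extension, and the `images` output directory are assumptions made for this example; they are not part of the answer above.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default_ext=".jpg"):
    """Derive a local file name from the last path segment of an image URL."""
    name = os.path.basename(urlparse(url).path) or "image"
    # Assumption: some image URLs end in a bare ID with no extension,
    # so append a default extension when none is present.
    if not os.path.splitext(name)[1]:
        name += default_ext
    return name


def download_image(url, out_dir="images"):
    """Fetch one image URL and save it under out_dir; return the saved path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename_from_url(url))
    urllib.request.urlretrieve(url, path)
    return path
```

With the URLs collected in a list, something like `for url in urls: download_image(url)` would save them one by one. A real scraper would probably also want error handling and a pause between requests.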
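Answer 1 (score: 0)
You are grabbing the page source before all of the content has loaded. Simply waiting longer will not help, because only the first images are loaded; the rest load only once they scroll into view.
You have to wait a moment and then scroll down toward the bottom of the page step by step: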
time.sleep(5)
for i in range(10):
    driver.execute_script("window.scrollBy(0, 350)")
    time.sleep(1)
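The fixed count of ten steps may fall short on a longer page. As a hedged variant, the sketch below keeps scrolling until the viewport reaches the bottom of the page, with a cap on the number of steps. The function name `scroll_to_bottom` and its parameters are assumptions for illustration; it relies only on the driver's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, step=350, pause=1.0, max_steps=50):
    """Scroll down in small steps until the viewport reaches the page bottom.

    Returns the number of scroll steps performed (at most max_steps).
    """
    steps = 0
    while steps < max_steps:
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        steps += 1
        time.sleep(pause)  # give lazy-loaded images time to appear
        # Re-check after every step, since the page may grow while scrolling
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight;"
        )
        if at_bottom:
            break
    return steps
```

Calling `scroll_to_bottom(driver)` in place of the `for` loop should give the same effect on short pages. The `max_steps` cap is there so that a page that keeps growing cannot loop forever.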
Example
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
driver =webdriver.Chrome('chromedriver')
products=[]
prices=[]
images=[]
driver.get('https://shopee.co.id/search?keyword=laptop')
# Wait for the initial load, then scroll step by step so lazy-loaded images appear
time.sleep(5)
for i in range(10):
    driver.execute_script("window.scrollBy(0, 350)")
    time.sleep(1)
# Read the source only after scrolling
content=driver.page_source
soup=BeautifulSoup(content, 'html.parser')
for item in soup.select('div[data-sqe="item"]'):
    dataImg=item.img
    name=item.find('div',class_="_1Sxpvs")
    price=item.find('div',class_="QmqjGn")
    if dataImg is not None:
        products.append(name.get_text())
        prices.append(price.get_text())
        images.append(dataImg['src'])
df=pd.DataFrame({'Product Name':products,'Price':prices,'Images':images})
df
Output
   Product Name                                        Price                      Images
0  [ACQ] Meja Laptop Lipat Portable                    Rp51.990                   https://cf.shopee.co.id/file/83a9e6e8ecad7a3db...
1  LENOVO Thinkpad CORE i5 Ram 8GB/ 2TB/1TB/500GB...   Rp2.100.000 - Rp4.200.000  https://cf.shopee.co.id/file/44fbc24f5c585cda1...
2  HP Laptop 14s-cf3076TU/i3-1005G1/256GB SSD/14"...   Rp6.599.000Rp6.598.999     https://cf.shopee.co.id/file/170a45679aa5002f1...
...
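The Price column in this output holds raw strings, including a range and what looks like two amounts run together (possibly an original and a discounted price; that reading is an assumption). If numeric values are needed, a minimal sketch like the following could pull them out. The helper name `parse_prices` is hypothetical, and it assumes that dots are thousands separators, as the strings above suggest.

```python
import re


def parse_prices(text):
    """Return every 'Rp' amount found in a scraped price string as an integer."""
    # Assumption: dots are thousands separators, e.g. 'Rp2.100.000'.
    amounts = re.findall(r"Rp\s*(\d[\d.]*)", text)
    return [int(amount.replace(".", "")) for amount in amounts]
```

Applied to the DataFrame, `df['Price'].apply(parse_prices)` would give a list of integers per row, leaving it up to you which amount to keep.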