Web scraping Shopee (extracting image URLs)

Time: 2021-02-06 02:37:16

Tags: python pandas selenium web-scraping beautifulsoup

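I am trying to scrape product names, prices, and images from Shopee. However, I can't seem to extract the images. Is it because of the HTML? The lookup by image class that I assign to dataImg does not seem to find anything.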

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

driver =webdriver.Chrome('chromedriver')

products=[]
prices=[]
images=[]

driver.get('https://shopee.co.id/search?keyword=laptop')

content=driver.page_source
soup=BeautifulSoup(content)
soup

for link in soup.find_all('div',class_="_3EfFTx"):
    print('test')
    print(link)

for link in soup.find_all('div',class_="_3EfFTx"):
    #print(link)
    dataImg=link.find('img',class_="_1T9dHf V1Fpl5")
    print(dataImg)
    name=link.find('div',class_="_1Sxpvs")
    #print(name.get_text())
    price=link.find('div',class_="QmqjGn")
    #print(price.get_text())
    
    if dataImg is not None:
        products.append(name.get_text())
        prices.append(price.get_text())
        images.append(dataImg['src'])

df=pd.DataFrame({'Product Name':products,'Price':prices,'Images':images})
df

2 answers:

Answer 0 (score: 0)

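The site loads its images with JavaScript. To get around this, use Selenium with a short delay before reading the page. The code below prints the src of each image: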

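from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

driver = webdriver.Chrome(r'F:\Sonstiges\chromedriver\chromedriver.exe')
driver.get('https://shopee.co.id/search?keyword=laptop')

# Give the page's JavaScript time to load the images
sleep(8)
# find_elements(By.CLASS_NAME, ...) also works on Selenium 4,
# where find_elements_by_class_name was removed
imgs = driver.find_elements(By.CLASS_NAME, '_1T9dHf')
for img in imgs:
    img_url = img.get_attribute("src")
    if img_url:  # skip images whose src has not been set yet
        print(img_url)
driver.quit()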
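
A fixed sleep(8) either wastes time or is too short on a slow connection. As a rough sketch of an alternative, the snippet below polls until at least one image URL is available. The helper names poll_until and image_urls are made up for this illustration; Selenium also ships its own WebDriverWait class for this kind of waiting. The string "class name" is the locator value behind By.CLASS_NAME, used here so the sketch does not need extra imports.

```python
import time


def poll_until(fetch, timeout=10.0, interval=0.5):
    """Call fetch() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        time.sleep(interval)


def image_urls(driver, class_name="_1T9dHf"):
    """Return the non-empty src attributes of the matching <img> elements."""
    # "class name" is the locator string behind Selenium's By.CLASS_NAME
    elements = driver.find_elements("class name", class_name)
    urls = (element.get_attribute("src") for element in elements)
    return [url for url in urls if url]


# Hypothetical usage with an already opened driver:
# urls = poll_until(lambda: image_urls(driver), timeout=15)
```

This returns as soon as the first images have a src, so it does not guarantee that every image on the page has been loaded.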

To get the image files themselves, simply download each of the retrieved URIs. If you are using Beautiful Soup only because it runs in the background, Selenium can also be run headless (in the background).
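
Downloading the retrieved URIs could look roughly like the sketch below, which uses only the standard library. The function name download_images, the images directory, and the image_000 style file names are assumptions made for this illustration, not part of the original answer.

```python
import os
import urllib.request


def download_images(urls, dest_dir="images", opener=urllib.request.urlopen):
    """Save each URL into dest_dir and return the list of written file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for index, url in enumerate(urls):
        # Name files by position instead of guessing a name from the URL
        path = os.path.join(dest_dir, f"image_{index:03d}")
        with opener(url) as response, open(path, "wb") as fh:
            fh.write(response.read())
        paths.append(path)
    return paths


# Hypothetical usage with the URLs collected by Selenium:
# saved = download_images(image_url_list)
```

The opener parameter exists only so that the download step can be swapped out, for example in a test or to add custom request headers.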

Answer 1 (score: 0)

What happens?

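You are grabbing the page source before all of the content has loaded. Simply waiting longer will not help, because only the first images are loaded; the rest are loaded only once they scroll into view.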

How to fix it?

You have to wait a moment and then scroll down step by step toward the bottom of the page:

time.sleep(5)
for i in range(10):
    driver.execute_script("window.scrollBy(0, 350)")
    time.sleep(1) 
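
Ten steps of 350 pixels cover 3500 pixels, which may not reach the bottom of a longer results page. As a hedged variation, the sketch below keeps scrolling until the browser reports that the bottom has been reached. The function name scroll_to_bottom and its parameters are invented for this illustration; only execute_script is a Selenium call.

```python
import time


def scroll_to_bottom(driver, step=350, pause=1.0, max_steps=100):
    """Scroll down in small steps; return True once the page bottom is reached."""
    for _ in range(max_steps):
        # arguments[0] is the first extra argument passed to execute_script
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        time.sleep(pause)  # give lazy-loaded images time to appear
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.documentElement.scrollHeight - 1;"
        )
        if at_bottom:
            return True
    return False  # gave up after max_steps without reaching the bottom


# Hypothetical usage before reading driver.page_source:
# scroll_to_bottom(driver)
```

The max_steps limit is a safeguard for pages that keep growing while you scroll.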

Example

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
   
driver =webdriver.Chrome('chromedriver')

products=[]
prices=[]
images=[]

driver.get('https://shopee.co.id/search?keyword=laptop')

time.sleep(5)
for i in range(10):
    driver.execute_script("window.scrollBy(0, 350)")
    time.sleep(1)
    
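# Read the page source only after scrolling, so the lazy-loaded images are present
content=driver.page_source
soup=BeautifulSoup(content,'html.parser')

# Each search result is a div with data-sqe="item"
for item in soup.select('div[data-sqe="item"]'):
    dataImg=item.img
    name=item.find('div',class_="_1Sxpvs")
    price=item.find('div',class_="QmqjGn")
    
    # Skip items whose image is missing or has no src yet
    if dataImg is not None and dataImg.get('src'):
        products.append(name.get_text())
        prices.append(price.get_text())
        images.append(dataImg['src'])

df=pd.DataFrame({'Product Name':products,'Price':prices,'Images':images})
df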

Output

   Product Name                                       Price                      Images
0  [ACQ] Meja Laptop Lipat Portable                   Rp51.990                   https://cf.shopee.co.id/file/83a9e6e8ecad7a3db...
1  LENOVO Thinkpad CORE i5 Ram 8GB/ 2TB/1TB/500GB...  Rp2.100.000 - Rp4.200.000  https://cf.shopee.co.id/file/44fbc24f5c585cda1...
2  HP Laptop 14s-cf3076TU/i3-1005G1/256GB SSD/14"...  Rp6.599.000Rp6.598.999     https://cf.shopee.co.id/file/170a45679aa5002f1...
...