Web scraping with Selenium and BeautifulSoup

Posted: 2020-07-17 09:51:21

Tags: python selenium web-scraping

I am trying to scrape product information from Grofers and BigBasket, but I am having trouble with the findAll() function. len(imgList) always returns 0, and the result is always an empty list. How can I fix this? Can anyone help? I also get status code 403 from Grofers when using requests.

```python
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://grofers.com/cn/grocery-staples/cid/16'
driver = webdriver.Chrome(r'C:\Users\HP\data\chromedriver.exe')
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
data = soup.findAll('plp-product__name')
print(data)
```

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://grofers.com/cn/grocery-staples/cid/16')
content = response.content
data = BeautifulSoup(content, 'html5lib')
read = data.findAll('plp-product__name')
read
```
The output I get is: []

1 Answer:

Answer 0 (score: 0)

You haven't imported webdriver.


Try:

```python
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r'C:\Users\HP\data\chromedriver.exe')
```
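As for the 403 you see with requests: servers frequently reject the default `python-requests/x.y` User-Agent outright, and sending a browser-like header is a common (though not guaranteed) workaround. A minimal sketch, with the header value invented for illustration; the request is built but not sent:

```python
import requests

url = 'https://grofers.com/cn/grocery-staples/cid/16'

# Sites often return 403 to the default "python-requests/x.y"
# User-Agent; a browser-like string sometimes gets past that check.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Build the request without sending it, to show the header is attached
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])

# To actually fetch: response = requests.get(url, headers=headers)
```

Note that the site may also require cookies or block data-center IPs, in which case driving a real browser through Selenium is the more reliable route.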

Or:

```python
data = soup.select('div.plp-product__name')
```

Note that the correct call is data = soup.find_all("div", class_="plp-product__name"): you must pass the tag name together with the class, and the spelling find_all should be used rather than findAll, since the camel-case name is deprecated in the bs4 library.
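The empty list in the question has the same cause: soup.findAll('plp-product__name') searches for a *tag* named plp-product__name, not for that class. A minimal sketch on an invented HTML fragment (the markup below only mimics the page, it is not taken from it) shows the difference:

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the product markup on the page
html = """
<div class="plp-product__name"><span>Basmati Rice</span></div>
<div class="plp-product__name"><span>Sunflower Oil</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Wrong: treats the class name as a tag name, so nothing matches
wrong = soup.find_all('plp-product__name')
print(wrong)  # []

# Right: match <div> elements by class
right = soup.find_all('div', class_='plp-product__name')
print([d.get_text() for d in right])  # ['Basmati Rice', 'Sunflower Oil']

# Equivalent CSS selector
same = soup.select('div.plp-product__name')
```

On the real page the fragment would come from driver.page_source, as in the question's first snippet.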