BeautifulSoup web scraping cannot find the list of image URLs on Flipkart

Time: 2020-09-25 10:10:49

Tags: pandas web-scraping beautifulsoup python-requests

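I have been trying to scrape images of Raybay sunglasses from flipkart.com, but I cannot extract the image URLs. My code is below. Can someone help me with corrected code?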

url=requests.get("https://www.flipkart.com/search?q=raybay%20sunglasses%20for%20women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off")
content=url.text
soup=BeautifulSoup(content,'lxml')
image_url = []
p=soup.find_all('img', {'class':'_3togXc'})
for item in p:
    imgdata=item.findChildren('alt src')
    for i in imgdata:
        image_url.append(i)

1 answer:

Answer 0 (score: 0)

The page content appears to be rendered dynamically by JavaScript, so I suggest combining BeautifulSoup with Selenium.
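One way to check this diagnosis before bringing in a browser is to count the matching `<img>` tags in whatever HTML you actually received. The sketch below is only an illustration: the helper `count_images_with_class` and both HTML strings are hypothetical stand-ins (not real Flipkart responses), and it uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup


def count_images_with_class(html: str, css_class: str) -> int:
    """Return how many <img> tags in `html` carry the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("img", class_=css_class))


# Hypothetical stand-in for a raw HTTP response: only a script tag, no images yet.
raw_html = '<div id="container"><script src="app.js"></script></div>'

# Hypothetical stand-in for the same page after a browser has run its JavaScript.
rendered_html = (
    '<div id="container">'
    '<img class="_3togXc" src="https://example.com/a.jpeg" alt="a">'
    '<img class="_3togXc" src="https://example.com/b.jpeg" alt="b">'
    "</div>"
)

print(count_images_with_class(raw_html, "_3togXc"))       # prints 0
print(count_images_with_class(rendered_html, "_3togXc"))  # prints 2
```

If the count is zero for the text returned by `requests.get(...)` but the images are visible in the browser's inspector, the markup is most likely being added by JavaScript, and a browser-driven approach such as the Selenium script that follows is a reasonable next step.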

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Run Chrome without a visible window
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')

# Selenium 4 takes the driver path through a Service object
wd = webdriver.Chrome(service=Service('<PATH_TO_SELENIUM_WEBDRIVER>'), options=chrome_options)

url = 'https://www.flipkart.com/search?q=raybay%20sunglasses%20for%20women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'

try:
    # Load the page via Selenium so that its JavaScript runs
    wd.get(url)

    # Wait up to 5 seconds for the results grid to appear
    # (the class names are taken from the page markup and may change)
    table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.CLASS_NAME, '_1HmYoV')))

    # Parse the content of the grid
    soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')

    # Extract the image URLs
    image_url = [image['src'] for image in soup.find_all('img', {'class': '_3togXc'})]
finally:
    # Close the browser even if the wait times out
    wd.quit()

print(image_url)
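Separately from the rendering issue, the loop in the question would not collect URLs even on fully rendered HTML: `findChildren` (an older alias of `find_all`) searches for child tags by name, while `alt` and `src` are attributes of the `<img>` tag itself. The following minimal sketch shows the difference on a hypothetical HTML snippet (the URLs and alt texts are made up for illustration), again using `html.parser` to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two images with the class from the question,
# one of which has no src attribute.
html = (
    "<div>"
    '<img class="_3togXc" alt="Aviator" src="https://example.com/1.jpeg">'
    '<img class="_3togXc" alt="No source yet">'
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
images = soup.find_all("img", {"class": "_3togXc"})

# Searching for child tags named "alt src" finds nothing,
# because <img> elements have no children at all.
children = [child for img in images for child in img.find_all("alt src")]

# Attributes are read from the tag itself; .get() returns None
# instead of raising KeyError when the attribute is missing.
image_urls = [img.get("src") for img in images if img.get("src")]
alt_texts = [img.get("alt", "") for img in images]

print(children)    # prints []
print(image_urls)  # prints ['https://example.com/1.jpeg']
print(alt_texts)   # prints ['Aviator', 'No source yet']
```

Using `.get("src")` rather than `image['src']` is a defensive choice: it skips images that have no `src` instead of stopping the whole extraction with a `KeyError`.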