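I have been trying to scrape images of Raybay sunglasses from flipkart.com, but I cannot extract the image URLs. My code is below. Can someone help me with corrected code?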
import requests
from bs4 import BeautifulSoup
url=requests.get("https://www.flipkart.com/search?q=raybay%20sunglasses%20for%20women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off")
content=url.text
soup=BeautifulSoup(content,'lxml')
image_url = []
p=soup.find_all('img', {'class':'_3togXc'})
for item in p:
    imgdata=item.findChildren('alt src')
    for i in imgdata:
        image_url.append(i)
Answer 0 (score: 0)
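It looks like the page content is rendered dynamically by JavaScript, so the HTML returned by a plain request does not contain the product images. I suggest combining BeautifulSoup with Selenium: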
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
# Selenium 3 style call; Selenium 4 takes the driver path through a Service object instead
wd = webdriver.Chrome('<PATH_TO_SELENIUM_WEBDRIVER>', options=chrome_options)
url = 'https://www.flipkart.com/search?q=raybay%20sunglasses%20for%20women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'
# load page via selenium
wd.get(url)
# wait up to 5 seconds for the results grid to be present in the DOM
table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.CLASS_NAME, '_1HmYoV')))
# parse content of the grid
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
# extract image URLs
image_url = [image['src'] for image in soup.find_all('img', {'class':'_3togXc'})]
print(image_url)
# close the browser once scraping is done
wd.quit()
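The extraction step at the end can be tried offline, without a browser, to check the selection logic on its own. Below is a minimal sketch that uses Python's standard-library `html.parser` in place of BeautifulSoup (a swapped-in technique, not what the answer above uses). The HTML fragment, the example.com URLs and the `ImgSrcCollector` name are made up for illustration. It also shows why the loop in the question collects nothing: `src` is an attribute of the `img` tag, not a child element, so `findChildren('alt src')` has nothing to return.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; src and class live here
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        src = attr_map.get("src")
        if self.target_class in classes and src:
            self.urls.append(src)


# Hypothetical fragment standing in for the rendered results grid.
sample_html = """
<div class="_1HmYoV">
  <img class="_3togXc" alt="Sunglasses A" src="https://example.com/a.jpeg">
  <img class="other" alt="Banner" src="https://example.com/banner.jpeg">
  <img class="_3togXc extra" alt="Sunglasses B" src="https://example.com/b.jpeg">
</div>
"""

collector = ImgSrcCollector("_3togXc")
collector.feed(sample_html)
image_url = collector.urls
print(image_url)
```

With BeautifulSoup the same idea is the `image['src']` lookup in the answer's list comprehension: attributes are read from the tag itself rather than searched for as children.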