Scraper not picking up the image URL "Beautiful Soup"

Time: 2020-07-03 11:28:13

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

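My scraper does not consistently pick up the image URL on the page. Sometimes it does, but most of the time it doesn't. When the URL is not found, this is what I get in the CSV: data:

I can't see what is going wrong. Can anyone help?

I tried increasing the sleep time to make sure every element on the page has loaded. I see the same thing on other pages of the same site: sometimes it works, sometimes it doesn't.

Should I be using a different method to pick up the element than session_image = session_soup.img['src']?

I have used this same approach many times to scrape other sites and never run into this problem. Is it something to do with this particular site?

My code: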

from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import ElementClickInterceptedException
import time
import re
import csv

# initialize the chrome browser
browser = webdriver.Chrome(executable_path=r'./chromedriver')
browser.implicitly_wait(20)

# URL
class_pass_url = 'https://www.classpass.com'

# Create the file and write the header row; the encoding is set explicitly because write() was raising errors
f = open('ClassPass.csv', 'w', encoding='utf-8')
headers = 'IMAGE URL\n'
f.write(headers)

# classpass results page
page = "https://classpass.com/studios/sum-yoga-london"

browser.get(page)

# Browser waits

#browser_wait(browser, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "line")))
time.sleep(4)

# Scrolls to bottom of page to reveal all classes
# browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

studio_page_source = browser.page_source
studio_soup = soup(studio_page_source, "html.parser")

try:
    studio_name = studio_soup.h2.text
except (AttributeError, TypeError,) as e:
    pass

sessions = studio_soup.find_all('h3', {'class': '_4Fnd4DwToJFbbU5jAfqSv'})

for session in sessions:
    twitter, facebook, instagram, session_website, telnumber, session_description = '', '', '', '', '', ''
    session_link = class_pass_url + session.a['href']
    browser.get(session_link)

    #browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    #browser.execute_script("window.scrollTo(0,0);")
    #time.sleep(2)

    browser_wait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, '_1ruz3nW6mOnylv99BOA_tm')))

    # parses individual class page
    session_page_source = browser.page_source
    session_soup = soup(session_page_source, "html.parser")


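    # Default to an empty string so a failed lookup cannot raise NameError
    # or silently reuse the URL from the previous session
    session_image = ''
    try:
        session_image = session_soup.img['src']
    except (AttributeError, TypeError, KeyError):
        pass

    print(session_image)

    f.write(session_image + "\n")

# Close the CSV file and the browser once every session has been processed
f.close()
browser.quit()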

1 answer:

Answer 0 (score: 0)

To get the image, you can make the following change in your code:

session_image = session_soup.find('meta', {'property': "og:image"})
session_image = session_image.get('content')
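
Building on the two lines above, here is a minimal sketch of a more defensive version. It rests on an assumption that the question does not confirm: the "data:" value in the CSV suggests that the first <img> on the page sometimes carries an inline data: URI placeholder until the real image is loaded, so session_soup.img['src'] picks up the placeholder. The helper name extract_image_url and the sample markup are hypothetical and only for illustration; the markup is not taken from the real site.

```python
from bs4 import BeautifulSoup


def extract_image_url(html):
    """Return the best image URL found in html, or '' if none is usable."""
    soup = BeautifulSoup(html, "html.parser")

    # Preferred: the Open Graph meta tag used in the answer above
    og_tag = soup.find("meta", {"property": "og:image"})
    if og_tag is not None and og_tag.get("content"):
        return og_tag["content"]

    # Fallback: the first <img> whose src is a real URL rather than an
    # inline data: URI (assumed here to be a lazy-loading placeholder)
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if src and not src.startswith("data:"):
            return src

    return ""


# Hypothetical markup for illustration only
sample_html = """
<html><head>
<meta property="og:image" content="https://example.com/studio.jpg">
</head><body>
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=">
<img src="https://example.com/other.jpg">
</body></html>
"""

print(extract_image_url(sample_html))
```

In the scraper from the question, this could be called as session_image = extract_image_url(browser.page_source). Returning an empty string instead of raising keeps the CSV row count aligned with the number of sessions even when a page has no usable image.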