Scraping a dynamic web page with Selenium

Time: 2020-12-24 21:39:10

Tags: python-3.x selenium-webdriver web-scraping

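I'm trying to get the links to the posts on this page, but they are apparently generated only when each post's image is clicked. I'm using Selenium and beautifulsoup4 with Python 3.8. Any idea how to get the links while Selenium moves on through the following pages?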


URL: https://www.goplaceit.com/cl/mapa?id_modalidad=1&tipo_propiedad=1%2C2&selectedTool=list#12/-33.45/-70.66667

XPath of a post image: //*[@id="gpi-property-list-container"]/div[3]/div[1]/div[1]/img

Clicking an image opens a new tab with a shortened URL of this form: https://www.goplaceit.com/propiedad/6198212

That URL then redirects to one of this form:

https://www.goplaceit.com/cl/propiedad/venta/departamento/santiago/6198212-depto-con-1d-1b-y-terraza-a-pasos-del-metro-toesca-bodega
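Both URL forms above carry the same numeric property id (6198212). The following is a minimal sketch, assuming that pattern holds for every post (only this one example confirms it), of how the id could be pulled out of either form and used to de-duplicate collected links. The helper names `property_id` and `short_url` are hypothetical.

```python
import re
from urllib.parse import urlparse


def property_id(url):
    """Return the leading digits of the last path segment, or None.

    Assumption: the id is the number that starts the final path segment,
    as in the two example URLs from the question.
    """
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    match = re.match(r"(\d+)(?:-|$)", last_segment)
    return match.group(1) if match else None


def short_url(pid):
    """Build the shortened form seen in the newly opened tab."""
    return f"https://www.goplaceit.com/propiedad/{pid}"
```

Calling `property_id` on the shortened and on the redirected URL should give the same string, so the id can serve as a key when the same post is reached twice.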

My code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import winsound
from timeit import default_timer as timer
from selenium.webdriver.common.keys import Keys
start = timer()

PROXY = "PROXY" # IP:PORT or HOST:PORT
path_to_extension = r"extension"
options = Options()
#options.add_argument("--incognito")
options.add_argument('load-extension=' + path_to_extension)
#options.add_argument('--disable-java')
options.headless = False
prefs = {"profile.default_content_setting_values.notifications" : 2}
prefs2 = {"profile.managed_default_content_settings.images": 2}
prefs.update(prefs2)
prefs3 = {"profile.default_content_settings.cookies": 2}
prefs.update(prefs3)
options.add_experimental_option("prefs",prefs)
options.add_argument("--start-maximized")
options.add_argument('--proxy-server=%s' % PROXY)
driver = webdriver.Chrome('chromedriver.exe', options=options)
driver.get('https://www.goplaceit.com/cl/')
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/nav/div/div[2]/div[1]/button'))).click()
correo = driver.find_element(By.XPATH, '//*[@id="email"]')
correo.send_keys("Mail")
contraseña = driver.find_element(By.XPATH, '//*[@id="password"]')
contraseña.send_keys("password")
contraseña.send_keys(Keys.ENTER)
time.sleep(7)


elem = driver.find_element(By.XPATH, '//*[@id="gpi-main-landing-search-input"]/div/input')
elem.click()
elem.send_keys("keywords")
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="gpi-main-landing-search-input"]/div/div[1]/ul/li[1]/a/div/div[1]'))).click()
elem.send_keys(Keys.ENTER)
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[1]/div[2]/div/div[1]/div/div[1]/button'))).click()
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="custom-checkbox"]'))).click()

page_number = 0
max_page_number = 30
while page_number <= max_page_number:
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(),"paginator-btn-right")]'))).click()
    page_number += 1  # advance the counter so the loop terminates

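1 answer:

Answer 0 (score: 1)

You can get the URLs easily by clicking an image, saving the URL of the tab that opens, switching back to the first tab, and repeating this for every image: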

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Assumes chromedriver is on PATH; reuse your existing driver if you already have one.
driver = webdriver.Chrome()
driver.get("https://www.goplaceit.com/cl/mapa?id_modalidad=1&tipo_propiedad=1%2C2&selectedTool=list#8/-33.958/-71.206")
# Wait until the post images are present in the list view.
images = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='sc-iyvyFf ljSqTz']//img")))
urls = []
window_before = driver.window_handles[0]  # the tab that holds the listing
for i, image in enumerate(images):
    image.click()  # opens the post in a new tab
    # Wait for the new tab to exist; implicitly_wait only affects element lookups.
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(i + 2))
    window_after = driver.window_handles[i + 1]
    driver.switch_to.window(window_after)
    urls.append(driver.current_url)
    driver.switch_to.window(window_before)
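
The loop above leaves every opened tab in place and picks the new tab by index. Below is a rough sketch of one variation, not a tested solution for this site: it finds the new tab by comparing window handles before and after the click, closes it once the URL has been read, and wraps the per-page work in a paging loop, which is what the question asks about. The names `url_from_new_tab`, `collect_all_pages`, `get_page_urls` and `go_to_next_page` are made up for illustration. The only driver members used are `window_handles`, `current_window_handle`, `switch_to.window`, `current_url` and `close`.

```python
import time


def url_from_new_tab(driver, click, timeout=10.0, poll=0.2):
    """Run click(), wait for the tab it opens, and return that tab's URL.

    The new tab is closed afterwards and focus returns to the original tab.
    """
    original = driver.current_window_handle
    known = set(driver.window_handles)
    click()
    deadline = time.monotonic() + timeout
    while True:
        # Identify the new tab by set difference rather than by index.
        new = [h for h in driver.window_handles if h not in known]
        if new:
            break
        if time.monotonic() > deadline:
            raise TimeoutError("no new tab opened after the click")
        time.sleep(poll)
    driver.switch_to.window(new[0])
    try:
        return driver.current_url
    finally:
        driver.close()  # closes only the tab that currently has focus
        driver.switch_to.window(original)


def collect_all_pages(get_page_urls, go_to_next_page, max_pages=30):
    """Gather URLs page by page, dropping duplicates but keeping order.

    get_page_urls() returns the URLs found on the current page;
    go_to_next_page() returns False once there is no further page.
    """
    seen = {}
    for _ in range(max_pages):
        for url in get_page_urls():
            seen.setdefault(url, None)
        if not go_to_next_page():
            break
    return list(seen)
```

With Selenium, `get_page_urls` could locate the images as in the answer and call `url_from_new_tab(driver, image.click)` for each one, while `go_to_next_page` could click the paginator button from the question and return False when that button can no longer be found. One caveat: the URL read right after switching tabs may still be the shortened form (or a blank page) if the redirect has not finished, so an explicit wait such as `EC.url_contains` before reading `current_url` may be needed.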