Unable to get a link from a dynamic web page

Time: 2018-06-06 20:45:28

Tags: python python-3.x selenium selenium-webdriver web-scraping

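I have written a script in Python, using Selenium, to parse a certain link from a web page. The link sits inside an iframe. I tried switching to the iframe, but I cannot read its content to get the specific link I am after.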

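Here is how to reach the target:

  1. The log in link is free to use.

  2. After logging in, the site automatically goes to the first page of the desired content.

  3. Several names (members) are listed there, each linking to that member's profile.

  4. On a profile page there is a link to the member's current company (under Professional Experience), which is what I want to parse.

  5. The required link in the first profile (under Professional Experience) looks like this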

This is the log in link

This is the script I have tried so far:

    from selenium import webdriver
    from urllib.parse import urljoin
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    link = "https://www.xing.com"
    
    driver = webdriver.Chrome()
    driver.get("replace with above link")
    wait = WebDriverWait(driver, 10)
    
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#login_form_username"))).send_keys("user")
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#login_form_password"))).send_keys("pass",Keys.RETURN)
    
    links = [urljoin(link,items.find_element_by_css_selector(".user-name").get_attribute("href")) for items in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".contact")))]
    for link in links:
        driver.get(link)
        name = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h2 span"))).text
        wait.until(EC.frame_to_be_available_and_switch_to_it(driver.find_element_by_css_selector("#tab-content")))
        #I get timeout exception in the following line
        link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".job-company-name a"))).text
        print(name,link)
    

I don't know whether this is useful. Anyway, link to the source

1 answer:

Answer 0: (score: 0)

I seem to have found a solution to the problem. I am ready to accept another answer if a better solution comes up:

    links = [urljoin(link,items.find_element_by_css_selector(".user-name").get_attribute("href")) for items in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".contact")))]
    for link in links:
        driver.get(link)
        name = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h2 span"))).text
        ilink = driver.find_element_by_css_selector("#tab-content").get_attribute("src")
        driver.get(ilink)   # this is what I did to get around that
        try:
            link = driver.find_element_by_css_selector(".job-company-name a").text
        except Exception:
            link = ""
        print(name, link)

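I did not switch to the iframe. I just read the link (the `src`) out of the iframe element and loaded it directly. This is not the solution I expected, but it works.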
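
The same idea can be tried without a browser. The following is a minimal sketch, not the code from the script above: it pulls the `src` of the iframe out of a page's HTML with the standard library and resolves it against the page URL. The helper names `IframeSrcFinder` and `iframe_url`, the sample markup, and the profile URL are all made up for illustration; in the Selenium script the HTML would presumably come from `driver.page_source`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcFinder(HTMLParser):
    """Remember the src attribute of the first <iframe> with a given id."""

    def __init__(self, frame_id):
        super().__init__()
        self.frame_id = frame_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attr_map = dict(attrs)
        if tag == "iframe" and attr_map.get("id") == self.frame_id and self.src is None:
            self.src = attr_map.get("src")


def iframe_url(page_html, page_url, frame_id="tab-content"):
    """Return the absolute URL the iframe loads, or None if no such iframe exists."""
    finder = IframeSrcFinder(frame_id)
    finder.feed(page_html)
    if finder.src is None:
        return None
    # The src may be relative, so resolve it against the page it was found on
    return urljoin(page_url, finder.src)


# Hypothetical profile markup and URL, for illustration only
sample_html = '<div><iframe id="tab-content" src="/profile/Jane_Doe/cv"></iframe></div>'
print(iframe_url(sample_html, "https://www.xing.com/profile/Jane_Doe"))
```

This only helps when the iframe's `src` is present in the markup. If the site fills the frame in from script, or the frame's URL only works inside the logged-in browser session, loading it with `driver.get` as in the answer is probably still the simpler route.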