How do I make an element visible when web scraping?

Date: 2016-12-19 20:35:50

Tags: python python-3.x selenium selenium-webdriver

I am trying to extract some information about the companies on Fortune's 100 Best Companies to Work For list.

I am stepping through each company and extracting the information. Here is the code:

import datetime
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib.request import urlopen
from selenium import webdriver
import time

init_url='http://fortune.com/best-companies/google-alphabet-1/'


i=1
while i<=4:
    page=urlopen(init_url)
    soup=BeautifulSoup(page,'html.parser')
    first_table=soup.find('table',{"class":"company-data-table"})
    th1=first_table.find('th',text='Industry')
    td1=th1.findNext('td')
    print(td1.text)
    th2=first_table.find('th',text='Type of organization')
    td2=th2.findNext('td')
    print(td2.text)

    driver=webdriver.Firefox()
    driver.get(init_url)
    time.sleep(5)
    elem1=driver.find_element_by_link_text("Next Company")
    elem1.click()
    init_url=driver.current_url
    driver.quit()

    i+=1

However, this code keeps giving me this error:

Traceback (most recent call last):
  File "C:/Users/pc/Desktop/panda_try.py", line 28, in <module>
    elem1.click()
  File "C:\Users\pc\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 77, in click
    self._execute(Command.CLICK_ELEMENT)
  File "C:\Users\pc\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\pc\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "C:\Users\pc\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotVisibleException: Message: Element is not visible

How should I fix this? I am racing against the clock on this one, so any help would be greatly appreciated. Thanks!

2 Answers:

Answer 0 (score: 1)

There are multiple elements matching the "link text" locator. You should filter out the visible link and then click it:

for link in driver.find_elements_by_link_text("Next Company"):
    if link.is_displayed():
        link.click()
        break
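The filtering step above can be factored into a small helper. `first_displayed` is a hypothetical name (not part of Selenium), and the stub class below only imitates the `is_displayed()` part of Selenium's `WebElement` interface so the logic can be exercised without a browser:

```python
def first_displayed(elements):
    """Return the first element reporting is_displayed() == True, or None."""
    for el in elements:
        if el.is_displayed():
            return el
    return None

# Stub mimicking the one WebElement method the helper uses.
class StubElement:
    def __init__(self, visible):
        self.visible = visible

    def is_displayed(self):
        return self.visible

links = [StubElement(False), StubElement(True), StubElement(True)]
visible = first_displayed(links)
print(visible is links[1])  # True: the first visible link is chosen
```

With a live driver you would call it as `first_displayed(driver.find_elements_by_link_text("Next Company"))` and then `.click()` the result.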

Or, an alternative that may also help and would, by extension, replace the unreliable time.sleep(): an Explicit Wait with the element_to_be_clickable expected condition:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(init_url)

wait = WebDriverWait(driver, 10)
link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "Next Company")))
link.click()
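Even with an explicit wait, a slow page can still raise a timeout. One way to harden the click is a generic retry wrapper; this is a sketch under assumptions, not part of Selenium, and `action` here stands in for the wait-and-click call so the pattern can run without a browser:

```python
import time

def retry(action, attempts=3, delay=0.1, exceptions=(Exception,)):
    """Call `action` up to `attempts` times, sleeping `delay` seconds
    between failures; re-raise the last error if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Example: an action that fails twice, then succeeds on the third call.
calls = {"n": 0}

def flaky_click():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("element not clickable yet")
    return "clicked"

print(retry(flaky_click))  # clicked
```

In the scraper you would pass a lambda that performs the `wait.until(...)` and `click()` together, catching `TimeoutException` in `exceptions`.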

Answer 1 (score: 0)

What I would use in this case is an XPath selector with a WebDriverWait on the element that has to be clicked. I also made a few changes, such as loading the browser only once, which makes the task run faster. In my case Selenium would not run with the latest Firefox, so I used an older Selenium version (2.49) and Firefox 33, set via FirefoxBinary when loading the web driver.

from bs4 import BeautifulSoup
from urllib2 import urlopen
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

init_url = 'http://fortune.com/best-companies/google-alphabet-1/'
next_company_xpath = "//article[contains(@class, 'current')]//div[contains(@class, 'pagination')]//a[contains(.,'Next Company')]"

# Load the webdriver
driver = webdriver.Firefox(firefox_binary=FirefoxBinary('firefox/firefox'))
driver.set_window_size(1980, 1080)
driver.get(init_url)

i = 1
while i <= 4:
    page = urlopen(init_url)
    soup = BeautifulSoup(page, 'html.parser')
    first_table = soup.find('table', {"class": "company-data-table"})
    th1 = first_table.find('th', text='Industry')
    td1 = th1.findNext('td')
    print(td1.text)
    th2 = first_table.find('th', text='Type of organization')
    td2 = th2.findNext('td')
    print(td2.text)

    wait = WebDriverWait(driver, 10)
    link = wait.until(EC.element_to_be_clickable((By.XPATH, next_company_xpath)))
    link.click()
    init_url = driver.current_url

    i += 1

driver.close()
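The `th`/`td` extraction used in both versions can be checked offline against a static HTML fragment. The markup below only approximates the real page's `company-data-table` (the live page may differ), and `field` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

html = """
<table class="company-data-table">
  <tr><th>Industry</th><td>Information Technology</td></tr>
  <tr><th>Type of organization</th><td>Public</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {"class": "company-data-table"})

def field(name):
    """Return the text of the <td> following the <th> whose text is `name`."""
    th = table.find('th', text=name)
    return th.findNext('td').text if th else None

print(field('Industry'))              # Information Technology
print(field('Type of organization'))  # Public
```

Testing the parsing against a saved snapshot like this separates HTML-structure bugs from Selenium visibility issues.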