Question

I am trying to extract some information from the links in Fortune's 100 Best Companies to Work For list.

I am essentially going through each company page and extracting the information. Here is the code:

import datetime
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib.request import urlopen
from selenium import webdriver
import time

init_url='http://fortune.com/best-companies/google-alphabet-1/'


i=1
while i<=4:
    page=urlopen(init_url)
    soup=BeautifulSoup(page,'html.parser')
    first_table=soup.find('table',{"class":"company-data-table"})
    th1=first_table.find('th',text='Industry')
    td1=th1.findNext('td')
    print(td1.text)
    th2=first_table.find('th',text='Type of organization')
    td2=th2.findNext('td')
    print(td2.text)

    driver=webdriver.Firefox()
    driver.get(init_url)
    time.sleep(5)
    elem1=driver.find_element_by_link_text("Next Company")
    elem1.click()
    init_url=driver.current_url
    driver.quit()

    i+=1

However, this code keeps giving me this error:

Traceback (most recent call last):
  File "C:/Users/pc/Desktop/panda_try.py", line 28, in <module>
    elem1.click()
  File "C:\Users\pc\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 77, in click
    self._execute(Command.CLICK_ELEMENT)
  File "C:\Users\pc\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\pc\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "C:\Users\pc\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotVisibleException: Message: Element is not visible

How should I fix this? I am racing against time on this, so any help would be greatly appreciated. Thanks!

Answer 1

Multiple elements match the "link text" locator. Filter for the visible link, then click it:

for link in driver.find_elements_by_link_text("Next Company"):
    if link.is_displayed():
        link.click()
        break
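
The same filter can be wrapped in a small helper. This is only a minimal sketch: `first_visible` is a hypothetical name, not part of Selenium, and it assumes nothing more than that each element exposes `is_displayed()`, as Selenium's WebElement does.

```python
def first_visible(elements):
    """Return the first element whose is_displayed() is true, or None.

    Hypothetical helper, not part of Selenium; it relies only on the
    is_displayed() method that WebElement objects provide.
    """
    for element in elements:
        if element.is_displayed():
            return element
    return None
```

With it, the loop becomes `link = first_visible(driver.find_elements_by_link_text("Next Company"))`, followed by `link.click()` when `link` is not `None`.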

Alternatively, another approach that may work, and that also replaces the unreliable time.sleep(), is an Explicit Wait with the element_to_be_clickable expected condition:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(init_url)

wait = WebDriverWait(driver, 10)
link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "Next Company")))
link.click()
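
To see why an explicit wait is more reliable than a fixed time.sleep(), it may help to look at the polling idea behind it. The sketch below is not Selenium's implementation; `poll_until` and its parameters are hypothetical names used only for illustration. It calls a condition repeatedly and returns as soon as the result is truthy, so it waits only as long as needed and raises an error on timeout instead of silently continuing.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or timeout expires.

    Hypothetical illustration of the explicit-wait idea; not Selenium code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop waiting as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

A fixed sleep, by contrast, always waits the full duration and still gives no guarantee that the element is ready afterwards.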

Answer 2

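In this case I would use an XPath selector together with WebDriverWait for the element that has to be clicked. I also made a few changes, such as loading the browser only once, which makes the task run faster. In my case, selenium would not run with the latest Firefox, so I used an older selenium version (2.49) and Firefox 33, which is set through FirefoxBinary when loading the web driver.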

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

init_url = 'http://fortune.com/best-companies/google-alphabet-1/'
next_company_xpath = "//article[contains(@class, 'current')]//div[contains(@class, 'pagination')]//a[contains(.,'Next Company')]"

# Load the webdriver
driver = webdriver.Firefox(firefox_binary=FirefoxBinary('firefox/firefox'))
driver.set_window_size(1980, 1080)
driver.get(init_url)

i = 1
while i <= 4:
    page = urlopen(init_url)
    soup = BeautifulSoup(page, 'html.parser')
    first_table = soup.find('table', {"class": "company-data-table"})
    th1 = first_table.find('th', text='Industry')
    td1 = th1.findNext('td')
    print(td1.text)
    th2 = first_table.find('th', text='Type of organization')
    td2 = th2.findNext('td')
    print(td2.text)

    wait = WebDriverWait(driver, 10)
    link = wait.until(EC.element_to_be_clickable((By.XPATH, next_company_xpath)))
    link.click()
    init_url = driver.current_url

    i += 1

driver.close()
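
One possible refinement of the loop above, offered as a sketch rather than as part of the answer: wrap the work in try/finally so that the single browser instance is shut down even if a page fails to parse or the wait times out. `crawl`, `extract`, and `go_next` are hypothetical names; the driver is assumed to expose `current_url` and `quit()`, as Selenium's WebDriver does.

```python
def crawl(driver, extract, go_next, max_pages=4):
    """Visit up to max_pages pages with one driver, always quitting it.

    Hypothetical helper: extract(url) returns the scraped data for a page,
    and go_next(driver) clicks through to the next page.
    """
    results = []
    try:
        for page_number in range(max_pages):
            results.append(extract(driver.current_url))
            if page_number < max_pages - 1:
                go_next(driver)  # skip the click after the last page
    finally:
        driver.quit()  # runs even if extract() or go_next() raises
    return results
```

In terms of the code above, `extract` would hold the urlopen and BeautifulSoup steps, and `go_next` would hold the WebDriverWait and click.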

How do I make an element visible when web scraping?
