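How do I parse table content from a website using Selenium?

Time: 2018-02-07 04:00:46

Tags: python python-3.x selenium parsing beautifulsoup

I am trying to parse a table from a sports website into a list of dictionaries so I can render it in a template. This is my first time using Selenium. I read through the Selenium documentation and wrote this program: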

from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()

browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

print(len(soup.find_all("table")))
print(soup.find("table", {"class": "ratingstable"}))

browser.close()
browser.quit()

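The two print calls give me 0 and None. How can I modify this to get all the values in the table and store them in a list of dictionaries? If you have any other questions, feel free to ask.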

1 Answer:

Answer 0: (score: 0)

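First, avoid time.sleep(). A fixed delay is either too short (the element is not there yet) or longer than necessary, and it goes against best practice. Use an explicit wait instead.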
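To make the difference concrete, an explicit wait is essentially a polling loop with a deadline: it returns as soon as the condition holds and gives up after a timeout. The following is only a sketch of that idea in plain Python, not Selenium's actual implementation; `wait_until` and its parameters are hypothetical names chosen for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Hypothetical helper that illustrates the idea behind an explicit wait.
    Returns the truthy value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)  # short pause between checks


# Example: a condition that only becomes true on the third check.
checks = {"count": 0}


def ready():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(ready, timeout=2, poll_interval=0.01))
```

Unlike time.sleep(3), this returns after however long the condition actually takes and reports a failure explicitly when the time limit is reached.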

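If you inspect the table, you can see that it sits inside an <iframe> tag with name="testbat". The page source of the top-level document does not include the frame's contents, which is why your search finds no table. You therefore have to switch to that frame to get the table's contents. This can be done with: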

browser.switch_to.default_content()
browser.switch_to.frame('testbat')

After switching frames, use the explicit wait mentioned above.

Complete code:

from bs4 import BeautifulSoup
from selenium import webdriver

# Add the following imports to your program
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)

browser.switch_to.default_content()
browser.switch_to.frame('testbat')

try:
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'ratingstable')))
except TimeoutException:
    pass  # Handle the time out exception

# find_element_by_class_name was removed in Selenium 4; use find_element with a By locator
html = browser.find_element(By.CLASS_NAME, 'ratingstable').get_attribute('innerHTML')
soup = BeautifulSoup(html, "lxml")

You can check that you have got the table:

>>> print('S.P.D. Smith' in html)
True
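To get from the table markup to the list of dictionaries asked for in the question, one possible approach is to use the first row as keys and zip it with each following row. This is a minimal sketch under assumptions: the helper name `table_to_dicts` is made up, the first row is assumed to hold the column names, and the sample markup below is invented for illustration and is not the real page content. It uses the built-in "html.parser" backend so that it runs without lxml.

```python
from bs4 import BeautifulSoup


def table_to_dicts(table_html):
    """Convert table markup into a list of dicts keyed by the header row.

    Hypothetical helper. Assumes the first <tr> contains the column names.
    """
    soup = BeautifulSoup(table_html, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        return []
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if len(cells) != len(headers):
            continue  # skip spacer or malformed rows
        records.append(dict(zip(headers, cells)))
    return records


# Invented sample markup, standing in for the table fetched with Selenium.
sample = """
<table class="ratingstable">
  <tr><th>Rank</th><th>Name</th><th>Country</th><th>Rating</th></tr>
  <tr><td>1</td><td>Player A</td><td>AUS</td><td>900</td></tr>
  <tr><td>2</td><td>Player B</td><td>IND</td><td>880</td></tr>
</table>
"""

for record in table_to_dicts(sample):
    print(record)
```

With the code from the answer, you would pass the `html` string to `table_to_dicts(html)`. The real table may have a different header layout or extra rows, so the header handling may need adjusting once you inspect its markup.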