How do I parse table data from a website using Selenium?

Date: 2018-02-07 05:35:53

Tags: python python-3.x selenium parsing web-scraping

I am trying to parse a table from a [website][1] using Selenium, as I am a beginner. Here is the code I have been working on:

[1]: http://www.espncricinfo.com/rankings/content/page/211270.html

from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()

browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

print(len(soup.find_all("table")))
print(soup.find("table", {"class": "expanded_standings"}))

browser.close()
browser.quit()

I have tried this, but I can't get anything out of it. Any suggestions would be very helpful, thanks.

2 Answers:

Answer 0 (score: 0):

It looks like the tables on that page are inside an iframe. If you want to scrape a specific table, try inspecting it with your browser's dev tools (right-click, Inspect Element in Chrome) and find the iframe element that wraps it. The iframe should have a src attribute containing the URL of the page that actually holds the table. You can then use an approach similar to the one you tried, just with that src URL instead.
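For example, here is a minimal sketch of that idea, reusing the Selenium + BeautifulSoup setup from the question. The iframe name 'testbat' is taken from the second answer below; verify the actual name or id in your dev tools, and you may still need a short wait for the iframe to load:

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("http://www.espncricinfo.com/rankings/content/page/211270.html")

# find the iframe and read the URL of the page it embeds
# (the name 'testbat' is an assumption taken from the answer below)
iframe = browser.find_element_by_css_selector("iframe[name='testbat']")
src = iframe.get_attribute("src")

# load that page directly and parse its table
browser.get(src)
soup = BeautifulSoup(browser.page_source, "lxml")
for row in soup.select("table tr"):
    print([cell.get_text(strip=True) for cell in row.select("th,td")])

browser.quit()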

Selenium can also "jump into" an iframe, if you know how to find the iframe in the page's source code:

frame = browser.find_element_by_id("the_iframe_id")
browser.switch_to.frame(frame)
html = browser.page_source
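Once you have switched into the frame, browser.page_source returns the iframe's own document, so you can feed that html into BeautifulSoup exactly as in your original code. When you are done, browser.switch_to.default_content() switches back to the main page.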

Answer 1 (score: 0):

The table you are after is inside an iframe. So, to get data from that table you need to switch to that iframe first and then do the rest. Here is one way you could do it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
wait = WebDriverWait(driver, 10)
# if you want a different table, change the index inside nth-of-type()
# and the name in the selector
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[name='testbat']:nth-of-type(1)")))
for table in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr")))[1:]:
    data = [item.text for item in table.find_elements_by_css_selector("th,td")]
    print(data)
driver.quit()
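Note that the [1:] slice skips the first row of the table (the column headers); drop it if you want the header row as well.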

However, the best approach in a case like this is the one below, which does not use a browser simulator at all, only requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
soup = BeautifulSoup(res.text,"lxml")
# if you want a different table, change the index number
# and the name in the selector
item = soup.select("iframe[name='testbat']")[0]['src']
req = requests.get(item)
sauce = BeautifulSoup(req.text,"lxml")
for items in sauce.select("table tr"):
    data = [item.text for item in items.select("th,td")]
    print(data)

Partial results:

['Rank', 'Name', 'Country', 'Rating']
['1', 'S.P.D. Smith', 'AUS', '947']
['2', 'V. Kohli', 'IND', '912']
['3', 'J.E. Root', 'ENG', '881']