I am trying to parse the tables on [this website][1] using Selenium. I am a beginner and I am struggling to get it working. Here is my code:

[1]: http://www.espncricinfo.com/rankings/content/page/211270.html
from bs4 import BeautifulSoup
import time
from selenium import webdriver
url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(len(soup.find_all("table")))
print(soup.find("table", {"class": "expanded_standings"}))
browser.close()
browser.quit()
I tried this, but I could not get anything out of the page. Any suggestions would be very helpful. Thanks.
Answer 0 (score: 0)
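It looks like the tables on that page are inside iframes. If you want to scrape one specific table, inspect it with your browser's developer tools (in Chrome, right-click and choose Inspect) and find the iframe element that wraps it. The iframe should have a src attribute holding the URL of the page that actually contains the table. You can then use an approach similar to the one you tried, but with the src URL instead.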
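One detail worth keeping in mind is that an iframe's src can be a relative path, so it may need to be resolved against the page URL before you request it. Below is a minimal sketch of that step. It uses only the standard library (html.parser and urljoin) rather than BeautifulSoup so that it runs without extra installs. The markup, the example.com URLs, and the names IframeCollector and iframe_urls are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the name and src attributes of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            attr_map = dict(attrs)
            # Ignore iframes that have no src at all.
            if attr_map.get("src"):
                self.frames.append((attr_map.get("name"), attr_map["src"]))


def iframe_urls(html, page_url):
    """Return (name, absolute_url) pairs for the iframes found in html."""
    collector = IframeCollector()
    collector.feed(html)
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return [(name, urljoin(page_url, src)) for name, src in collector.frames]


# Hypothetical markup, invented for illustration only.
sample_html = """
<html><body>
  <iframe name="testbat" src="/frames/test_batting.html"></iframe>
  <iframe name="testbowl" src="http://example.com/frames/test_bowling.html"></iframe>
</body></html>
"""

frames = iframe_urls(sample_html, "http://example.com/rankings/page.html")
for name, url in frames:
    print(name, url)
```

Each resulting URL can then be fetched and parsed on its own, in the same way as the outer page.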
Selenium can also "jump into" an iframe, provided you know how to locate the iframe in the page source.
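from selenium.webdriver.common.by import By
# "the_iframe_id" is a placeholder; use the id of the iframe you found in dev tools.
# find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4 removed.
frame = browser.find_element(By.ID, "the_iframe_id")
browser.switch_to.frame(frame)
html = browser.page_source
etc.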
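Answer 1 (score: 0)
The table you are after is inside an iframe. So, to get the data from that table, you need to switch to the iframe first and then do the rest. Here is one way to do it: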
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
wait = WebDriverWait(driver, 10)
# To target a different table, change the iframe name in the selector
# (and the index inside nth-of-type() if needed).
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[name='testbat']:nth-of-type(1)")))
# Skip the header row with [1:] and print the cell text of every other row.
for row in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr")))[1:]:
    # find_elements(By.CSS_SELECTOR, ...) replaces the removed find_elements_by_css_selector
    data = [item.text for item in row.find_elements(By.CSS_SELECTOR, "th,td")]
    print(data)
driver.quit()
In this case, though, the best approach is the one below. It does not use a browser emulator at all, only requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
soup = BeautifulSoup(res.text,"lxml")
# To target a different table, change the iframe name in the selector
# (and the [0] index if several iframes match).
item = soup.select("iframe[name='testbat']")[0]['src']
req = requests.get(item)
sauce = BeautifulSoup(req.text,"lxml")
for items in sauce.select("table tr"):
    data = [item.text for item in items.select("th,td")]
    print(data)
Partial results:
['Rank', 'Name', 'Country', 'Rating']
['1', 'S.P.D. Smith', 'AUS', '947']
['2', 'V. Kohli', 'IND', '912']
['3', 'J.E. Root', 'ENG', '881']
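If you would rather work with named fields than positional lists, the header row can be paired with each data row. This is only a rough sketch: rows_to_records is a hypothetical helper name, and the sample rows are copied from the partial output above rather than fetched live.

```python
def rows_to_records(rows):
    """Turn [header, row1, row2, ...] into a list of dicts keyed by the header."""
    if not rows:
        return []
    header, *body = rows
    # zip pairs each header cell with the cell in the same column.
    return [dict(zip(header, row)) for row in body]


# Sample rows copied from the partial results shown above.
rows = [
    ['Rank', 'Name', 'Country', 'Rating'],
    ['1', 'S.P.D. Smith', 'AUS', '947'],
    ['2', 'V. Kohli', 'IND', '912'],
    ['3', 'J.E. Root', 'ENG', '881'],
]

records = rows_to_records(rows)
for record in records:
    print(record['Name'], record['Rating'])
```

Note that the values are still strings at this point, so convert 'Rank' and 'Rating' with int() if you need to sort or compare them numerically.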