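I am trying to scrape data from a table on this site: [http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793]
The site has several tabs that change the HTML (I am using the 'Matchup' tab). Within that Matchup tab there is a drop-down menu that changes the data table I am trying to access. The items in the table I want are 'li' tags inside an unordered list. I only want to scrape the data for the "Overall" category of the drop-down menu.
I cannot get to the data I want. The items I try to access come back as 'NoneType'. Is there a way to do this?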
import requests
from bs4 import BeautifulSoup

url = "http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793"
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
dataList = []
# Collect the text of every <li> inside the team-stats list.
for ultag in soup.find_all('ul', {'class': 'base-list team-stats'}):
    print(ultag)
    for iltag in ultag.find_all('li'):
        dataList.append(iltag.get_text())
Answer 0 (score: 0)
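The problem is that the content of the tab you are trying to extract data from is loaded dynamically with React JS, so it is not present in the HTML that requests downloads. You therefore have to use the selenium module in Python to open a browser, click the "Matchup" list element programmatically, and then grab the page source after the click.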
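To illustrate why the original attempt comes back empty, here is a minimal sketch that runs the same `find_all` lookup against two HTML strings. Both strings are hypothetical stand-ins (the real page markup is not reproduced here): one imitates the page before React has rendered the tab, the other imitates it afterwards. The helper name `extract_team_stats` is also made up for this example.

```python
from bs4 import BeautifulSoup


def extract_team_stats(html):
    """Return the text of every <li> inside <ul class="base-list team-stats">."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for ul_tag in soup.find_all("ul", {"class": "base-list team-stats"}):
        for li_tag in ul_tag.find_all("li"):
            items.append(li_tag.get_text(strip=True))
    return items


# Hypothetical markup: what a plain HTTP download might contain before any script runs.
static_html = '<div id="react-root"></div>'

# Hypothetical markup: the same container after the tab content has been rendered.
rendered_html = (
    '<div id="react-root"><ul class="base-list team-stats">'
    "<li>Points per game</li><li>Rebounds per game</li></ul></div>"
)

print(extract_team_stats(static_html))    # the list is missing, so nothing is collected
print(extract_team_stats(rendered_html))  # the list is present, so its items are collected
```

With `find_all` a missing list simply yields no items. With `find`, the same situation returns `None`, and calling a method on that result is the likely source of the 'NoneType' error.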
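On my Mac, I installed selenium and chromedriver using these instructions:
https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f
I then signed the Python file so that the OS X firewall does not complain when the script runs, following the instructions in: Add Python to OS X Firewall Options?
Then I ran the following Python 3 code: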
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
# Set up the Selenium Chrome web driver.
# Passing the path positionally is the Selenium 3 style; newer Selenium 4
# releases take a Service object instead.
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# Navigate in Chrome to the specified page.
driver.get("http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793")
# Find the Matchup list element using a CSS selector and click it.
driver.find_element(By.CSS_SELECTOR, "li[id='react-tabs-0']").click()
# Wait for the content to load.
time.sleep(1)
# Get the current page source.
source = driver.page_source
# Parse the source of the page after the click, using "html.parser" as the parser.
soupify = soup(source, 'html.parser')
dataList = []
for ultag in soupify.find_all('ul', {'class': 'base-list team-stats'}):
    print(ultag)
    for iltag in ultag.find_all('li'):
        dataList.append(iltag.get_text())
# We are done with the driver, so quit.
driver.quit()
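The fixed `time.sleep(1)` in the script above can be too short on a slow connection and wasteful on a fast one. As a rough sketch of an alternative, the helper below polls a source-returning callable until some marker text shows up or a timeout expires. The name `wait_for_markup`, its parameters, and the choice of `"team-stats"` as the marker are all assumptions made for this example, not part of the original answer, and the demo uses canned strings instead of a real browser.

```python
import time


def wait_for_markup(get_source, marker, timeout=10.0, poll_interval=0.25):
    """Call get_source() repeatedly until marker appears in its result.

    Returns the first source string that contains marker, or raises
    TimeoutError if the deadline passes without a match.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} did not appear within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that only contains the list on the third poll.
pages = iter([
    "<div></div>",
    "<div></div>",
    '<ul class="base-list team-stats"><li>Points per game</li></ul>',
])
source = wait_for_markup(lambda: next(pages), "team-stats", timeout=5.0, poll_interval=0.01)
print(source)
```

In the Selenium script, something like `source = wait_for_markup(lambda: driver.page_source, "team-stats")` could stand in for the sleep and the `page_source` line, assuming that class name only appears once the tab has rendered. Selenium also ships its own `WebDriverWait` helper for this kind of polling, which is probably the better fit if you are already depending on the library.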
Hope this helps. I noticed this is similar to a problem I just solved - Beautifulsoup doesn't reach a child element