Scraping 'li' tags from a data table that changes based on a dropdown menu

Asked: 2017-08-25 23:58:01

Tags: python html web-scraping

I am trying to scrape data from the data table on this website: [http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793]

The site has multiple tabs that change the HTML (I am working with the 'Matchup' tab). Within that Matchup tab there is a dropdown menu that changes the data table I am trying to access. The items I want are 'li' tags inside an unordered list. I only want to scrape the data from the "Overall" category of the dropdown menu.

I have not been able to reach the data I want; the items I try to access come back as NoneType. Is there a way to do this?

import requests
from bs4 import BeautifulSoup

url = "http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793"
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')

dataList = []
for ultag in soup.find_all('ul', {'class': 'base-list team-stats'}):
    print(ultag)
    for iltag in ultag.find_all('li'):
        dataList.append(iltag.get_text())

1 Answer:

Answer 0 (score: 0)

The problem is that the content of the tab you are trying to extract data from is loaded dynamically with React JS. You therefore have to use the selenium module in Python to open a browser, programmatically click the "Matchup" list element, and then grab the page source after the click.
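You can verify this yourself: the target `ul` is simply absent from the static HTML that `requests` returns, which is why `find()` comes back with `None`. A minimal stand-in sketch (the inline HTML below is a made-up stand-in for the page's static source, not the real markup):

```python
from bs4 import BeautifulSoup

# Stand-in for the static page source: the React tab container exists,
# but the stats list it would hold has not been rendered yet.
static_html = "<html><body><li id='react-tabs-0'></li></body></html>"
soup = BeautifulSoup(static_html, "html.parser")

# find() returns None because the <ul> only exists after JavaScript runs.
result = soup.find("ul", {"class": "base-list team-stats"})
print(result)  # None
```

This is the same `None` the question ran into: BeautifulSoup can only see what was in the HTTP response, never what JavaScript adds afterwards.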

On my Mac, I installed selenium and chromedriver using these instructions:

https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f

Then I signed the python file so the OS X firewall would not complain when trying to run it, following these instructions: Add Python to OS X Firewall Options?

Then run the following python3 code:

import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup

# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# Navigate in Chrome to specified page.
driver.get("http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793")

# Find the Matchup tab element using a CSS selector and click it.
driver.find_element_by_css_selector("li[id='react-tabs-0']").click()

# Wait for content to load
time.sleep(1)

# Get the current page source.
source = driver.page_source

# Parse into soup() the source of the page after the link is clicked and use "html.parser" as the Parser.
soupify = soup(source, 'html.parser')

dataList = []
for ultag in soupify.find_all('ul', {'class': 'base-list team-stats'}):
    print(ultag)
    for iltag in ultag.find_all('li'):
        dataList.append(iltag.get_text())

# We are done with the driver so quit.
driver.quit()
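Once dataList has been filled, the flat list of li texts can be paired back into rows. This sketch assumes the table renders each stat as a label li followed by a value li (an assumption about the page layout; the sample values below are made up):

```python
# Hypothetical sample of what dataList might hold after the scrape:
# alternating label and value entries.
dataList = ["Points Per Game", "72.5", "Rebounds Per Game", "38.1"]

# Pair every even-indexed label with the odd-indexed value that follows it.
stats = dict(zip(dataList[::2], dataList[1::2]))
print(stats)  # {'Points Per Game': '72.5', 'Rebounds Per Game': '38.1'}
```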

Hope this helps; I noticed this is similar to a problem I just solved - Beautifulsoup doesn't reach a child element