Click a button, then scrape data from a seemingly static web page?

Time: 2016-07-20 01:18:39

Tags: python selenium

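I'm trying to scrape the player statistics in the Totals table from this link: http://www.basketball-reference.com/players/j/jordami01.html. The data is much harder to scrape in the form it takes when you first land on the page, so you have the option of clicking "CSV" right above the table. That format is much easier to digest.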

I'm having trouble with the following code:

import urllib2
from bs4 import BeautifulSoup
from selenium import webdriver

player_link = "http://www.basketball-reference.com/players/j/jordami01.html"

browser = webdriver.Firefox()
browser.get(player_link)
elem = browser.find_element_by_xpath("//span[@class='tooltip' and @onlick='table2csv('totals')']")
elem.click()

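When I run this, a Firefox window pops up, but the code never changes the table from its original format to CSV. The CSV table only appears in the page source after CSV is clicked (obviously). How do I get selenium to click that CSV button and then use BS to scrape the data?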

1 Answer:

Answer 0: (score: 2)

You don't need BeautifulSoup here. Click the CSV button with selenium, extract the contents of the pre element that appears with the CSV data, and parse it with the built-in csv module:

import csv
from StringIO import StringIO

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

player_link = "http://www.basketball-reference.com/players/j/jordami01.html"

browser = webdriver.Firefox()
wait = WebDriverWait(browser, 10)
browser.set_page_load_timeout(10)

# stop load after a timeout
try:
    browser.get(player_link)
except TimeoutException:
    browser.execute_script("window.stop();")

# click "CSV"
elem = wait.until(EC.presence_of_element_located((By.XPATH,  "//div[@class='table_heading']//span[. = 'CSV']")))
elem.click()

# get CSV data
csv_data = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "pre#csv_totals"))).text.encode("utf-8")
browser.close()

# read CSV
reader = csv.reader(StringIO(csv_data))
for line in reader:
    print(line)
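
The parsing step can also be tried on its own, without a browser. The following is only a minimal sketch, assuming Python 3 (where StringIO lives in the io module) and assuming the first row of the CSV text is a header. The function name parse_stats_csv and the sample text are hypothetical placeholders, not real data from the site:

```python
import csv
from io import StringIO


def parse_stats_csv(csv_text):
    """Parse CSV text whose first row is a header into a list of dicts."""
    reader = csv.DictReader(StringIO(csv_text))
    return [dict(row) for row in reader]


# Placeholder text standing in for the contents of the pre element;
# these values are made up for illustration and are not real statistics.
sample = "Season,Tm,G\nYear1,AAA,10\nYear2,BBB,20\n"

rows = parse_stats_csv(sample)
for row in rows:
    # Each row maps a column name to its value as a string
    print(row["Season"], row["G"])
```

csv.DictReader keys each row by column name, which tends to be more convenient than positional indexes when the table has many columns. All values come back as strings, so numeric columns would still need converting.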