I'm trying to scrape a player's statistics from the Totals table at this link: http://www.basketball-reference.com/players/j/jordami01.html. The data is much harder to scrape in the form it first appears on the site, so there is an option to click "CSV" right above the table, which converts it to a format that's much easier to digest.
This is where I'm running into trouble:
import urllib2
from bs4 import BeautifulSoup
from selenium import webdriver
player_link = "http://www.basketball-reference.com/players/j/jordami01.html"
browser = webdriver.Firefox()
browser.get(player_link)
elem = browser.find_element_by_xpath("//span[@class='tooltip' and @onlick='table2csv('totals')']")
elem.click()
When I run this, a Firefox window pops up, but the code never changes the table from its original format to CSV. After clicking CSV (manually), the CSV table only appears in the page source. How can I get selenium to click that CSV button and then use BS to scrape the data?
Answer 0 (score: 2)
You don't need BeautifulSoup here. Click the CSV button with selenium, extract the contents of the pre element that appears containing the CSV data, and parse it with the built-in csv module:
import csv
from StringIO import StringIO
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
player_link = "http://www.basketball-reference.com/players/j/jordami01.html"
browser = webdriver.Firefox()
wait = WebDriverWait(browser, 10)
browser.set_page_load_timeout(10)
# stop load after a timeout
try:
    browser.get(player_link)
except TimeoutException:
    browser.execute_script("window.stop();")
# click "CSV"
elem = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='table_heading']//span[. = 'CSV']")))
elem.click()
# get CSV data
csv_data = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "pre#csv_totals"))).text.encode("utf-8")
browser.close()
# read CSV
reader = csv.reader(StringIO(csv_data))
for line in reader:
    print(line)
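Note that the answer above targets Python 2. If you are on Python 3, StringIO has moved to the io module, and elem.text is already a str, so the .encode("utf-8") call should be dropped. The CSV-parsing step would then look like the sketch below, where a hard-coded sample string stands in for the scraped data (the sample rows are illustrative, not real scraped output):

```python
import csv
from io import StringIO  # Python 3 location of StringIO

# In Python 3, elem.text is already a str, so no .encode() is needed.
# A couple of sample lines stand in for the CSV text scraped by selenium.
csv_data = "Season,Age,Tm,G\n1984-85,21,CHI,82"

reader = csv.reader(StringIO(csv_data))
for line in reader:
    print(line)
```

Each line comes back as a list of column values, e.g. the header row is ['Season', 'Age', 'Tm', 'G'].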