Collecting data from the page body that is not visible on initial load

Time: 2019-06-19 19:03:01

Tags: python selenium web-scraping beautifulsoup

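I am trying to scrape data from this website using Beautiful Soup. If you scroll down to the individual plays section and click "Share & more > Get table as CSV", a CSV version of the table's data is displayed. If I inspect this CSV text, I see that it sits in a <pre> tag with the id "csv_all_plays".

I am trying to scrape this data with the Python package beautifulsoup. What I am currently doing is: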

from selenium import webdriver
from bs4 import BeautifulSoup

nfl_url = '...'  # the URL I have linked above
driver = webdriver.Chrome(executable_path=r'C:/path/to/chrome/driver')
driver.get(nfl_url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id="csv_all_plays"))

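This just prints None. I understand that because this data is not displayed when the page loads, I cannot use the Requests package and have to use something that can actually fetch the entire page source (I am using Selenium). Isn't that what I am doing here? Is there another reason I cannot get the CSV data?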
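
One way to picture the symptom: driver.page_source is a snapshot of the page at the moment it is read, so an element that is only added after a click will not be found in an earlier snapshot. The following is a minimal, stdlib-only sketch of that idea. The two HTML strings are hypothetical stand-ins, not the real page markup, and has_id is an illustrative helper rather than part of Selenium or Beautiful Soup.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_id(html, wanted_id):
    """Return True if the HTML snapshot contains an element with this id."""
    finder = IdFinder(wanted_id)
    finder.feed(html)
    return finder.found


# Hypothetical snapshots: one taken before the CSV button is clicked,
# one taken after the page has added the <pre> element.
before_click = '<div id="all_all_plays"><table></table></div>'
after_click = before_click + '<pre id="csv_all_plays">Date,Tm,Opp</pre>'

print(has_id(before_click, "csv_all_plays"))
print(has_id(after_click, "csv_all_plays"))
```

Under this assumption, the lookup only succeeds on a snapshot taken after the click, which is consistent with the code in the question printing None.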

2 Answers:

Answer 0: (score: 2)

You can use selenium to hover over the "Share & more" link to reveal the menu, and from there click "Get table as csv":

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from bs4 import BeautifulSoup as soup
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.pro-football-reference.com/play-index/play_finder.cgi?request=1&match=summary_all&year_min=2018&year_max=2018&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&quarter%5B%5D=4&minutes_max=15&seconds_max=00&minutes_min=00&seconds_min=00&down%5B%5D=0&down%5B%5D=1&down%5B%5D=2&down%5B%5D=3&down%5B%5D=4&field_pos_min_field=team&field_pos_max_field=team&end_field_pos_min_field=team&end_field_pos_max_field=team&type%5B%5D=PUNT&no_play=N&turnover_type%5B%5D=interception&turnover_type%5B%5D=fumble&score_type%5B%5D=touchdown&score_type%5B%5D=field_goal&score_type%5B%5D=safety&rush_direction%5B%5D=LE&rush_direction%5B%5D=LT&rush_direction%5B%5D=LG&rush_direction%5B%5D=M&rush_direction%5B%5D=RG&rush_direction%5B%5D=RT&rush_direction%5B%5D=RE&pass_location%5B%5D=SL&pass_location%5B%5D=SM&pass_location%5B%5D=SR&pass_location%5B%5D=DL&pass_location%5B%5D=DM&pass_location%5B%5D=DR&order_by=yards')
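# Scroll the plays table into view
scroll = ActionChains(d).move_to_element(d.find_element_by_id('all_all_plays'))
scroll.perform()
# Hover over the last "Share & more" span to reveal its menu
spans = [i for i in d.find_elements_by_tag_name('span') if 'Share & more' in i.text]
hover = ActionChains(d).move_to_element(spans[-1])
hover.perform()
# Send Enter to the "Get table as CSV" button to render the CSV text
b = [i for i in d.find_elements_by_tag_name('button') if 'get table as csv' in i.text.lower()][0]
b.send_keys('\n')
# Re-read the page source and extract the text of <pre id="csv_all_plays">
csv_data = soup(d.page_source, 'html.parser').find('pre', {'id':'csv_all_plays'}).text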

Output (shortened due to SO's character limit):

"\nDate,Tm,Opp,Quarter,Time,Down,ToGo,Location,Score,Detail,Yds,EPB,EPA,Diff,PYds,PRYds\n2018-09-09,Texans,Patriots,4,4:41,4,8,HTX 36,13-27,Trevor Daniel punts 47 yards muffed catch by Riley McCarron recovered by Johnson Bademosi and returned for no gain,0,-0.980,4.510,5.49,47,\n2018-09-09,Jaguars,Giants,4,0:54,4,6,JAX 40,20-15,Logan Cooke punts 41 yards muffed catch by Kaelin Clay recovered by Donald Payne and returned for no gain,0,-0.720,4.170,4.89,41,\n2018-09-09,Chiefs,Chargers,4,10:35,4,6,KAN 27,31-20,Dustin Colquitt punts 59 yards returned by JJ Jones for no gain. JJ Jones fumbles (forced by De'Anthony Thomas) recovered by James Winchester at LAC-2,0,-1.570,6.740,8.31,59,\n2018-09-23,Dolphins,Raiders,4,12:33,4,5,MIA 39,14-17,Matt Haack punts 42 yards muffed catch by Jordy Nelson recovered by Jordy Nelson and returned for no gain,0,-0.780,0.060,.84,42,\n2018-09-30,Jets,Jaguars,4,8:59,4,10,NYJ 14,12-25,Lac Edwards punts 46 yards muffed catch by Jaydon Mickens ball out of bounds at JAX-41,0,-2.470,-1.660,.81,46,\n2018-10-11,Giants,Eagles,4,12:27,4,17,NYG 33,13-34,Riley Dixon punts 50 yards muffed catch by DeAndre Carter recovered by DeAndre Carter and returned for no gain,0,-1.180,-0.040,1.14,50,\n2018-10-28,Jets,Bears,4,5:37,4,13,NYJ 37,10-24,Lac Edwards punts 48 yards muffed catch by Tarik Cohen recovered by Tarik Cohen and returned for no gain,0,-0.910,0.320,1.23,48,\n2018-11-25,Vikings,Packers,4,6:00,4,13,GNB 37,24-14,Matt Wile punts 21 yards muffed catch by Tramon Williams recovered by Marcus Sherels and returned for no gain,0,0.790,4.580,3.79,21,\n2018-12-13,Chiefs,Chargers,4,2:47,4,15,KAN 6,28-21,Dustin Colquitt punts 55 yards muffed catch by Desmond King recovered by Desmond King and returned for no gain,0,-2.490,-1.600,.89,55,

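To write the csv data to a file:

import csv
# newline='' lets the csv module control line endings itself
with open('individual_stats.csv', 'w', newline='') as f:
  write = csv.writer(f)
  # filter(None, ...) skips blank lines and also drops empty fields within a row
  write.writerows([list(filter(None, i.split(','))) for i in filter(None, csv_data.split('\n'))])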

Output (first 16 lines):

Date,Tm,Opp,Quarter,Time,Down,ToGo,Location,Score,Detail,Yds,EPB,EPA,Diff,PYds,PRYds
2018-09-09,Texans,Patriots,4,4:41,4,8,HTX 36,13-27,Trevor Daniel punts 47 yards muffed catch by Riley McCarron recovered by Johnson Bademosi and returned for no gain,0,-0.980,4.510,5.49,47
2018-09-09,Jaguars,Giants,4,0:54,4,6,JAX 40,20-15,Logan Cooke punts 41 yards muffed catch by Kaelin Clay recovered by Donald Payne and returned for no gain,0,-0.720,4.170,4.89,41
2018-09-09,Chiefs,Chargers,4,10:35,4,6,KAN 27,31-20,Dustin Colquitt punts 59 yards returned by JJ Jones for no gain. JJ Jones fumbles (forced by De'Anthony Thomas) recovered by James Winchester at LAC-2,0,-1.570,6.740,8.31,59
2018-09-23,Dolphins,Raiders,4,12:33,4,5,MIA 39,14-17,Matt Haack punts 42 yards muffed catch by Jordy Nelson recovered by Jordy Nelson and returned for no gain,0,-0.780,0.060,.84,42
2018-09-30,Jets,Jaguars,4,8:59,4,10,NYJ 14,12-25,Lac Edwards punts 46 yards muffed catch by Jaydon Mickens ball out of bounds at JAX-41,0,-2.470,-1.660,.81,46
2018-10-11,Giants,Eagles,4,12:27,4,17,NYG 33,13-34,Riley Dixon punts 50 yards muffed catch by DeAndre Carter recovered by DeAndre Carter and returned for no gain,0,-1.180,-0.040,1.14,50
2018-10-28,Jets,Bears,4,5:37,4,13,NYJ 37,10-24,Lac Edwards punts 48 yards muffed catch by Tarik Cohen recovered by Tarik Cohen and returned for no gain,0,-0.910,0.320,1.23,48
2018-11-25,Vikings,Packers,4,6:00,4,13,GNB 37,24-14,Matt Wile punts 21 yards muffed catch by Tramon Williams recovered by Marcus Sherels and returned for no gain,0,0.790,4.580,3.79,21
2018-12-13,Chiefs,Chargers,4,2:47,4,15,KAN 6,28-21,Dustin Colquitt punts 55 yards muffed catch by Desmond King recovered by Desmond King and returned for no gain,0,-2.490,-1.600,.89,55
2018-12-16,Bears,Packers,4,2:51,4,6,CHI 12,24-14,Pat O'Donnell punts 51 yards muffed catch by Josh Jackson recovered by Josh Jackson and returned for no gain,0,-2.490,-1.660,.83,51
2018-12-16,Eagles,Rams,4,3:03,4,12,PHI 15,30-23,Cameron Johnston punts 52 yards returned by Jojo Natson for 3 yards. Jojo Natson fumbles recovered by D.J. Alexander at LAR-36,0,-2.440,3.180,5.62,52,3
2018-12-02,Giants,Bears,4,12:46,4,18,NYG 12,24-14,Riley Dixon punts 53 yards returned by Tarik Cohen for 8 yards (tackle by Rhett Ellison). Tarik Cohen fumbles (forced by Rhett Ellison) recovered by Tarik Cohen at CHI-45. Penalty on Josh Bellamy: Illegal Block Above the Waist 10 yards,-2,-2.490,-0.670,1.82,53,8
2018-11-25,Jaguars,Bills,4,13:33,4,25,JAX 15,14-21,Logan Cooke punts 55 yards returned by Isaiah McKenzie for 9 yards (tackle by Jarrod Wilson). Isaiah McKenzie fumbles (forced by Jarrod Wilson) recovered by Isaiah McKenzie at BUF-43. Penalty on Marcus Murphy: Illegal Block Above the Waist 10 yards,-4,-2.440,-0.670,1.77,55,9
2018-09-06,Eagles,Falcons,4,7:42,4,14,PHI 21,10-12,Cameron Johnston punts 46 yards out of bounds,-1.960,-1.140,.82,46
2018-09-06,Falcons,Eagles,4,5:04,4,14,ATL 29,12-10,Matthew Bosher punts 52 yards returned by Darren Sproles for 12 yards (tackle by Eric Saubert). Penalty on Eric Saubert: Face Mask (15 Yards) 15 yards,-1.440,-1.990,-0.55,52,12
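
Note that the rows in the output above do not all have the same number of fields: the "out of bounds" punt row, for example, jumps straight from the Detail text to the EPB value. This follows from filter(None, i.split(',')), which discards empty fields, so later values shift left under the wrong headers. A minimal alternative sketch, assuming column alignment matters, is to let the csv module parse the text and keep empty fields. The sample string below is a stand-in for csv_data: its header and first row are copied from the output above, while the second row is reconstructed under the assumption that its Yds and PRYds fields were empty on the page. parse_rows and write_rows are illustrative helper names.

```python
import csv
import io

# Stand-in for the scraped csv_data string (see the assumptions stated above).
csv_data = (
    "\n"
    "Date,Tm,Opp,Quarter,Time,Down,ToGo,Location,Score,Detail,Yds,EPB,EPA,Diff,PYds,PRYds\n"
    "2018-09-09,Texans,Patriots,4,4:41,4,8,HTX 36,13-27,Trevor Daniel punts 47 yards "
    "muffed catch by Riley McCarron recovered by Johnson Bademosi and returned for no gain,"
    "0,-0.980,4.510,5.49,47,\n"
    # Assumed raw form of a row whose Yds and PRYds fields are empty
    "2018-09-06,Eagles,Falcons,4,7:42,4,14,PHI 21,10-12,"
    "Cameron Johnston punts 46 yards out of bounds,,-1.960,-1.140,.82,46,\n"
)


def parse_rows(text):
    """Parse CSV text into rows, skipping blank lines but keeping empty fields."""
    reader = csv.reader(io.StringIO(text))
    # csv.reader yields an empty list for a blank line, so `if row` drops those
    return [row for row in reader if row]


def write_rows(rows, path):
    """Write rows to a CSV file; newline='' is what the csv module expects."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


rows = parse_rows(csv_data)
# Every row keeps the same number of fields as the header
print({len(row) for row in rows})
```

Calling write_rows(rows, 'individual_stats.csv') would then produce a file in which every row has one value per header column, with empty strings where the page had no value.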

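Answer 1: (score: 0)

You can just use pandas:

import pandas as pd

# read_html returns a list of DataFrames, one per table it parses; [4] selects the fifth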
table = pd.read_html('https://www.pro-football-reference.com/play-index/play_finder.cgi?request=1&match=summary_all&year_min=2018&year_max=2018&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&quarter%5B%5D=4&minutes_max=15&seconds_max=00&minutes_min=00&seconds_min=00&down%5B%5D=0&down%5B%5D=1&down%5B%5D=2&down%5B%5D=3&down%5B%5D=4&field_pos_min_field=team&field_pos_max_field=team&end_field_pos_min_field=team&end_field_pos_max_field=team&type%5B%5D=PUNT&no_play=N&turnover_type%5B%5D=interception&turnover_type%5B%5D=fumble&score_type%5B%5D=touchdown&score_type%5B%5D=field_goal&score_type%5B%5D=safety&rush_direction%5B%5D=LE&rush_direction%5B%5D=LT&rush_direction%5B%5D=LG&rush_direction%5B%5D=M&rush_direction%5B%5D=RG&rush_direction%5B%5D=RT&rush_direction%5B%5D=RE&pass_location%5B%5D=SL&pass_location%5B%5D=SM&pass_location%5B%5D=SR&pass_location%5B%5D=DL&pass_location%5B%5D=DM&pass_location%5B%5D=DR&order_by=yards')[4]
table.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)