Running Python 3.6.1 | Anaconda 4.4.0 (64-bit) on a Windows machine.
Using selenium, I collect the following HTML source:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url = "https://nextgenstats.nfl.com/stats/receiving#yards"
driver = webdriver.Chrome(executable_path=r"C:/Program Files (x86)/Google/Chrome/chromedriver.exe")
driver.get(url)
htmlSource = driver.page_source
If you inspect the URL, you'll see a nicely formatted, dynamically loaded table. I'm not sure how to extract this table from htmlSource so that I can construct a pandas DataFrame from it.
Answer 0 (score: 3)
You're very close; you just need a little help from pandas here. Here's what you need to do: parse the source with BeautifulSoup, locate the table's container with soup.find, and hand the result to pd.read_html.
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlSource, 'html.parser')
table = soup.find('div', class_='ngs-data-table')
df_list = pd.read_html(table.prettify())
Now df_list contains a list of all the tables on that page:
df_list[1].head()
0 1 2 3 4 5 6 7 8 9 10 11
0 Antonio Brown PIT WR 4.3 2.6 13.7 45.32 99 160 61.88 1509 9
1 DeAndre Hopkins HOU WR 4.6 2.1 13.1 42.19 88 155 56.77 1232 11
2 Adam Thielen MIN WR 5.8 2.6 11.0 37.38 80 124 64.52 1161 4
3 Julio Jones ATL WR 5.2 2.4 14.2 43.34 73 118 61.86 1161 3
4 Keenan Allen LAC WR 5.4 2.6 9.5 31.30 83 129 64.34 1143 5
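Since pd.read_html returns the table with integer column labels (0–11), you may want to attach readable names before working with it. A minimal sketch, built from the sample rows above; the column names are assumptions based on what the NGS receiving page displays, not taken from the site itself:

```python
import pandas as pd

# Hypothetical column names for the NGS receiving table (an assumption,
# matching the order of values shown in the scraped output above).
columns = ["player", "team", "pos", "cushion", "separation",
           "intended_air_yards", "pct_share_air_yards", "receptions",
           "targets", "catch_pct", "yards", "tds"]

# Two sample rows taken from the output above, standing in for df_list[1].
rows = [
    ["Antonio Brown", "PIT", "WR", 4.3, 2.6, 13.7, 45.32, 99, 160, 61.88, 1509, 9],
    ["DeAndre Hopkins", "HOU", "WR", 4.6, 2.1, 13.1, 42.19, 88, 155, 56.77, 1232, 11],
]
df = pd.DataFrame(rows, columns=columns)

# With real data you would instead rename in place:
# df_list[1].columns = columns
print(df[["player", "yards"]])
```

With the real scraped DataFrame, assigning to df_list[1].columns achieves the same thing, provided the column count matches.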
Answer 1 (score: 2)
As a Scrapy user, I often look at the XHR requests. If you change the year on the site, you'll see an API call to https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG. The API returns JSON, so it makes sense to use a JSON parser such as read_json on the data.
Here's how you can use it from the Scrapy shell:
$ scrapy shell
In [1]: fetch("https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG")
2017-12-15 13:11:30 [scrapy.core.engine] INFO: Spider opened
2017-12-15 13:11:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG> (referer: None)
In [2]: import pandas as pd
In [3]: data = pd.read_json(response.body)
In [4]: data.keys()
Out[4]: Index([u'season', u'seasonType', u'stats', u'threshold'], dtype='object')
In [5]: pd.DataFrame(list(data['stats']))
If you're not set up with Scrapy, you can use requests instead:
import requests
import pandas as pd
url = "https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG"
response = requests.get(url)
data = pd.read_json(response.text)
df = pd.DataFrame(list(data['stats']))
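If the records in data['stats'] contain nested objects (for example, player details grouped under their own key), wrapping them in pd.DataFrame leaves those nested dicts as single columns. A sketch using pandas.json_normalize to flatten them; the field names here are assumptions for illustration, not taken from the API:

```python
import pandas as pd

# Hypothetical stats records with player info nested under "player"
# (an assumption about the API's shape, used only to illustrate flattening).
stats = [
    {"player": {"displayName": "Antonio Brown", "position": "WR"}, "yards": 1509},
    {"player": {"displayName": "DeAndre Hopkins", "position": "WR"}, "yards": 1232},
]

# json_normalize expands nested dicts into dotted column names,
# e.g. "player.displayName" and "player.position".
flat = pd.json_normalize(stats)
print(sorted(flat.columns))
# → ['player.displayName', 'player.position', 'yards']
```

Note that json_normalize was promoted to the top-level pandas namespace in pandas 0.25; on older versions it lives at pandas.io.json.json_normalize.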