Question

Running Python 3.6.1 | Anaconda 4.4.0 (64-bit) on a Windows machine.

Using selenium, I collect the following HTML source:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = "https://nextgenstats.nfl.com/stats/receiving#yards"
driver = webdriver.Chrome(executable_path=r"C:/Program Files (x86)/Google/Chrome/chromedriver.exe")
driver.get(url)
htmlSource = driver.page_source

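If you visit the URL, you will see a nice table that is loaded dynamically. I'm not sure how to extract this table from htmlSource so that I can build a Pandas DataFrame from it.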

Answer 1

You're very close. You just need a little help from pandas. Here is what to do:

Load the source into BeautifulSoup
Find the table in question, using soup.find
Call pd.read_html

import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlSource, 'html.parser')
table = soup.find('div', class_='ngs-data-table')

df_list = pd.read_html(table.prettify())

df_list now contains every table found inside that div:

df_list[1].head()

                0    1   2    3    4     5      6   7    8      9     10  11
0    Antonio Brown  PIT  WR  4.3  2.6  13.7  45.32  99  160  61.88  1509   9
1  DeAndre Hopkins  HOU  WR  4.6  2.1  13.1  42.19  88  155  56.77  1232  11
2     Adam Thielen  MIN  WR  5.8  2.6  11.0  37.38  80  124  64.52  1161   4
3      Julio Jones  ATL  WR  5.2  2.4  14.2  43.34  73  118  61.86  1161   3
4     Keenan Allen  LAC  WR  5.4  2.6   9.5  31.30  83  129  64.34  1143   5
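
The frame above comes back with integer column labels (0 to 11), so you will probably want to name the columns yourself. A minimal sketch follows, using the first two rows from the output above as a stand-in for df_list[1]. The column names are my assumption about what each field means, not something taken from the page source, so check them against the header shown on the site.

```python
import pandas as pd

# First two rows copied from the df_list[1].head() output above;
# this small frame stands in for df_list[1].
rows = [
    ["Antonio Brown", "PIT", "WR", 4.3, 2.6, 13.7, 45.32, 99, 160, 61.88, 1509, 9],
    ["DeAndre Hopkins", "HOU", "WR", 4.6, 2.1, 13.1, 42.19, 88, 155, 56.77, 1232, 11],
]
df = pd.DataFrame(rows)

# Assumed (hypothetical) column names; verify against the site's table header.
columns = [
    "player", "team", "pos", "cush", "sep", "tay", "tay_pct",
    "rec", "tar", "catch_pct", "yds", "td",
]

# Guard against a silent mismatch if the site adds or removes a column.
if len(columns) != df.shape[1]:
    raise ValueError("expected %d columns, got %d" % (len(columns), df.shape[1]))

df.columns = columns
```

With real data you would replace the hand-built frame with df = df_list[1] and keep the rest unchanged.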

Answer 2

As a Scrapy user, I usually look at the XHR requests first. If you change the year on the site, you will see a call to this API endpoint: https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG

The API returns JSON, so it makes sense to parse the data with a JSON reader such as read_json.

Here is how to do it in the Scrapy shell:

$ scrapy shell

In [1]: fetch("https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG")
2017-12-15 13:11:30 [scrapy.core.engine] INFO: Spider opened
2017-12-15 13:11:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG> (referer: None)

In [2]: import pandas as pd

In [3]: data = pd.read_json(response.body)

In [4]: data.keys()
Out[4]: Index([u'season', u'seasonType', u'stats', u'threshold'], dtype='object')

In [5]: pd.DataFrame(list(data['stats']))

If you're not into scrapy, you can use requests instead:

import requests
import pandas as pd

url = "https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG"

response = requests.get(url)
data = pd.read_json(response.text)
df = pd.DataFrame(list(data['stats']))
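
An alternative you may find cleaner is to decode the JSON into a plain dict first (with requests, that would be response.json()) and hand only the "stats" list to pandas. The sketch below uses an inline sample payload so it runs offline. The top-level keys match the ones shown by data.keys() above, but the fields inside each record and the threshold value are invented for illustration and will differ from what the API returns.

```python
import pandas as pd

# Hypothetical payload. Top-level keys mirror data.keys() above; the
# fields inside each "stats" record are made up for this example.
payload = {
    "season": 2017,
    "seasonType": "REG",
    "threshold": 0,
    "stats": [
        {"playerName": "Player A", "yards": 100, "team": {"abbr": "AAA"}},
        {"playerName": "Player B", "yards": 80, "team": {"abbr": "BBB"}},
    ],
}

# json_normalize builds one row per record and flattens nested dicts
# into dotted column names such as "team.abbr".
df = pd.json_normalize(payload["stats"])
```

Compared with pd.DataFrame(list(data['stats'])), this also expands any nested objects into their own columns instead of leaving dicts in the cells. pd.json_normalize is available from pandas 1.0 onward.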

Convert a dynamically loaded table to a Pandas DataFrame
