Question

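I'm trying to scrape data from Fangraphs. The tables are split across 21 pages, but every page uses the same URL. I'm very new to web scraping (and to Python in general), but Fangraphs has no public API, so scraping the page seems to be my only option. I'm currently using BeautifulSoup to parse the HTML and can scrape the initial table, but that table contains only the first 30 players and I need the whole player pool. After two days of searching the web, I'm stuck. The link and my current code are below. I know they offer a link to download a CSV file, but that gets tedious over the course of a season, and I'd like to speed up the data collection process. Any direction would be helpful, thanks.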

https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0'

response = requests.get(url, verify=False)

# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')

# The line that defined stat_table was missing from the post. Assumption:
# it was a find_all() call; narrow the selector (e.g. by class) if the
# first table on the page is not the stats grid.
stat_table = soup.find_all('table')

# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]

# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        # No data cells in this row, so fall back to its header cells
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)

# convert table to df
table = pd.DataFrame(rows)
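
For reference, the row-collection loop above can be condensed into a small helper. This is only a sketch: `table_to_rows` and `SAMPLE_HTML` are hypothetical names, and the inline table holds made-up placeholder rows so the snippet runs without a network request. Unlike the loop above, it collects `td` and `th` cells together for every row, and it does not address the pagination problem.

```python
from bs4 import BeautifulSoup

# Placeholder HTML standing in for the downloaded page (not real Fangraphs data).
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Team</th><th>HR</th></tr>
  <tr><td>Player A</td><td>AAA</td><td>30</td></tr>
  <tr><td>Player B</td><td>BBB</td><td>25</td></tr>
</table>
"""


def table_to_rows(table):
    """Return one list of stripped cell texts per <tr> in the table."""
    rows = []
    for tr in table.find_all('tr'):
        # Collect header and data cells in document order.
        cells = tr.find_all(['td', 'th'])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = table_to_rows(soup.find('table'))

# Split the header row from the data rows.
header, body = rows[0], rows[1:]
```

With pandas available, `pd.DataFrame(body, columns=header)` would then give a frame with named columns.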

Answer 1

.
.
.
 "ngx-toastr": "^12.0.1",
.
.
.

Output: view-online

Scraping tables on multiple pages with a single URL
