I am scraping data from a list of hundreds of URLs, each of which contains a table of baseball statistics. For each unique URL in the list, there is a table showing every season of a single baseball player's career, like this one:
https://www.baseball-reference.com/players/k/killeha01.shtml
I have successfully created a script that appends the data from a single URL to a single list/DataFrame. However, here is my problem:
How should I adjust my code so that it scrapes the full list of hundreds of URLs from this domain, and then appends the table rows from all of the URLs into a single list/DataFrame?
The general format of my scraping code for a single URL is as follows:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_baseball_players = ['https://www.baseball-reference.com/players/k/killeha01.shtml']

def scrape_baseball_data(url_parameter):
    html = urlopen(url_parameter)
    # create the BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")
    column_headers = [SCRAPING COMMAND WITH CSS SELECTOR GADGET FOR GETTING COLUMN HEADERS]
    table_rows = soup.select(SCRAPING COMMAND WITH CSS SELECTOR GADGET FOR GETTING ALL OF THE DATA FROM THE TABLES INCLUDING HTML CHARACTERS)

    player_data = []
    for row in table_rows:
        player_list = [COMMANDS FOR SCRAPING HTML DATA FROM THE TABLES INTO AN ORGANIZED LIST]
        if not player_list:
            continue
        player_data.append(player_list)
    return player_data

# pass a single URL string, not the whole list (urlopen expects a string)
list_baseball_player_data = scrape_baseball_data(url_baseball_players[0])
df_baseball_player_data = pd.DataFrame(list_baseball_player_data)
Answer (score: 3):

If url_baseball_players is the list of all the URLs you want to scrape, and your expected output is one DataFrame (with the rows from each new URL appended row by row), then just apply concat() as you iterate over the URLs:
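A minimal sketch of that loop, assuming your scrape_baseball_data() returns a list of row-lists for one URL (here it is replaced by a dummy stand-in so the looping pattern itself is runnable; the URLs are placeholders, not real pages):

```python
import pandas as pd

# Stand-in for the real scraper: in the question it fetches and parses one
# URL with urlopen/BeautifulSoup; here it just returns dummy rows.
def scrape_baseball_data(url_parameter):
    return [[url_parameter, "2019", 100],
            [url_parameter, "2020", 120]]

# Placeholder URLs standing in for the real list of hundreds of player pages.
url_baseball_players = ["https://example.com/player1",
                        "https://example.com/player2"]

# Build one DataFrame per URL, then stack them all row-wise with concat().
frames = [pd.DataFrame(scrape_baseball_data(url)) for url in url_baseball_players]
df_baseball_player_data = pd.concat(frames, ignore_index=True)
```

ignore_index=True renumbers the combined rows 0..n-1 so the per-URL indices do not collide.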