Scraping table data from a list of URLs (each URL contains a unique table) in order to append it all into a single list/DataFrame?

Asked: 2017-08-24 02:31:23

Tags: python pandas beautifulsoup statistics data-science

I am scraping data from a list of hundreds of URLs, each of which contains a table of baseball statistics. For every unique URL in the list, there is one table showing all the seasons of a single baseball player's career, like this:

https://www.baseball-reference.com/players/k/killeha01.shtml

I have successfully created a script that appends the data from a single URL into a single list/DataFrame. However, here is my problem:

How should I adjust my code so that it scrapes the full list of hundreds of URLs from this domain and then appends all of the table rows from all of those URLs into a single list/DataFrame?

My general format for scraping a single URL is as follows:

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_baseball_players = ['https://www.baseball-reference.com/players/k/killeha01.shtml']

def scrape_baseball_data(url_parameter):

    html = urlopen(url_parameter)

    # create the BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")

    column_headers = [SCRAPING COMMAND WITH CSS SELECTOR GADGET FOR GETTING COLUMN HEADERS]

    table_rows = soup.select(SCRAPING COMMAND WITH CSS SELECTOR GADGET FOR GETTING ALL OF THE DATA FROM THE TABLES INCLUDING HTML CHARACTERS)

    player_data = []

    for row in table_rows:  

        player_list = [COMMANDS FOR SCRAPING HTML DATA FROM THE TABLES INTO AN ORGANIZED LIST]

        if not player_list:
            continue

        player_data.append(player_list)

    return player_data

# scrape the single URL in the list
list_baseball_player_data = scrape_baseball_data(url_baseball_players[0])

df_baseball_player_data = pd.DataFrame(list_baseball_player_data)

1 answer:

Answer 0 (score: 3)

If url_baseball_players is the list of all the URLs you want to scrape, and your desired output is one DataFrame (to which you append, row by row, the data from each new URL), then simply use concat() while iterating over the URLs:
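A minimal runnable sketch of that pattern. To keep it self-contained it parses in-memory HTML strings instead of live pages, so the table markup, column names, and `pages` dict are illustrative assumptions; in your script you would fetch each entry of `url_baseball_players` with `urlopen(url)` and keep your own CSS-selector scraping commands inside `scrape_baseball_data`:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-ins for the pages behind each URL; in practice you would
# call urlopen(url) for each entry in url_baseball_players.
pages = {
    "killeha01": "<table><tr><td>2001</td><td>.300</td></tr>"
                 "<tr><td>2002</td><td>.280</td></tr></table>",
    "otherpl01": "<table><tr><td>1999</td><td>.250</td></tr></table>",
}

def scrape_baseball_data(html):
    """Return one DataFrame of table rows scraped from a single page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = [[td.get_text() for td in tr.find_all("td")]
            for tr in soup.find_all("tr")]
    return pd.DataFrame(rows, columns=["year", "batting_avg"])

# Scrape every page, then stack the per-page frames row by row.
frames = [scrape_baseball_data(html) for html in pages.values()]
df_baseball_player_data = pd.concat(frames, ignore_index=True)
print(len(df_baseball_player_data))  # 3 rows total across both pages
```

Passing `ignore_index=True` to `pd.concat` renumbers the combined rows 0..n-1 instead of repeating each page's own index.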