Convert the list of DataFrames returned by pd.read_html into DataFrames with pandas

Time: 2019-12-29 00:17:06

Tags: python pandas

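Is it possible to modify pd.read_html so that it returns a DataFrame instead of a list of DataFrames?

Context: I am trying to import tables from a website using pandas read_html. I understand that pd.read_html returns a list of DataFrames rather than a single DataFrame. I have been working around this by assigning the first (and only) DataFrame in the returned list to a new variable. However, I want to store multiple DataFrames from different URLs in a master dictionary (using the code below), and I would like the values to be DataFrames, not lists.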

urls_dict = {
    '2017': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2017',
    '2016': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2016',
    '2015': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2015',
    '2014': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2014',
    '2013': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2013',
    '2012': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2012',
    '2011': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2011',
    '2010': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2010',
    '2009': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2009'        
}

import pandas as pd

dfs_dict = {}
for key, url in urls_dict.items():
    # pd.read_html returns a list of DataFrames, so each value here is a list
    dfs_dict[key] = pd.read_html(url)

2 Answers:

Answer 0 (score: 3)

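Use a list comprehension inside pd.concat to combine the DataFrames for each year, using .assign(year=year) to add the corresponding year as a column.

Note that pd.read_html(url) returns a list of DataFrames. For the URLs given, the list never contains more than one element, so use pd.read_html(url)[0] to get the actual DataFrame and then assign the year as a column.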

dfs = pd.concat([pd.read_html(url)[0].assign(year=year) for year, url in urls_dict.items()])
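If the goal is the dictionary from the question, with one DataFrame per year rather than one list per year, the same [0] indexing can be applied while building the dictionary. The sketch below is only an illustration and not part of the original answer: first_table_per_key and fake_read_html are hypothetical names, and the stub reader stands in for pd.read_html so the example runs without network access.

```python
import pandas as pd


def first_table_per_key(urls, read=pd.read_html):
    """Map each key to the single DataFrame parsed from its URL.

    `read` must return a list of DataFrames, as pd.read_html does.
    """
    frames = {}
    for key, url in urls.items():
        tables = read(url)
        if len(tables) != 1:
            # Assumption: every page holds exactly one table; fail loudly otherwise.
            raise ValueError(f"expected 1 table for {key!r}, found {len(tables)}")
        frames[key] = tables[0]
    return frames


def fake_read_html(url):
    # Hypothetical stand-in for pd.read_html: returns a one-element list.
    return [pd.DataFrame({"Name": ["A", "B"], "source": [url, url]})]


demo_urls = {"2016": "page-2016", "2017": "page-2017"}
dfs_dict = first_table_per_key(demo_urls, read=fake_read_html)
print(type(dfs_dict["2017"]))
```

With the real URLs, first_table_per_key(urls_dict) would call pd.read_html for each year and keep only the first table.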

Note that you can create urls_dict with a dictionary comprehension together with f-strings (formatted string literals, introduced in Python 3.6):

years = range(2009, 2018)
urls_dict = {
    str(year): f'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year={year}' 
    for year in years
}

Answer 1 (score: 1)

IIUC, we can make a small edit to your code and call pd.concat to concatenate the list returned by each pd.read_html call:

dfs = {}  # initialise an empty dict to hold one DataFrame per year.
# access the keys and values of the dictionary.
# in {'2017': [1, 2, 3]}, '2017' is the key and [1, 2, 3] is the value.
for key, url in urls_dict.items():
    # for each item in your dict, read the url and concatenate the returned list with pd.concat
    dfs[key] = pd.concat(pd.read_html(url))
    dfs[key]['grad_year'] = key  # if you want to assign the key to a column.
    dfs[key] = dfs[key].drop('PGY', axis=1)  # drop PGY.

print(dfs['2017'].iloc[:5,:3])
   PGY         Type                       Name
0  PGY-1  Categorical       Van Denakker, Tayler
1  PGY-1  Preliminary  Bisharat-Kernizan, Jumana
2  PGY-1  Preliminary        Schiffenhaus, James
3  PGY-1  Categorical            Collins, Kelsey
4  PGY-1  Categorical             Saker, Erfanul

type(dfs['2017'])
pandas.core.frame.DataFrame
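
Once every dictionary value is a DataFrame, the dictionary itself can also be passed to pd.concat, which uses the keys as the outer level of the resulting index. The following is a minimal sketch with made-up stand-in data; the Name values and the level names year and row are assumptions chosen for illustration.

```python
import pandas as pd

# Stand-in for the per-year dictionary built above; the contents are made up.
dfs = {
    "2016": pd.DataFrame({"Name": ["A", "B"]}),
    "2017": pd.DataFrame({"Name": ["C"]}),
}

# Passing a dict to pd.concat makes the keys the outer level of a MultiIndex.
combined = pd.concat(dfs, names=["year", "row"])

# Select one year's rows back out by its key.
rows_2017 = combined.loc["2017"]
print(rows_2017)
```

This keeps the year in the index rather than in a column, so no separate grad_year assignment is needed for the combined frame.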