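How do I combine two pandas datasets and run a grouped calculation?

Time: 2014-04-22 14:56:21

Tags: python pandas economics

I am writing an economics paper and need some help combining and transforming two datasets.

I have two pandas DataFrames. One contains a list of countries and their neighbors (borderdf), for example: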

borderdf
country    neighbor
sweden     norway
sweden     denmark
denmark    germany
denmark    sweden

The other contains data for each country and year (datadf), for example:

datadf
country    gdp    year
sweden     5454   2004
sweden     5676   2005
norway     3433   2004
norway     3433   2005
denmark    2132   2004
denmark    2342   2005

I need to create a column neighborsmeangdp in datadf that contains the mean GDP of all of a country's neighbors, as given by borderdf. I would like my result to look like this:

datadf
country    year    gdp    neighborsmeangdp
sweden     2004    5454   5565
sweden     2005    5676   5775

How can I do this?

2 answers:

Answer 0: (score: 0)

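I think the straightforward approach is to put the GDP values into the border DataFrame. Then all that is needed is to sum the groupby object and do a merge:

# datadf2 is not defined in the answer; assumed here to be datadf indexed by (country, year)
datadf2 = datadf.set_index(['country', 'year'])
borderdf[2004] = [datadf2.loc[(item, 2004)].values[0] for item in borderdf.neighbor]
borderdf[2005] = [datadf2.loc[(item, 2005)].values[0] for item in borderdf.neighbor]
# sum only the year columns so the text column 'neighbor' is left out
gpdf = borderdf.groupby(by=['country'])[[2004, 2005]].sum()
df = pd.DataFrame(gpdf.unstack(), columns=['neighborsmeangdp'])
df = df.reset_index()
df = df.rename(columns={'level_0': 'year'})
print(pd.merge_ordered(datadf, df))

   country   gdp  year  neighborsmeangdp
0  denmark  2132  2004              7586
1  germany  2132  2004               NaN
2   norway  3433  2004               NaN
3   sweden  5454  2004              5565
4  denmark  2342  2005              8018
5  germany  2342  2005               NaN
6   norway  3433  2005               NaN
7   sweden  5676  2005              5775

[8 rows x 4 columns]

Of course, I had to make up some data for germany:

germany    2132   2004
germany    2342   2005

I am sure there are better ways to do it.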
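The same idea (attach each neighbor's GDP to the border pairs, then aggregate per country) can also be sketched in long format, without one column per year. This is only an illustration under assumptions: the germany rows are the made-up values from this answer, and the names `pairs`, `stats`, `neighbors_sum` and `neighbors_mean` are hypothetical. It computes both the sum and the mean so the two can be compared.

```python
import pandas as pd

borderdf = pd.DataFrame({
    "country": ["sweden", "sweden", "denmark", "denmark"],
    "neighbor": ["norway", "denmark", "germany", "sweden"],
})
# The germany rows are the made-up placeholder values from the answer above.
datadf = pd.DataFrame({
    "country": ["sweden", "sweden", "norway", "norway",
                "denmark", "denmark", "germany", "germany"],
    "gdp": [5454, 5676, 3433, 3433, 2132, 2342, 2132, 2342],
    "year": [2004, 2005, 2004, 2005, 2004, 2005, 2004, 2005],
})

# Attach each neighbor's GDP to the border pairs: one row per (pair, year).
neighbor_gdp = datadf.rename(columns={"country": "neighbor", "gdp": "neighbor_gdp"})
pairs = borderdf.merge(neighbor_gdp, on="neighbor", how="left")

# Aggregate per (country, year), keeping both statistics side by side.
stats = (
    pairs.groupby(["country", "year"])["neighbor_gdp"]
    .agg(neighbors_sum="sum", neighbors_mean="mean")
    .reset_index()
)
print(stats)
```

With these inputs, the sum column reproduces the 5565 and 5775 figures shown for sweden, while the mean column gives 2782.5 and 2887.5, so it is worth checking which of the two statistics is actually wanted.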

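Answer 1: (score: 0)

You can merge the two directly with the pandas merge function. The trick is that you actually want to merge the country column of datadf with the neighbor column of borderdf. Then use groupby and mean to get the average neighbor GDP. Finally, merge the result back with the data to get each country's own GDP. For example: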

import pandas as pd
from io import StringIO

border_csv = '''
country, neighbor
sweden, norway
sweden, denmark
denmark, germany
denmark, sweden
'''

data_csv = '''
country, gdp, year
sweden, 5454, 2004
sweden, 5676, 2005
norway, 3433, 2004
norway, 3433, 2005
denmark, 2132, 2004
denmark, 2342, 2005
'''

# skipinitialspace=True drops the blank that follows each comma in the inline CSV
borders = pd.read_csv(StringIO(border_csv), skipinitialspace=True)
data = pd.read_csv(StringIO(data_csv), skipinitialspace=True)

# match each border pair to the neighbor's rows in data
merged = pd.merge(borders, data, left_on='neighbor', right_on='country')
merged = merged.drop('country_y', axis=1)
merged.columns = ['country', 'neighbor', 'gdp', 'year']

# average the neighbors' GDP per country and year (numeric column only)
grouped = merged.groupby(['country', 'year'])
neighbor_means = grouped[['gdp']].mean()
neighbor_means.columns = ['neighbor_gdp']
neighbor_means.reset_index(inplace=True)

# attach each country's own GDP
results_df = pd.merge(neighbor_means, data, on=['country', 'year'])
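
The merge, groupby and merge steps above can also be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: the function name `add_neighbor_mean` and the output column name `neighbor_gdp` are hypothetical. It differs from the code above in one respect: the final merge is a left join, so every row of the data is kept even when a country has no listed neighbors.

```python
import pandas as pd


def add_neighbor_mean(data, borders, value="gdp"):
    """Return `data` plus a column with the mean of `value` over each country's neighbors."""
    out_col = "neighbor_" + value
    # Relabel the data so it can be matched on the neighbor column.
    lookup = data.rename(columns={"country": "neighbor", value: out_col})
    # Inner join: neighbors that have no rows in `data` simply drop out.
    pairs = borders.merge(lookup, on="neighbor")
    means = pairs.groupby(["country", "year"], as_index=False)[out_col].mean()
    # Left join: keep every row of `data`; countries without neighbor data get NaN.
    return data.merge(means, on=["country", "year"], how="left")


borderdf = pd.DataFrame({
    "country": ["sweden", "sweden", "denmark", "denmark"],
    "neighbor": ["norway", "denmark", "germany", "sweden"],
})
datadf = pd.DataFrame({
    "country": ["sweden", "sweden", "norway", "norway", "denmark", "denmark"],
    "gdp": [5454, 5676, 3433, 3433, 2132, 2342],
    "year": [2004, 2005, 2004, 2005, 2004, 2005],
})

result = add_neighbor_mean(datadf, borderdf)
print(result)
```

With the question's data, norway has no entry as a country in borderdf and therefore gets NaN, and denmark's mean is based on sweden alone because germany has no rows in datadf. Whether a missing neighbor should be ignored like this or should make the whole mean NaN is a modeling decision the question does not settle.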