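I am writing an economics paper and need some help combining and transforming two datasets.
I have two pandas DataFrames. One contains a list of countries and their neighbors (borderdf), for example: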
borderdf
country neighbor
sweden norway
sweden denmark
denmark germany
denmark sweden
and the other contains data for each country and year (datadf), for example:
datadf
country gdp year
sweden 5454 2004
sweden 5676 2005
norway 3433 2004
norway 3433 2005
denmark 2132 2004
denmark 2342 2005
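I need to create a column neighborsmeangdp in datadf that contains the mean GDP of all of each country's neighbors, as given by borderdf. I would like my result to look like this: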
datadf
country year gdp neighborsmeangdp
sweden 2004 5454 5565
sweden 2005 5676 5775
How can I do this?
Answer 0 (score: 0)
I think a straightforward approach is to put the GDP values into the border DataFrame. After that, all that is needed is to sum the groupby object and then do a merge:
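# Assumption: datadf2 is datadf indexed by (country, year), so each neighbor's GDP can be looked up directly
datadf2 = datadf.set_index(['country', 'year'])
borderdf[2004] = [datadf2.loc[(item, 2004), 'gdp'] for item in borderdf.neighbor]
borderdf[2005] = [datadf2.loc[(item, 2005), 'gdp'] for item in borderdf.neighbor]
# Note: sum() reproduces the figures in the question; use mean() for a true average
gpdf = borderdf.groupby(by=['country'])[[2004, 2005]].sum()
df = gpdf.unstack().rename('neighborsmeangdp').reset_index()
df = df.rename(columns={'level_0': 'year'})
df['year'] = df['year'].astype(int)
print(pd.merge_ordered(datadf, df))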
country gdp year neighborsmeangdp
0 denmark 2132 2004 7586
1 germany 2132 2004 NaN
2 norway 3433 2004 NaN
3 sweden 5454 2004 5565
4 denmark 2342 2005 8018
5 germany 2342 2005 NaN
6 norway 3433 2005 NaN
7 sweden 5676 2005 5775
[8 rows x 4 columns]
Of course, I had to make up some data for Germany:
germany 2132 2004
germany 2342 2005
I am sure there is a better way to do this.
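One possible way to avoid hard-coding a column per year is to pivot the GDP table to a wide layout and look up each neighbor's row in it. The following is only a sketch under some assumptions: the sample frames are rebuilt from the question, the intermediate names (gdp_wide, neighbor_gdp, neighbor_sum, long_df) are made up for illustration, and no data is invented for Germany, so Denmark's figure here covers only Sweden. It keeps sum() because that is what reproduces the 5565 and 5775 shown in the question; mean() would give a true average.

```python
import pandas as pd

# Sample data rebuilt from the question.
borderdf = pd.DataFrame({
    "country": ["sweden", "sweden", "denmark", "denmark"],
    "neighbor": ["norway", "denmark", "germany", "sweden"],
})
datadf = pd.DataFrame({
    "country": ["sweden", "sweden", "norway", "norway", "denmark", "denmark"],
    "gdp": [5454, 5676, 3433, 3433, 2132, 2342],
    "year": [2004, 2005, 2004, 2005, 2004, 2005],
})

# Wide layout: one row per country, one column per year.
gdp_wide = datadf.pivot(index="country", columns="year", values="gdp")

# One row per border pair holding the neighbor's GDP for every year.
# Neighbors without data (germany) become NaN rows.
neighbor_gdp = gdp_wide.reindex(borderdf["neighbor"].to_numpy())
neighbor_gdp.index = pd.Index(borderdf["country"], name="country")

# Aggregate over each country's neighbors; NaN values are skipped.
neighbor_sum = neighbor_gdp.groupby(level="country").sum()

# Back to long layout: one row per (country, year).
long_df = neighbor_sum.stack().rename("neighborsmeangdp").reset_index()

# A left merge keeps every row of datadf, with NaN where no neighbors are listed.
result = datadf.merge(long_df, on=["country", "year"], how="left")
print(result)
```

With the sample data, sweden gets 5565 for 2004 and 5775 for 2005, and norway gets NaN because it never appears in the country column of borderdf.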
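Answer 1 (score: 0)
You can use the pandas merge function to combine the two directly.
The trick here is that you actually want to merge the country column in datadf with the neighbor column in borderdf.
Then use groupby and mean to get the average neighbor GDP.
Finally, merge with the data again to get each country's own GDP.
For example: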
import pandas as pd
from io import StringIO
border_csv = '''
country, neighbor
sweden, norway
sweden, denmark
denmark, germany
denmark, sweden
'''
data_csv = '''
country, gdp, year
sweden, 5454, 2004
sweden, 5676, 2005
norway, 3433, 2004
norway, 3433, 2005
denmark, 2132, 2004
denmark, 2342, 2005
'''
# skipinitialspace drops the blank after each comma; the leading empty line is skipped by default
borders = pd.read_csv(StringIO(border_csv), skipinitialspace=True)
data = pd.read_csv(StringIO(data_csv), skipinitialspace=True)
# Match each neighbor to its own GDP rows
merged = pd.merge(borders, data, left_on='neighbor', right_on='country')
merged = merged.drop('country_y', axis=1)
merged.columns = ['country', 'neighbor', 'gdp', 'year']
# Average only the numeric gdp column; the neighbor column holds strings
grouped = merged.groupby(['country', 'year'])
neighbor_means = grouped[['gdp']].mean()
neighbor_means.columns = ['neighbor_gdp']
neighbor_means.reset_index(inplace=True)
results_df = pd.merge(neighbor_means, data, on=['country', 'year'])
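The same steps could also be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: the function name add_neighbor_mean_gdp is hypothetical, the sample frames are rebuilt from the question, and the final merge uses how="left" (an assumption about what is wanted) so that countries with no listed neighbors stay in the result with NaN instead of being dropped.

```python
import pandas as pd

def add_neighbor_mean_gdp(borders, data):
    """Return `data` with an extra neighbor_gdp column (hypothetical helper)."""
    # Re-key the GDP table by neighbor so it lines up with borders.neighbor.
    neighbor_data = data.rename(
        columns={"country": "neighbor", "gdp": "neighbor_gdp"}
    )
    # Inner merge: one row per (country, neighbor, year) that has GDP data.
    merged = borders.merge(neighbor_data, on="neighbor")
    # Mean GDP over each country's neighbors, per year.
    means = merged.groupby(["country", "year"], as_index=False)["neighbor_gdp"].mean()
    # Left merge keeps every original row; missing means show up as NaN.
    return data.merge(means, on=["country", "year"], how="left")

# Sample data rebuilt from the question.
borders = pd.DataFrame({
    "country": ["sweden", "sweden", "denmark", "denmark"],
    "neighbor": ["norway", "denmark", "germany", "sweden"],
})
data = pd.DataFrame({
    "country": ["sweden", "sweden", "norway", "norway", "denmark", "denmark"],
    "gdp": [5454, 5676, 3433, 3433, 2132, 2342],
    "year": [2004, 2005, 2004, 2005, 2004, 2005],
})

results_df = add_neighbor_mean_gdp(borders, data)
print(results_df)
```

For sweden this gives 2782.5 in 2004 and 2887.5 in 2005, which are true means. The 5565 and 5775 in the question are the corresponding sums, so replace mean() with sum() if those are the numbers you are after.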