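Python pandas - aggregating subportfolios

Time: 2017-04-21 19:15:55

Tags: python python-2.7 pandas

I have two dataframes: one holds the value of each subportfolio, and the other lists the higher-level portfolios that each subportfolio rolls up into.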

table1
subportfolio value
top-alpha-1  1
top-alpha-2  2
top-alpha-3  3
top-beta-1   4
top-beta-2   5
top-beta-3   6
top-gamma-1  7
top-gamma-2  8
top-gamma-3  9

table2
portfolio    parent     level
top-alpha-1  top-alpha  1
top-alpha-2  top-alpha  1
top-alpha-3  top-alpha  1
top-beta-1   top-beta   1
top-beta-2   top-beta   1
top-beta-3   top-beta   1
top-gamma-1  top-gamma  1
top-gamma-2  top-gamma  1
top-gamma-3  top-gamma  1
top-alpha    top        2
top-beta     top        2
top-gamma    top        2
top          self       3

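My goal is to merge these two tables so that not only the subportfolios are populated with values, but every higher level also gets a value aggregated from the portfolios beneath it.

My first thought was some kind of iteration, but with a large amount of data that could be very time-consuming.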

table2 (desired result)
portfolio    value parent     level
top-alpha-1  1     top-alpha  1
top-alpha-2  2     top-alpha  1
top-alpha-3  3     top-alpha  1
top-beta-1   4     top-beta   1
top-beta-2   5     top-beta   1
top-beta-3   6     top-beta   1
top-gamma-1  7     top-gamma  1
top-gamma-2  8     top-gamma  1
top-gamma-3  9     top-gamma  1
top-alpha    6     top        2
top-beta     15    top        2
top-gamma    24    top        2
top          45    self       3

3 Answers:

Answer 0 (score: 3)

New answer

Note: I have renamed the column 'subportfolio' to 'portfolio'.

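import pandas as pd

def agg_lvl(t1, t2):
    # One pass: add the total of every parent whose children already have a value in t1
    lcol = ['level', 'portfolio']
    rcol = ['parent', 'portfolio']
    kwargs = dict(
        left_on='portfolio', right_on='parent',
        suffixes=('_', '')
    )
    # Pair each portfolio with its children, then attach the children's values
    lvl = t2[lcol].merge(t2[rcol], **kwargs).drop(columns='portfolio_').merge(t1)
    lvl = lvl.groupby('parent').value.sum().rename_axis('portfolio').reset_index()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    return pd.concat([t1, lvl], ignore_index=True).drop_duplicates(), t2

# Two passes, because the sample hierarchy has two levels above the leaves
o1, o2 = agg_lvl(*agg_lvl(table1, table2))

o2.merge(o1)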

    level     parent    portfolio  value
0       1  top-alpha  top-alpha-1      1
1       1  top-alpha  top-alpha-2      2
2       1  top-alpha  top-alpha-3      3
3       1   top-beta   top-beta-1      4
4       1   top-beta   top-beta-2      5
5       1   top-beta   top-beta-3      6
6       1  top-gamma  top-gamma-1      7
7       1  top-gamma  top-gamma-2      8
8       1  top-gamma  top-gamma-3      9
9       2        top    top-alpha      6
10      2        top     top-beta     15
11      2        top    top-gamma     24
12      3       self          top     45
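
The call above chains agg_lvl twice because the sample data has exactly two levels above the leaves. For a hierarchy of unknown depth, one option is to repeat a pass until no new totals appear. The following is only a sketch under assumptions that are not in the question: the helper name roll_up is made up, the root is assumed to be marked by a parent that is not itself a portfolio (like 'self'), and the small tables are hypothetical sample data.

```python
import pandas as pd

def roll_up(values, tree):
    """Repeat parent-level aggregation until no new totals appear.

    values: columns 'portfolio', 'value' (leaves only at the start)
    tree:   columns 'portfolio', 'parent'
    """
    known = values[['portfolio', 'value']].copy()
    # Ignore links whose parent is not itself a portfolio (the 'self' marker on the root)
    links = tree.loc[tree['parent'].isin(tree['portfolio']), ['portfolio', 'parent']]
    while True:
        # Left merge keeps children that have no value yet (their value is NaN)
        merged = links.merge(known, on='portfolio', how='left')
        grouped = merged.groupby('parent')['value']
        # A parent is ready only when every one of its children has a value
        ready = grouped.count() == grouped.size()
        sums = grouped.sum()[ready]
        new = sums[~sums.index.isin(known['portfolio'])]
        if new.empty:
            return known
        new = new.rename_axis('portfolio').reset_index()
        known = pd.concat([known, new], ignore_index=True)

# Hypothetical sample data, smaller than the tables in the question
table1 = pd.DataFrame({
    'portfolio': ['top-alpha-1', 'top-alpha-2', 'top-beta-1'],
    'value': [1, 2, 4],
})
table2 = pd.DataFrame({
    'portfolio': ['top-alpha-1', 'top-alpha-2', 'top-beta-1',
                  'top-alpha', 'top-beta', 'top'],
    'parent': ['top-alpha', 'top-alpha', 'top-beta', 'top', 'top', 'self'],
})

totals = roll_up(table1, table2)
result = table2.merge(totals, on='portfolio', how='left')
print(result)
```

Waiting until all children of a parent have a value keeps an unbalanced tree (a leaf sitting next to an intermediate portfolio) from producing a partial total.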

Setup

import pandas as pd

table2 = pd.DataFrame({
        'level': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
        'parent': [
            'top-alpha',
            'top-alpha',
            'top-alpha',
            'top-beta',
            'top-beta',
            'top-beta',
            'top-gamma',
            'top-gamma',
            'top-gamma',
            'top',
            'top',
            'top',
            'self'],
        'portfolio': [
            'top-alpha-1',
            'top-alpha-2',
            'top-alpha-3',
            'top-beta-1',
            'top-beta-2',
            'top-beta-3',
            'top-gamma-1',
            'top-gamma-2',
            'top-gamma-3',
            'top-alpha',
            'top-beta',
            'top-gamma',
            'top']})

table1 = pd.DataFrame({
        'portfolio': ['top-alpha-1', 'top-alpha-2', 'top-alpha-3', 'top-beta-1', 'top-beta-2', 'top-beta-3', 'top-gamma-1', 'top-gamma-2', 'top-gamma-3'],
        'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]
    })

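Old answer

This solution builds on another solution of mine and may not be exactly what you need... but then again, you did not state clearly what you need, so I took some liberties.

First, I create another dataframe df in which the subportfolio column is split on '-'.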

col = 'subportfolio'
# Name the split pieces '3', '2', '1' (top level first)
rnm_dict = dict(enumerate(list('321')))
df = table1.drop(columns=col).join(table1[col].str.split('-', expand=True).rename(columns=rnm_dict))
print(df)

   value    3      2  1
0      1  top  alpha  1
1      2  top  alpha  2
2      3  top  alpha  3
3      4  top   beta  1
4      5  top   beta  2
5      6  top   beta  3
6      7  top  gamma  1
7      8  top  gamma  2
8      9  top  gamma  3

Now run the aggregation

agged = pd.concat([
        df.assign(
            **{x: '' for x in '321'[i:]}
        ).groupby(list('321')).sum() for i in range(1, 4)
    ]).sort_index()


table2.join(agged.set_index(agged.index.to_series().str.join('-').str.strip('-').values), on='portfolio')

    level     parent    portfolio  value
0       1  top-alpha  top-alpha-1      1
1       1  top-alpha  top-alpha-2      2
2       1  top-alpha  top-alpha-3      3
3       1   top-beta   top-beta-1      4
4       1   top-beta   top-beta-2      5
5       1   top-beta   top-beta-3      6
6       1  top-gamma  top-gamma-1      7
7       1  top-gamma  top-gamma-2      8
8       1  top-gamma  top-gamma-3      9
9       2        top    top-alpha      6
10      2        top     top-beta     15
11      2        top    top-gamma     24
12      3       self          top     45
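
To see what a single pass of the comprehension above does, here is a minimal sketch on a hypothetical four-row frame (not the data from the question). Blanking the last name component and grouping on all three components gives the middle-level totals; the variable names sample and mid are made up for this illustration.

```python
import pandas as pd

# Hypothetical frame in the same shape as df above: one column per name component
sample = pd.DataFrame({
    'value': [1, 2, 3, 4],
    '3': ['top', 'top', 'top', 'top'],
    '2': ['alpha', 'alpha', 'beta', 'beta'],
    '1': ['1', '2', '1', '2'],
})

# Blank out the last component so all rows of one group share the same key
mid = sample.assign(**{'1': ''}).groupby(list('321')).sum()
print(mid)
```

Joining the index tuples with '-' and stripping the trailing '-' then turns a key such as ('top', 'alpha', '') back into the portfolio name 'top-alpha'.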

Answer 1 (score: 2)

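table3 = table2.merge(table1, 
                      left_on="portfolio", 
                      right_on="subportfolio", 
                      how="left").drop('subportfolio', axis=1)
# Middle token of the name ('alpha', 'beta', 'gamma'); NaN for 'top'
table3['letter'] = table3.portfolio.str.split('-').str[1]
# Level 2: sum of the leaf values sharing a letter, matched by name rather than by row order
sums = table3.groupby('letter').value.sum()
lvl2 = table3.level == 2
table3.loc[lvl2, 'value'] = table3.loc[lvl2, 'letter'].map(sums)
# Level 3: sum of the level-2 totals
table3.loc[table3.level==3, 'value'] = table3.loc[lvl2, 'value'].sum()
table3.drop('letter', axis=1, inplace=True)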

# output
      portfolio     parent  level  value
0   top-alpha-1  top-alpha      1    1.0
1   top-alpha-2  top-alpha      1    2.0
2   top-alpha-3  top-alpha      1    3.0
3    top-beta-1   top-beta      1    4.0
4    top-beta-2   top-beta      1    5.0
5    top-beta-3   top-beta      1    6.0
6   top-gamma-1  top-gamma      1    7.0
7   top-gamma-2  top-gamma      1    8.0
8   top-gamma-3  top-gamma      1    9.0
9     top-alpha        top      2    6.0
10     top-beta        top      2   15.0
11    top-gamma        top      2   24.0
12          top       self      3   45.0

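Answer 2 (score: 0)

Thanks for all the answers. I have borrowed the ideas you gave me and tried to build something as dynamic as possible (any number of levels, any portfolio naming format, and so on).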

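df = table2.merge(table1, on="portfolio", how="left")
for i in range(2, df.level.max() + 1):
    # Sum the values of level i-1 per parent; select 'value' explicitly so
    # the string columns are not summed, and let the parent names become the index
    df1 = (df.loc[df.level == i - 1]
             .groupby('parent')[['value']].sum()
             .rename_axis('portfolio'))
    # combine_first fills only the values that are still missing
    df = df.set_index('portfolio').combine_first(df1).reset_index()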

I used the 'Setup' provided by piRsquared in his answer. Result:

      portfolio  level     parent  value
0           top      3       self   45.0
1     top-alpha      2        top    6.0
2   top-alpha-1      1  top-alpha    1.0
3   top-alpha-2      1  top-alpha    2.0
4   top-alpha-3      1  top-alpha    3.0
5      top-beta      2        top   15.0
6    top-beta-1      1   top-beta    4.0
7    top-beta-2      1   top-beta    5.0
8    top-beta-3      1   top-beta    6.0
9     top-gamma      2        top   24.0
10  top-gamma-1      1  top-gamma    7.0
11  top-gamma-2      1  top-gamma    8.0
12  top-gamma-3      1  top-gamma    9.0

If you want to keep the portfolios ordered by level, you can use

df = df.sort_values('level')
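
A quick way to check a result like the one above is to compare every portfolio's value with the sum of its direct children. This is only a sketch: the function name check_rollup, the tolerance tol, and the four-row sample frame are made up for illustration and are not part of any answer here.

```python
import pandas as pd

def check_rollup(df, tol=1e-9):
    """Return the portfolios whose value differs from the sum of their children."""
    child_sums = df.groupby('parent')['value'].sum()
    # Only portfolios that appear as someone's parent have children to compare against
    parents = df[df['portfolio'].isin(child_sums.index)]
    expected = parents['portfolio'].map(child_sums)
    mismatch = (parents['value'] - expected).abs() > tol
    return parents.loc[mismatch, 'portfolio'].tolist()

# Hypothetical sample in the same shape as the result above
result = pd.DataFrame({
    'portfolio': ['top-alpha-1', 'top-alpha-2', 'top-alpha', 'top'],
    'parent': ['top-alpha', 'top-alpha', 'top', 'self'],
    'value': [1.0, 2.0, 3.0, 3.0],
})
print(check_rollup(result))
```

An empty list means every parent matches the sum of its children; any name in the list points at a level that was aggregated incorrectly.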