Question

My data contains prices, volumes, and other data about various financial securities. My input data looks like this:

import numpy as np
import pandas

prices = np.random.rand(15) * 100
volumes = np.random.randint(15, size=15) * 10
idx = pandas.Series([2007, 2007, 2007, 2007, 2007, 2008,
                     2008, 2008, 2008, 2008, 2009, 2009,
                     2009, 2009, 2009], name='year')
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
df = pandas.DataFrame({'price': prices, 'volume': volumes})
df.index = idx

# BELOW IS AN EXAMPLE OF WHAT INPUT MIGHT LOOK LIKE
# IT WON'T BE EXACT BECAUSE OF THE USE OF RANDOM
#           price  volume
# year
# 2007   0.121002      30
# 2007  15.256424      70
# 2007  44.479590      50
# 2007  29.096013       0
# 2007  21.424690       0
# 2008  23.019548      40
# 2008  90.011295       0
# 2008  88.487664      30
# 2008  51.609119      70
# 2008   4.265726      80
# 2009  34.402065     140
# 2009  10.259064     100
# 2009  47.024574     110
# 2009  57.614977     140
# 2009  54.718016      50

I would like to produce a DataFrame that looks like this:

year       2007       2008       2009
0      0.121002  23.019548  34.402065
1     15.256424  90.011295  10.259064
2     44.479590  88.487664  47.024574
3     29.096013  51.609119  57.614977
4     21.424690   4.265726  54.718016

I know of one way to produce the output above using groupby:

df = df.reset_index()
grouper = df.groupby('year')
df2 = None
for group, data in grouper:
    series = data['price'].copy()
    series.index = range(len(series))
    series.name = group
    df2 = pandas.DataFrame(series) if df2 is None else pandas.concat([df2, series], axis=1)

I also know that you can do a pivot to get a DataFrame that has NaNs for the index values missing from each pivoted column:

# df = df.reset_index()
df.pivot(columns='year', values='price')

# Output
# year       2007       2008       2009
# 0      0.121002        NaN        NaN
# 1     15.256424        NaN        NaN
# 2     44.479590        NaN        NaN
# 3     29.096013        NaN        NaN
# 4     21.424690        NaN        NaN
# 5           NaN  23.019548        NaN
# 6           NaN  90.011295        NaN
# 7           NaN  88.487664        NaN
# 8           NaN  51.609119        NaN
# 9           NaN   4.265726        NaN
# 10          NaN        NaN  34.402065
# 11          NaN        NaN  10.259064
# 12          NaN        NaN  47.024574
# 13          NaN        NaN  57.614977
# 14          NaN        NaN  54.718016
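
The staggered NaN pattern comes from pivot using the existing row labels as the output index when no index is given: each label belongs to exactly one year, so every row has a value in only one column. A minimal sketch with made-up values (not the random data above):

```python
import pandas

# Hypothetical stand-in data: two years, two prices each, default 0..3 index
df = pandas.DataFrame({'year': [2007, 2007, 2008, 2008],
                       'price': [1.0, 2.0, 3.0, 4.0]})

# No index= argument, so the existing labels 0..3 become the row labels
staggered = df.pivot(columns='year', values='price')
print(staggered)

# Each of the 4 rows holds one real value and one NaN
nan_count = int(staggered.isna().sum().sum())
print(nan_count)
```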

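My question is as follows:

Is there a way to create my output DataFrame in the groupby without creating the Series, or is there a way to re-index my input DataFrame so that I get the desired output using pivot?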

Answer 1

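You need to label the rows within each year 0-4. To do this, use cumcount after grouping. Then you can pivot correctly using that new column as the index.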

df['year_count'] = df.groupby(level='year').cumcount()
df.reset_index().pivot(index='year_count', columns='year', values='price')

year             2007       2008       2009
year_count                                 
0           61.682275  32.729113  54.859700
1           44.231296   4.453897  45.325802
2           65.850231  82.023960  28.325119
3           29.098607  86.046499  71.329594
4           67.864723  43.499762  19.255214
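
For a reproducible check, here is the same idea as a self-contained sketch on a small hand-written frame. The values and the shortened data set are assumptions for illustration only, not the question's random data:

```python
import pandas

# Hypothetical stand-in data: three years, two prices each, indexed by year
df = pandas.DataFrame(
    {'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     'volume': [10, 20, 30, 40, 50, 60]},
    index=pandas.Index([2007, 2007, 2008, 2008, 2009, 2009], name='year'),
)

# Number the rows within each year: 0, 1, 0, 1, 0, 1
df['year_count'] = df.groupby(level='year').cumcount()

# Pivot on the within-year position instead of the original row label
wide = df.reset_index().pivot(index='year_count', columns='year', values='price')
print(wide)
```

If the years had different numbers of rows, the shorter columns would presumably be padded with NaN at the bottom, since pivot aligns on year_count.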

Answer 2

You can use groupby with apply to build a new Series from each group's numpy array of values, then reshape with unstack:

print(df.groupby(level='year')['price'].apply(lambda x: pandas.Series(x.values)).unstack(0))
year       2007       2008       2009
0     55.360804  68.671626  78.809139
1     50.246485  55.639250  84.483814
2     17.646684  14.386347  87.185550
3     54.824732  91.846018  60.793002
4     24.303751  50.908714  22.084445
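
As a self-contained sketch of this approach, the snippet below runs it on a small hand-written frame (the values are made up for illustration). Wrapping x.values in a fresh Series gives each group a new 0..n-1 index, and unstack(0) then moves the year level into the columns:

```python
import pandas

# Hypothetical stand-in data: three years, two prices each, indexed by year
df = pandas.DataFrame(
    {'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=pandas.Index([2007, 2007, 2008, 2008, 2009, 2009], name='year'),
)

# Each group becomes a Series indexed 0..n-1; the result has a (year, position) MultiIndex
stacked = df.groupby(level='year')['price'].apply(lambda x: pandas.Series(x.values))

# Move the first index level (year) into the columns
wide = stacked.unstack(0)
print(wide)
```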

Reset the index of a pandas DataFrame created from groupby or pivot?
