Question

I have a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame({"Object": ['Apple', 'Orange', 'Banana', 'Grape', 'Cherry'],
                   "Jan 01 Vol": [0, 5, 2, 4, 8],
                   "Jan 01 Price": [1.15, 2.30, 1.75, 3.4, 2.5],
                   "Jan 01 Sales": [0, 11.5, 5.25, 13.6, 20],
                   "Jan 02 Vol": [1, 2, 3, 4, 5],
                   "Jan 02 Price": [1.15, 2.30, 1.75, 3.4, 2.5],
                   "Jan 02 Sales": [1.15, 4.6, 5.25, 13.6, 12.5],
                   "Feb 01 Vol": [5, 4, 3, 2, 1],
                   "Feb 01 Price": [1.15, 2.30, 1.75, 3.4, 2.5],
                   "Feb 01 Sales": [5.75, 9.2, 5.25, 6.8, 2.5]})

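I would like to reshape the dataframe so that "Vol", "Price", and "Sales" become their own columns, while the date part of each column name is pivoted vertically into rows, so that it looks like this: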

df2 = pd.DataFrame({"Object": ['Apple', 'Apple', 'Apple', 
                               'Orange','Orange', 'Orange', 
                               'Banana', 'Banana', 'Banana', 
                               'Grape', 'Grape', 'Grape', 
                               'Cherry', 'Cherry', 'Cherry'], 
                    "Year": [2001, 2001, 2002, 
                             2001, 2001, 2002, 
                             2001, 2001, 2002, 
                             2001, 2001, 2002, 
                             2001, 2001, 2002],
                   "Month": [1, 2, 1, 
                             1, 2, 1, 
                             1, 2, 1, 
                             1, 2, 1, 
                             1, 2, 1],
                    "Vol": [0, 5, 1, 5, 4, 2, 2, 3, 3, 4, 2, 4, 8, 1, 5],
                   "Price": [1.15, 1.15, 1.15, 2.30, 2.30, 2.30, 1.75, 1.75, 1.75, 3.4, 3.4, 3.4, 2.5, 2.5, 2.5],
                   "Sales": [0, 5.75, 1.15, 11.50, 9.2, 4.6, 5.25, 5.25, 5.25, 13.60, 6.8, 13.60, 20, 2.5, 12.5]})

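I considered writing a lambda function that creates a new column by extracting the year from the horizontal column names, but that did not work because the arrays have different lengths. I also considered building a pivot table, but again, I am not sure how to parse the "Vol", "Price", "Sales" part of those column names into their own columns. Any help would be greatly appreciated.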

Answer 1

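# Long format: one row per (Object, original column name)
dfm = df.melt(id_vars='Object')

# Split e.g. "Jan 01 Vol" on whitespace into Month, Year and Type columns
df3 = pd.concat([dfm[['Object', 'value']], dfm['variable'].str.split(expand=True)], axis=1)
df3.rename(columns={0: 'Month', 1: 'Year', 2: 'Type'}, inplace=True)
# Move Type back out into columns (long to wide)
df3 = df3.set_index(['Object', 'Year', 'Month', 'Type']).unstack()['value'].reset_index()
# Convert "01" to 2001 and "Jan" to 1
df3['Year'] = df3['Year'].astype(int) + 2000
df3['Month'] = pd.to_datetime(df3['Month'], format='%b').dt.month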

#Output
#Type  Object  Year  Month  Price  Sales  Vol
#0      Apple  2001      2   1.15   5.75  5.0
#1      Apple  2001      1   1.15   0.00  0.0
#2      Apple  2002      1   1.15   1.15  1.0
#3     Banana  2001      2   1.75   5.25  3.0
#4     Banana  2001      1   1.75   5.25  2.0
#5     Banana  2002      1   1.75   5.25  3.0
#6     Cherry  2001      2   2.50   2.50  1.0
#7     Cherry  2001      1   2.50  20.00  8.0
#8     Cherry  2002      1   2.50  12.50  5.0
#9      Grape  2001      2   3.40   6.80  2.0
#10     Grape  2001      1   3.40  13.60  4.0
#11     Grape  2002      1   3.40  13.60  4.0
#12    Orange  2001      2   2.30   9.20  4.0
#13    Orange  2001      1   2.30  11.50  5.0
#14    Orange  2002      1   2.30   4.60  2.0

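I would start by reshaping with pd.melt. Using .str.split with expand=True splits the information in the variable column (which pd.melt builds from the column names) into three separate columns, which are then renamed to something meaningful. Then set_index is used so that we can unstack, which spreads the Type values into three columns as required, going from long to wide format. Finally, the date parts are converted to the desired numbers.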
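To make the intermediate shapes easier to follow, here is a minimal sketch of the same melt, split, and unstack steps on a cut-down frame. The two-object subset and the names parts, long_df, and wide are assumptions made for illustration only; they are not part of the answer above.

```python
import pandas as pd

# Hypothetical two-row subset of the question's frame, to keep output short.
df = pd.DataFrame({"Object": ["Apple", "Orange"],
                   "Jan 01 Vol": [0, 5],
                   "Jan 01 Price": [1.15, 2.30],
                   "Feb 01 Vol": [5, 4],
                   "Feb 01 Price": [1.15, 2.30]})

# Step 1: melt to long format; columns are Object, variable, value.
dfm = df.melt(id_vars="Object")
print(dfm)

# Step 2: split "Jan 01 Vol" on whitespace into three pieces.
parts = dfm["variable"].str.split(expand=True)
parts.columns = ["Month", "Year", "Type"]
print(parts.head())

# Step 3: put Type into the index, then unstack it back out into columns.
long_df = pd.concat([dfm[["Object", "value"]], parts], axis=1)
wide = (long_df.set_index(["Object", "Year", "Month", "Type"])["value"]
               .unstack("Type")
               .reset_index())
wide.columns.name = None  # drop the leftover "Type" label on the columns
print(wide)
```

Selecting the value column before unstacking gives flat column names directly, which is a small variation on the unstack()['value'] form used in the answer.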

Hope that helps.

Answer 2

You can use pd.wide_to_long with some column renaming, and pd.to_datetime with the .dt accessor to get the year and month attributes:

df = df.set_index('Object')
# Move the measure name to the front: "Jan 01 Vol" becomes "Vol_Jan 01"
df.columns = df.columns.str.replace(r'(.+) (Vol|Price|Sales)$', r'\2_\1', regex=True)
df_out = pd.wide_to_long(df.reset_index(), ['Vol', 'Price', 'Sales'],
                         i='Object', j='Months', sep='_', suffix='.+')
df_out = df_out.reset_index()
# Parse "Jan 01" as month name plus two-digit year
df_out['Months'] = pd.to_datetime(df_out['Months'], format='%b %y')
df_out['Year'] = df_out['Months'].dt.year
df_out['Month'] = df_out['Months'].dt.month
df_out = df_out.drop('Months', axis=1).sort_values(['Object', 'Year', 'Month'])
print(df_out)

Output:

    Object  Vol  Price  Sales  Year  Month
0    Apple    0   1.15   0.00  2001      1
10   Apple    5   1.15   5.75  2001      2
5    Apple    1   1.15   1.15  2002      1
2   Banana    2   1.75   5.25  2001      1
12  Banana    3   1.75   5.25  2001      2
7   Banana    3   1.75   5.25  2002      1
4   Cherry    8   2.50  20.00  2001      1
14  Cherry    1   2.50   2.50  2001      2
9   Cherry    5   2.50  12.50  2002      1
3    Grape    4   3.40  13.60  2001      1
13   Grape    2   3.40   6.80  2001      2
8    Grape    4   3.40  13.60  2002      1
1   Orange    5   2.30  11.50  2001      1
11  Orange    4   2.30   9.20  2001      2
6   Orange    2   2.30   4.60  2002      1
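The renaming step is what lets pd.wide_to_long recognise the stub names, so it may help to look at it in isolation. The sketch below is illustrative only and assumes a hypothetical two-object frame. Note that a character class such as [Vol|Price|Sales]+ matches any run of those letters (for example "Slice") rather than one of the three words, so the sketch uses a group with alternation and passes regex=True explicitly.

```python
import re
import pandas as pd

cols = pd.Index(["Jan 01 Vol", "Jan 01 Price", "Feb 01 Sales"])

# Group with alternation: the last word must be exactly Vol, Price, or Sales.
renamed = cols.str.replace(r"(.+) (Vol|Price|Sales)$", r"\2_\1", regex=True)
print(list(renamed))  # ['Vol_Jan 01', 'Price_Jan 01', 'Sales_Feb 01']

# A character class accepts any mix of the listed characters, not whole words.
loose = re.fullmatch(r"[Vol|Price|Sales]+", "Slice") is not None
print(loose)  # True

# Hypothetical cut-down frame whose columns are already in "stub_suffix" form.
small = pd.DataFrame({"Object": ["Apple", "Orange"],
                      "Vol_Jan 01": [0, 5],
                      "Price_Jan 01": [1.15, 2.30],
                      "Vol_Feb 01": [5, 4],
                      "Price_Feb 01": [1.15, 2.30]})

# Each stub ("Vol", "Price") becomes a column; the suffix goes into "Months".
out = pd.wide_to_long(small, stubnames=["Vol", "Price"],
                      i="Object", j="Months", sep="_", suffix=".+").reset_index()
print(out)
```

The suffix=".+" argument matters here because the default suffix pattern only matches digits, and the suffixes in this data look like "Jan 01".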

Pandas pivot table with repeated column categories
