Question

I am using Pandas groupby to get the top n items for each month of each year.

month_gr = df.groupby(by=[df.index.year, df.index.month_name(), df['Item Name']])
month_gr['Total'].sum().groupby(level=[0,1], group_keys=False).nlargest(5).sort_index(level=1)

This gives me the following output:

Order Datee  Order Datee  Item Name           
2020         August       12oz w/ lids            10097.50
                          8oz cup / lids          10246.50
                          Full fat Milk           32507.00
                          Grilled Chic WRAP       94166.58
                          Special Blend Beans     81855.00
             July         8oz cup / lids           4801.50
                          Arwa500ml                6700.41
                          Full fat Milk           13430.00
                          Spanish Latte ( R )      6480.00
                          Special Blend 500g      29880.00
             June         Full fat Milk            4740.00
                          MANAEESH CHEESE          3576.24
                          Marble cake              4810.65
                          NUTELLA CHEESECAKE       3350.90
                          Special Blend Beans      5652.00
             September    CLUB SANDWICH            1040.10
                          Cappuccino (Regular)     1404.80
                          Flat White (Regular)     1162.40
                          Ginger shot big          2016.00
                          Spanish Latte ( R )       926.40
Name: Total, dtype: float64

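If I use sort_index(level=1), the values are sorted alphabetically by month name, which gives the same output as above. However, I want to sort the months in calendar order: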

cats = ['January', 'February', 'March', 'April','May','June', 'July', 'August','September', 'October', 'November', 'December']

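I found a solution that sorts the months using pd.CategoricalIndex, but I don't know how to apply it to a MultiIndex.

Please explain how to sort the data above by month (level 1) or, more specifically, by year and month (levels 0 and 1).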

Answer 1

An example with a short DataFrame:

import pandas as pd

df = pd.DataFrame({
        'year': [2020, 2020, 2020, 2020, 2020, 2020],
        'month_name': ['August', 'August', 'August', 'July', 'July', 'September'],
        'Item Name': ['a', 'b', 'c', 'd', 'e', 'f'],
        'Total': [1, 2, 3, 4, 5, 6]
    })

month_gr = df.groupby(by=['year', 'month_name', 'Item Name'])['Total'].sum()
print(month_gr)

Prints:

year  month_name  Item Name
2020  August      a            1
                  b            2
                  c            3
      July        d            4
                  e            5
      September   f            6
Name: Total, dtype: int64

You can then reset the index, convert the month column to an ordered categorical, sort the values, and set the index again:

month_gr = month_gr.reset_index()

cats = ['January', 'February', 'March', 'April','May','June', 'July', 'August','September', 'October', 'November', 'December']
month_gr['month_name'] = pd.Categorical(month_gr['month_name'], cats, ordered=True)

print(month_gr.sort_values(by=['year', 'month_name']).set_index(['year', 'month_name', 'Item Name']))

Prints:

                           Total
year month_name Item Name       
2020 July       d              4
                e              5
     August     a              1
                b              2
                c              3
     September  f              6
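
If you would rather keep the Series and its MultiIndex instead of round-tripping through a DataFrame, one possible sketch is to rebuild the index with the month level as an ordered categorical and then call sort_index on levels 0 and 1. It reuses the toy data and the cats list from this answer; the names idx and sorted_gr are only illustrative, and it assumes every month name in the index appears in cats.

```python
import pandas as pd

cats = ['January', 'February', 'March', 'April', 'May', 'June',
        'July', 'August', 'September', 'October', 'November', 'December']

# Same toy data as in the answer; the result is a Series with a 3-level MultiIndex.
df = pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2020, 2020],
    'month_name': ['August', 'August', 'August', 'July', 'July', 'September'],
    'Item Name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Total': [1, 2, 3, 4, 5, 6],
})
month_gr = df.groupby(by=['year', 'month_name', 'Item Name'])['Total'].sum()

idx = month_gr.index
# Rebuild the index, replacing the month level (level 1) with an ordered
# Categorical so that sorting follows calendar order, not alphabetical order.
month_gr.index = pd.MultiIndex.from_arrays(
    [
        idx.get_level_values(0),
        pd.Categorical(idx.get_level_values(1), categories=cats, ordered=True),
        idx.get_level_values(2),
    ],
    names=idx.names,
)

# Sort by year (level 0) and month (level 1).
sorted_gr = month_gr.sort_index(level=[0, 1])
print(sorted_gr)
```

The same index rebuild should apply to the Series produced by the nlargest(5) step in the question, since only the month level of the MultiIndex is changed.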
