如何在python

时间:2015-06-26 16:22:10

标签: python pandas

我有pandas格式的数据框:

                                                      line_idno  item_name \
sitename                        ts_placed order_idno
www.mattressesworld.co.uk       47        5242367       4112061  a
www.bedroomfurnitureworld.co.uk 47        5242295       4111977  b  
                                          5242295       4111979  a 
                                          5242295       4111978  v 
                                          5242295       4111980  a
www.mattressesworld.co.uk       47        5242300       4111986  b
www.bedroomfurnitureworld.co.uk 47        5242294       4111973  v     

这有3个索引('sitename','ts_placed','order_idno') 其中:

  • 'ts_placed'代表一年中的几周

  • 'sitename'代表网站的名称

  • 'order_idno'订单ID号

和5列('line_idno''item_name','item_qty','item_price','收入')。

通过以下功能将数据分组到多个级别:

grouped = data.groupby(level=['sitename','ts_placed','order_idno']).sum()
grouped0 = grouped.dropna()

结果:

                                                     line_idno  item_qty  \
sitename                        ts_placed order_idno
www.bedroomfurnitureworld.co.uk 38        5156953       3994322         1   
                                          5156956       3994325         1   
                                          5157144       3994580         1   
                                          5157191       3994641         0   
                                          5157198       3994651         1   
                                          5157217       3994678         2   
                                          5157218       3994679         2   
                                          5157233       3994697         1   
                                          5157257       7989463         2   

我希望获得每个网站平均收入的每周细分。换句话说,每个ts_placed / sitename组的所有收入总和除以每个ts_placed的行数。

1 个答案:

答案 0 :(得分:1)

这是一个可重复的例子。

import pandas as pd
import numpy as np

# simulate some artificial data
np.random.seed(1)
sites = ['www.{}.co.uk'.format(x) for x in 'AAA BBB CCC DDD EEE'.split()]
sitename = np.random.choice(sites, size=1000)
ts_placed = np.random.choice(np.arange(47, 53), size=1000)
order_idno = np.random.choice(np.arange(520000, 550000), size=1000)
item_name = np.random.choice('a b c d e f'.split(), size=1000)
line_idno = np.random.choice(np.arange(3960000, 4000000), size=1000)
item_qty = np.random.choice(np.arange(0, 10), size=1000)
item_price = np.random.choice(np.arange(1000, 10000), size=1000)
revenue = item_price * item_qty

data = pd.DataFrame(dict(sitename=sitename, ts_placed=ts_placed, order_idno=order_idno, item_name=item_name, line_idno=line_idno, item_qty=item_qty, item_price=item_price, revenue=revenue)).set_index(['sitename', 'ts_placed', 'order_idno'])

Out[17]: 
                                   item_name  item_price  item_qty  line_idno  revenue
sitename      ts_placed order_idno                                                    
www.DDD.co.uk 47        526418             a        4514         1    3989144     4514
www.EEE.co.uk 52        539155             d        4952         5    3965922    24760
www.AAA.co.uk 52        539417             d        8816         0    3988185        0
www.BBB.co.uk 49        523800             b        3340         3    3981971    10020
www.DDD.co.uk 48        521464             f        4402         6    3976820    26412
www.AAA.co.uk 49        521706             c        8436         5    3963275    42180
              52        544452             c        7220         8    3992357    57760
www.BBB.co.uk 50        548184             d        3389         9    3976608    30501
www.EEE.co.uk 49        527830             f        8110         1    3998527     8110
                        521908             a        7292         4    3964393    29168
www.BBB.co.uk 47        527558             b        4945         6    3977830    29670
www.CCC.co.uk 47        549572             f        3350         5    3988678    16750
www.EEE.co.uk 48        522511             f        1865         0    3992356        0
www.CCC.co.uk 51        520156             e        4717         8    3974344    37736
www.EEE.co.uk 50        534951             b        3738         9    3978519    33642
...                                      ...         ...       ...        ...      ...
www.CCC.co.uk 50        525279             e        5961         0    3980873        0
www.DDD.co.uk 48        539486             c        2028         4    3978442     8112
www.EEE.co.uk 48        543216             e        3721         6    3986919    22326
www.BBB.co.uk 51        525662             c        1264         7    3987129     8848
www.CCC.co.uk 52        546208             e        7287         4    3999828    29148
www.AAA.co.uk 48        544288             a        7708         1    3974546     7708
www.DDD.co.uk 52        538708             f        9080         7    3983499    63560
www.CCC.co.uk 48        536774             a        8971         2    3968092    17942
www.BBB.co.uk 48        528310             c        3284         2    3985896     6568
www.AAA.co.uk 49        549547             c        4265         4    3960981    17060
                        544394             c        2268         8    3982739    18144
              52        540515             f        4476         5    3987786    22380
www.EEE.co.uk 50        540388             f        1226         5    3980156     6130
              47        522633             f        4185         5    3964986    20925
www.CCC.co.uk 49        532710             c        7462         2    3984676    14924

[1000 rows x 5 columns]

# your custom apply funciton
def apply_func(group):
    avg_revenue = group.revenue.mean()
    count_unique_order = len(group.order_idno.unique())
    # or try this
    # count_unique_order = group.order_idno.value_counts().count()
    return pd.Series({'avg_revenue': avg_revenue, 'count_unique_order': count_unique_order})

# use the customized apply funciton
data.reset_index(level='order_idno').dropna().groupby(level=['sitename', 'ts_placed']).apply(apply_func)

   Out[46]: 
                            avg_revenue  count_unique_order
   sitename      ts_placed                                 
   www.AAA.co.uk 47          23501.8158                  10
                 48          23003.9355                  10
                 49          24254.1212                  10
                 50          23254.6410                  10
                 51          19173.8966                  10
                 52          23845.6786                  10
   www.BBB.co.uk 47          26136.0882                  10
                 48          23007.3929                   9
                 49          30605.2857                  10
                 50          19530.3871                  10
                 51          21768.6667                   9
                 52          28387.5455                  10
   www.CCC.co.uk 47          28917.3448                   9
                 48          23659.3488                  10
                 49          26209.0625                   9
                 50          22929.2564                  10
                 51          23474.2857                   9
                 52          22123.3429                  10
   www.DDD.co.uk 47          27176.2778                  10
                 48          24530.6154                  10
                 49          23601.8710                   9
                 50          27749.2162                  10
                 51          26816.0000                   9
                 52          29910.5455                  10
   www.EEE.co.uk 47          27270.6471                  10
                 48          23498.0789                  10
                 49          26682.4250                  10
                 50          24524.4400                  10
                 51          15635.2500                  10
                 52          20917.2500                  10