
Time: 2016-08-06 06:21:18

Tags: python python-3.x pandas

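I have weekly (say, 5 weeks of) sales and basket counts for each product and store combination. I want to find the total spend and the number of visits (baskets) per product for a specific week, say 201520, i.e. week 20 of 2015. When I select a single week, some products may not have been sold in that week, but I don't want them dropped from my groupby. Basically, I want every product that was sold at any point in the 5 weeks to appear, and if a product was not sold in the selected week, I want it in my final DataFrame with aggregated figures of 0. Sample data (let's assume product 122 was not sold in week 201520):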

prod store week    baskets sales
123  112   201518  20      100.45
123  112   201519  21      89.65
123  112   201520  22      1890.54
122  112   201518  10      909.99

Sample output (for 201520):

prod total_baskets   total_sales  spend_per_basket
123  22              1890.54      85.93363636
122  0               0            0

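I know this can be done in pandas using groupby, but I am doing it in multiple steps and I am looking for a more Pythonic and efficient way. Currently, I first select the week I am aggregating, then build a list of all products present in my initial weekly dataset, and then merge the grouped result back onto that list. I find this inefficient. Please help. I also need to create spend per basket: if total_baskets > 0, then spend_per_basket is total_sales / total_baskets, else 0. TIA. Dummy code: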

# Keep only the rows for the selected week
trans_for_my_week = weekly_trans[weekly_trans['week'] == 201520]
# Sum baskets and sales per product (named aggregation; the nested-dict
# renaming form of agg() was removed in pandas 1.0)
avg_sales_period_0 = trans_for_my_week.groupby('prod').agg(total_baskets=('baskets', 'sum'),
                                                           total_sales=('sales', 'sum'))
avg_sales_period_0 = avg_sales_period_0.reset_index()

and so on.

Using the solution provided below, I get an error when I write the following code:

x=round(res.sales / res.baskets,4)
x.columns = pd.MultiIndex.from_product(['spend_per_basket', res.columns.get_level_values(1).drop_duplicates()])

print(x)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-fbb15ec86cc6> in <module>()
      7 
      8 x=round(res.sales / res.baskets,4)
----> 9 x.columns = pd.MultiIndex.from_product(['spend_per_basket', res.columns.get_level_values(1).drop_duplicates()])
     10 print(x)

/usr/lib64/python3.4/site-packages/pandas/indexes/multi.py in from_product(cls, iterables, sortorder, names)
   1022         from pandas.tools.util import cartesian_product
   1023 
-> 1024         labels, levels = _factorize_from_iterables(iterables)
   1025         labels = cartesian_product(labels)
   1026 

/usr/lib64/python3.4/site-packages/pandas/core/categorical.py in _factorize_from_iterables(iterables)
   2066         # For consistency, it should return a list of 2 lists.
   2067         return [[], []]
-> 2068     return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))

/usr/lib64/python3.4/site-packages/pandas/core/categorical.py in <listcomp>(.0)
   2066         # For consistency, it should return a list of 2 lists.
   2067         return [[], []]
-> 2068     return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))

/usr/lib64/python3.4/site-packages/pandas/core/categorical.py in _factorize_from_iterable(values)
   2028 
   2029     if not is_list_like(values):
-> 2030         raise TypeError("Input must be list-like")
   2031 
   2032     if is_categorical(values):

TypeError: Input must be list-like
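
The traceback points at the first argument to `from_product`: the bare string 'spend_per_basket' is not list-like, while `from_product` expects a sequence of iterables. A minimal sketch of the difference, using a hypothetical `weeks` index in place of `res.columns.get_level_values(1).drop_duplicates()`:

```python
import pandas as pd

# Hypothetical stand-in for the week level of res.columns
weeks = pd.Index([201518, 201519, 201520], name='week')

# A bare string is not list-like, so this form fails as in the traceback above:
#   pd.MultiIndex.from_product(['spend_per_basket', weeks])
# Wrapping the label in its own list gives from_product two iterables to combine.
cols = pd.MultiIndex.from_product([['spend_per_basket'], weeks])
print(cols.tolist())
```

Each resulting column label is a ('spend_per_basket', week) pair, which is the shape needed before joining the new columns back onto a two-level frame.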

3 answers:

Answer 0 (score: 3)

You can also get what you need with pivot_table. It is a slightly different approach, but since you were looking for a one-liner:

print(pd.pivot_table(df, index = 'week', columns = 'prod', values = 'sales', aggfunc = 'sum').fillna(0))

Output:

prod       122      123
week                   
201518  909.99   100.45
201519    0.00    89.65
201520    0.00  1890.54
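
One possible way to get from this pivoted table to the one-week summary in the question is sketched below. It assumes the sample data from the question, builds a second pivot for baskets in the same way, and then reads off the row for the chosen week; the names `sales`, `baskets`, `week` and `summary` are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

# One pivot per measure; products with no rows in a week become 0
sales = pd.pivot_table(df, index='week', columns='prod', values='sales', aggfunc='sum').fillna(0)
baskets = pd.pivot_table(df, index='week', columns='prod', values='baskets', aggfunc='sum').fillna(0)

week = 201520
summary = pd.DataFrame({'total_baskets': baskets.loc[week],
                        'total_sales': sales.loc[week]})

# spend_per_basket is total_sales / total_baskets when baskets > 0, else 0
ratio = summary['total_sales'] / summary['total_baskets']
summary['spend_per_basket'] = ratio.where(summary['total_baskets'] > 0, 0.0)
print(summary)
```

Because the zeros are filled in before the week is selected, product 122 stays in the result even though it has no rows for 201520.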

Answer 1 (score: 2)

UPDATE2: adding a new calculated multi-level column:

In [8]: x = res.sales / res.baskets

In [9]: x
Out[9]:
week     201518    201519     201520
prod
122     90.9990       NaN        NaN
123      5.0225  4.269048  85.933636

In [10]: x.columns = pd.MultiIndex.from_product([['spend_per_basket'], res.columns.get_level_values(1).drop_duplicates()])

In [11]: x
Out[11]:
     spend_per_basket
week           201518    201519     201520
prod
122           90.9990       NaN        NaN
123            5.0225  4.269048  85.933636

In [12]: res = res.join(x)

In [13]: res
Out[13]:
     baskets                 sales                  spend_per_basket
week  201518 201519 201520  201518 201519   201520           201518    201519     201520
prod
122       10      0      0  909.99   0.00     0.00          90.9990       NaN        NaN
123       20     21     22  100.45  89.65  1890.54           5.0225  4.269048  85.933636

UPDATE: here is a modified pivot_table() version (inspired by @JoeR's solution):

res = df.pivot_table(index='prod', columns='week', values=['baskets','sales'], aggfunc='sum', fill_value=0)

In [189]: res
Out[189]:
     baskets                 sales
week  201518 201519 201520  201518 201519   201520
prod
122       10      0      0  909.99   0.00     0.00
123       20     21     22  100.45  89.65  1890.54

In [190]: res[[('baskets',201519)]]
Out[190]:
     baskets
week  201519
prod
122        0
123       21

In [192]: res.loc[122, [('sales',201519)]]
Out[192]:
       week
sales  201519    0.0
Name: 122, dtype: float64

You can also flatten the column levels as follows:

In [194]: res2 = res.copy()

In [196]: res2.columns = ['{0[0]}_{0[1]}'.format(col) for col in res2.columns]

In [197]: res2
Out[197]:
      baskets_201518  baskets_201519  baskets_201520  sales_201518  sales_201519  sales_201520
prod
122               10               0               0        909.99          0.00          0.00
123               20              21              22        100.45         89.65       1890.54

but I would keep it as multi-level columns, so that you can use advanced indexing (as in the example above).

PS: multi-indexing (multi-level) techniques in pandas are very well documented, with lots of examples.

OLD answer:

I would calculate it once for all your data:
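
Returning to the multi-level `res` built by the pivot_table version in this answer: one way to pull out a single week and apply the question's rule (0 when there are no baskets) might look like the sketch below. It assumes the sample data from the question; `one_week` and the renamed columns are illustrative names, and `xs` is just one of several ways to slice a column level.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

res = df.pivot_table(index='prod', columns='week', values=['baskets', 'sales'],
                     aggfunc='sum', fill_value=0)

# Take the cross-section for one week from the 'week' column level
one_week = (res.xs(201520, axis=1, level='week')
               .rename(columns={'baskets': 'total_baskets', 'sales': 'total_sales'}))

# 0/0 yields NaN in pandas, so replace it with 0 wherever there were no baskets
ratio = one_week['total_sales'] / one_week['total_baskets']
one_week['spend_per_basket'] = ratio.where(one_week['total_baskets'] > 0, 0.0)
print(one_week)
```

Keeping `res` multi-level means the same slice can be repeated for any other week without recomputing the aggregation.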

Answer 2 (score: 0)

I think the simplest and most "Pythonic" solution involves at least two steps: a groupby, followed by a merge. You can do it like this:

import pandas as pd

# First create an empty container DataFrame whose columns cover every (measure, prod) pair:
columns = pd.MultiIndex.from_product([['total_baskets', 'total_sales'], weekly_trans['prod'].unique()])
output = pd.DataFrame(columns=columns)

# Then the groupby magic (named aggregation; nested-dict renaming was removed in pandas 1.0)
agg_sales = weekly_trans.groupby(['week', 'prod']).agg(total_baskets=('baskets', 'sum'),
                                                       total_sales=('sales', 'sum'))
agg_sales = agg_sales.unstack()  # This will set your 'prod' as columns
output = pd.concat([output, agg_sales], axis=0)

# And you can do that in one line, if you need to:
output = pd.concat([output,
                    weekly_trans.groupby(['week', 'prod'])
                                .agg(total_baskets=('baskets', 'sum'), total_sales=('sales', 'sum'))
                                .unstack()], axis=0)
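
As a variation on the two-step idea, the sketch below swaps the concat step for `reindex`: it groups only the selected week and then reindexes the result over every product seen in the full data, filling the missing ones with 0. This is an alternative technique rather than the code from the answer, and it assumes the sample data from the question; `all_products` and `summary` are illustrative names.

```python
import pandas as pd

# Sample data from the question
weekly_trans = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

# Every product that appears anywhere in the 5 weeks
all_products = weekly_trans['prod'].unique()

summary = (weekly_trans[weekly_trans['week'] == 201520]
           .groupby('prod')
           .agg(total_baskets=('baskets', 'sum'), total_sales=('sales', 'sum'))
           .reindex(all_products, fill_value=0))  # unsold products get 0

# spend_per_basket is total_sales / total_baskets when baskets > 0, else 0
ratio = summary['total_sales'] / summary['total_baskets']
summary['spend_per_basket'] = ratio.where(summary['total_baskets'] > 0, 0.0)
print(summary)
```

This aggregates only the rows for the chosen week, which may matter when the full dataset is large and just one week is needed.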