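I have weekly (say 5 weeks) sales and baskets for each product and store combination. I want to find the total spend and number of visits per product for a particular week (say 201520, i.e. the 20th week of 2015). When I select a week, some products may not have sold in that week, but I don't want them dropped from my groupby. Essentially I want all products sold across the 5 weeks, and if a product was not sold in the selected week, I want it to appear in my final DataFrame with aggregated numbers of 0.
Sample data (let's assume product 122 was not sold in week 201520):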
prod store week baskets sales
123 112 201518 20 100.45
123 112 201519 21 89.65
123 112 201520 22 1890.54
122 112 201518 10 909.99
Sample output (for 201520):
prod total_baskets total_sales spend_per_basket
123 22 1890.54 85.93363636
122 0 0 0
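I know this can be done in pandas with groupby, but I am doing it in multiple steps and I am looking for a more Pythonic and efficient way. Currently:
I first select the week for which I am doing the groupby,
then create a list of all products present in my initial weekly dataset,
then go back to the data with those groups. I find this inefficient. Please help. I also need to create spend per basket: if total_baskets > 0 then spend_per_basket is total_sales / total_baskets, else 0. TIA.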
Dummy code:
trans_for_my_week=weekly_trans[weekly_trans['week']==201520]
avg_sales_period_0=pd.DataFrame(trans_for_my_week.groupby(['prod']).agg({'baskets': {'total_baskets':'sum'},
'sales' :{'total_sales':'sum'}}))
avg_sales_period_0.columns=avg_sales_period_0.columns.droplevel(0)
avg_sales_period_0=avg_sales_period_0.reset_index()
etc.
Using the solution provided below, I get an error when running the following code:
x=round(res.sales / res.baskets,4)
x.columns = pd.MultiIndex.from_product(['spend_per_basket', res.columns.get_level_values(1).drop_duplicates()])
print(x)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-fbb15ec86cc6> in <module>()
7
8 x=round(res.sales / res.baskets,4)
----> 9 x.columns = pd.MultiIndex.from_product(['spend_per_basket', res.columns.get_level_values(1).drop_duplicates()])
10 print(x)
/usr/lib64/python3.4/site-packages/pandas/indexes/multi.py in from_product(cls, iterables, sortorder, names)
1022 from pandas.tools.util import cartesian_product
1023
-> 1024 labels, levels = _factorize_from_iterables(iterables)
1025 labels = cartesian_product(labels)
1026
/usr/lib64/python3.4/site-packages/pandas/core/categorical.py in _factorize_from_iterables(iterables)
2066 # For consistency, it should return a list of 2 lists.
2067 return [[], []]
-> 2068 return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
/usr/lib64/python3.4/site-packages/pandas/core/categorical.py in <listcomp>(.0)
2066 # For consistency, it should return a list of 2 lists.
2067 return [[], []]
-> 2068 return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
/usr/lib64/python3.4/site-packages/pandas/core/categorical.py in _factorize_from_iterable(values)
2028
2029 if not is_list_like(values):
-> 2030 raise TypeError("Input must be list-like")
2031
2032 if is_categorical(values):
TypeError: Input must be list-like
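The traceback points at the first element passed to from_product: every element of the outer list has to be list-like, and a bare string is not. The following is a minimal sketch of the failing and the working call; the `weeks` index is a stand-in for `res.columns.get_level_values(1).drop_duplicates()` and is an assumption made only for illustration.

```python
import pandas as pd

# Stand-in for the week level of the pivoted columns (assumed values).
weeks = pd.Index([201518, 201519, 201520], name='week')

# A bare string is not list-like, so from_product rejects it.
try:
    pd.MultiIndex.from_product(['spend_per_basket', weeks])
    raised = False
except TypeError:
    raised = True

# Wrapping the label in a list turns it into a one-element level.
cols = pd.MultiIndex.from_product([['spend_per_basket'], weeks])
print(raised)
print(cols.tolist())
```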
Answer 0 (score: 3)
You can also use pivot_table to get what you want. It is a slightly different approach, but you were looking for a one-liner:
print(pd.pivot_table(df, index = 'week', columns = 'prod', values = 'sales', aggfunc = 'sum').fillna(0))
Output:
prod 122 123
week
201518 909.99 100.45
201519 0.00 89.65
201520 0.00 1890.54
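The pivot above covers sales only. As a hedged sketch of how the same week-by-product layout could carry both measures and then be cut down to the single week from the question, the example below rebuilds the sample data as `df`; the names `wide` and `week_201520` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

# Pivot both measures at once; week/product pairs with no rows are filled with 0.
wide = pd.pivot_table(df, index='week', columns='prod',
                      values=['baskets', 'sales'], aggfunc='sum', fill_value=0)

# Selecting one week gives a Series keyed by (measure, prod);
# unstacking the measure level yields one row per product.
week_201520 = wide.loc[201520].unstack(level=0)
week_201520 = week_201520.rename(columns={'baskets': 'total_baskets',
                                          'sales': 'total_sales'})
print(week_201520)
```

Because the pivot is built from all weeks, product 122 keeps its row for 201520 even though it had no transactions that week.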
Answer 1 (score: 2)
UPDATE2: adding a new calculated multi-level column:
In [8]: x = res.sales / res.baskets
In [9]: x
Out[9]:
week     201518    201519     201520
prod
122     90.9990       NaN        NaN
123      5.0225  4.269048  85.933636
In [10]: x.columns = pd.MultiIndex.from_product([['spend_per_basket'], res.columns.get_level_values(1).drop_duplicates()])
In [11]: x
Out[11]:
     spend_per_basket
               201518    201519     201520
prod
122           90.9990       NaN        NaN
123            5.0225  4.269048  85.933636
In [12]: res = res.join(x)
In [13]: res
Out[13]:
     baskets                  sales                   spend_per_basket
week  201518 201519 201520   201518 201519   201520           201518    201519     201520
prod
122       10      0      0   909.99   0.00     0.00          90.9990       NaN        NaN
123       20     21     22   100.45  89.65  1890.54           5.0225  4.269048  85.933636
PS: multi-indexing (multi-level) techniques in pandas are very well documented, with lots of examples.
UPDATE: inspired by @JoeR's solution, here is a modified pivot_table() version:
res = df.pivot_table(index='prod', columns='week', values=['baskets','sales'], aggfunc='sum', fill_value=0)
In [189]: res
Out[189]:
     baskets                  sales
week  201518 201519 201520   201518 201519   201520
prod
122       10      0      0   909.99   0.00     0.00
123       20     21     22   100.45  89.65  1890.54
In [190]: res[[('baskets',201519)]]
Out[190]:
     baskets
week  201519
prod
122        0
123       21
In [192]: res.loc[122, [('sales',201519)]]
Out[192]:
       week
sales  201519    0.0
Name: 122, dtype: float64
But I would keep it as multi-level columns, so that you can use advanced indexing (as in the example above).
You can also flatten the column levels as follows:
In [194]: res2 = res.copy()
In [196]: res2.columns = ['{0[0]}_{0[1]}'.format(col) for col in res2.columns]
In [197]: res2
Out[197]:
      baskets_201518  baskets_201519  baskets_201520  sales_201518  sales_201519  sales_201520
prod
122               10               0               0        909.99          0.00          0.00
123               20              21              22        100.45         89.65       1890.54
OLD answer:
I would calculate it once for all your data.
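With a pivot like `res`, dividing sales by baskets gives NaN wherever a product had 0 baskets, while the question asks for 0 in that case. The following is a minimal sketch of one way to guard the division and then pull out a single week; it assumes the sample data from the question, uses pd.concat instead of join for the final step, and the names `spend` and `week_201520` are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

res = df.pivot_table(index='prod', columns='week', values=['baskets', 'sales'],
                     aggfunc='sum', fill_value=0)

# Mask zero-basket cells so the division yields NaN there, then turn NaN into 0.
baskets = res['baskets']
spend = (res['sales'] / baskets.where(baskets > 0)).fillna(0).round(4)

# Give the new block a top-level label so it lines up with the other measures.
spend.columns = pd.MultiIndex.from_product([['spend_per_basket'], spend.columns],
                                           names=res.columns.names)
res = pd.concat([res, spend], axis=1)

# Cross-section on the week level: one row per product, one column per measure.
week_201520 = res.xs(201520, axis=1, level=1)
print(week_201520)
```

Masking with where() before dividing also avoids inf for the (unlikely) case of non-zero sales with zero baskets, which a plain fillna(0) after the division would not catch.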
Answer 2 (score: 0)
I think the simplest and most "Pythonic" solution involves at least two steps: a groupby, and then a merge. You can do it like this:
# First create a container DataFrame to hold the data:
columns = pd.MultiIndex.from_arrays([['a', 'b'], df[0].unique()])
output = pd.DataFrame(columns=columns)
# Then the groupby magic
# Named aggregation (pandas >= 0.25); the nested-dict form of agg() was removed in pandas 1.0
agg_sales = weekly_trans.groupby(['week','prod']).agg(total_baskets=('baskets', 'sum'),
                                                      total_sales=('sales', 'sum'))
agg_sales = agg_sales.unstack() # This will set your 'prod' as columns
output = pd.concat([output, agg_sales], axis=0)
# And you can do that in one line, if you need to:
output = pd.concat([output, weekly_trans.groupby(['week','prod'])
                    .agg(total_baskets=('baskets', 'sum'), total_sales=('sales', 'sum'))
                    .unstack()], axis=0)
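For the single-week table the question asks for, the groupby route can also be kept short by reindexing the aggregated result on the full product list. The following is a minimal sketch, assuming the sample data from the question and pandas 0.25 or later for named aggregation; the helper name `week_summary` is hypothetical and not part of any answer above.

```python
import pandas as pd

# Sample data from the question.
weekly_trans = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

def week_summary(trans, week):
    """Hypothetical helper: per-product totals for one week, unsold products kept as 0 rows."""
    all_products = trans['prod'].unique()  # every product seen in any week
    out = (
        trans[trans['week'] == week]
        .groupby('prod')
        .agg(total_baskets=('baskets', 'sum'), total_sales=('sales', 'sum'))
        .reindex(all_products, fill_value=0)  # add products missing from this week
        .rename_axis('prod')
    )
    # spend_per_basket is total_sales / total_baskets, or 0 when there were no baskets.
    ratio = out['total_sales'] / out['total_baskets']
    out['spend_per_basket'] = ratio.where(out['total_baskets'] > 0, 0.0)
    return out.reset_index()

summary = week_summary(weekly_trans, 201520)
print(summary)
```

The reindex step replaces the separate "build a product list, then go back to the data" pass described in the question: the product list is computed once and applied to the already aggregated frame.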