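I'm working with a DataFrame that has UPC, date_expected, and quantity_picked columns. The raw data contains multiple UPCs per day (one row per order, and several orders in a day can include the same UPC), but it does not list every date for every UPC, only the dates on which the quantity picked was greater than 0.
Goal: build a DataFrame that shows quantity_picked by UPC and then by date_expected, listing every date from 5/14/19 to the present, even when quantity_picked = 0 (the original data source has no rows with quantity_picked = 0).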
MFC_order_daily['date_expected'] = pd.to_datetime(MFC_order_daily['date_expected'], format='%Y-%m-%d')
print('Daily Pick Data:')
print(MFC_order_daily)
The data comes in the following format:
Daily Pick Data:
UPC quantity_picked date_expected
0 0001111041660 1.0 2019-05-14
1 0001111045045 1.0 2019-05-14
... ... ...
39694 0004470036000 6.0 2019-06-24
39695 0007225001116 1.0 2019-06-24
[39696 rows x 3 columns]
I tried organizing it with groupby and reset_index as shown below, but the resulting DataFrame is missing the dates where quantity_picked = 0:
tipd = MFC_order_daily.groupby(['UPC', 'date_expected']).sum().reset_index()
tipd = tipd[['UPC','date_expected','quantity_picked']]
print(tipd)
UPC date_expected quantity_picked
0 0000000002554 2019-05-14 4.0
1 0001111041660 2019-05-14 2.0
2 0001111041660 2019-05-16 2.0
3 0004470036000 2019-05-14 3.0
4 0004470036000 2019-05-16 1.0
I then tried creating a crosstab to get the zero values, planning to reshape it with stack or melt. The crosstab was created successfully:
tipd2 = pd.crosstab([MFC_order_daily["UPC"]], MFC_order_daily["date_expected"])
print(tipd2)
date_expected 2019-05-14 2019-05-15 ... 2019-06-23 2019-06-24
UPC ...
0000000002554 0 0 ... 0 0
0000000003082 0 1 ... 2 3
0000000003107 1 0 ... 2 2
... ... ... ... ...
0360600051715 0 0 ... 0 0
0501072452748 0 0 ... 0 0
0880100551750 0 0 ... 0 0
[8302 rows x 42 columns]
Tried stacking:
tipd2.stack('date_expected')
print('Stacked tipd2:')
print(tipd2)
Tried melting:
tipd2.melt(id_vars=['UPC', 'date_expected'])
Resulting error:
KeyError: "The following 'id_vars' are not present in the DataFrame: ['UPC', 'date_expected']"
Desired output:
UPC date_expected quantity_picked
0 0000000002554 2019-05-14 4.0
1 0000000002554 2019-05-15 0.0
2 0000000002554 2019-05-16 0.0
3 0001111041660 2019-05-14 2.0
4 0001111041660 2019-05-15 0.0
5 0001111041660 2019-05-16 2.0
6 0004470036000 2019-05-14 3.0
7 0004470036000 2019-05-15 0.0
8 0004470036000 2019-05-16 1.0
That is, for each UPC, cycle through every date starting from 5/14/19.
Answer 0 (score: 1)
IIUC, you can use pivot_table and stack:
import pandas as pd

# this is after aggregation by `groupby().sum()`
df = pd.DataFrame({'UPC': ['0000000002554', '0001111041660', '0001111041660',
'0004470036000', '0004470036000'],
'date_expected': ['2019-05-14',
'2019-05-14',
'2019-05-16',
'2019-05-14',
'2019-05-16'],
'quantity_picked': [4.0, 2.0, 2.0, 3.0, 1.0]})
(df.pivot_table(index='UPC',
columns='date_expected',
values='quantity_picked',
fill_value=0)
.stack()
.reset_index()
)
Output:
UPC date_expected 0
0 0000000002554 2019-05-14 4
1 0000000002554 2019-05-16 0
2 0001111041660 2019-05-14 2
3 0001111041660 2019-05-16 2
4 0004470036000 2019-05-14 3
5 0004470036000 2019-05-16 1
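Two details of this result are worth noting: the stacked values end up in a column named 0, and pivot_table averages by default, which only matches the totals because the data was already summed. A minimal sketch of a variant that works on un-aggregated order rows and names the value column might look like this. The `orders` frame and its values are made up for illustration; `aggfunc='sum'` and `reset_index(name=...)` are standard pandas parameters.

```python
import pandas as pd

# Hypothetical raw rows: one row per order, so a UPC can repeat within a day.
orders = pd.DataFrame({
    'UPC': ['0001111041660', '0001111041660', '0004470036000'],
    'date_expected': pd.to_datetime(['2019-05-14', '2019-05-14', '2019-05-16']),
    'quantity_picked': [1.0, 1.0, 3.0],
})

long_form = (orders.pivot_table(index='UPC',
                                columns='date_expected',
                                values='quantity_picked',
                                aggfunc='sum',     # total per UPC/day, not the default mean
                                fill_value=0)      # UPC/day pairs with no orders become 0
                   .stack()                        # back to one row per (UPC, date)
                   .reset_index(name='quantity_picked'))
print(long_form)
```

As in the answer's example, this only covers dates that appear somewhere in the data; a day on which no UPC was picked at all still produces no rows.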
If you also want to fill in the missing dates, you might want to look at reindex.
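One way the reindex idea could be sketched, assuming the goal is one row per UPC for every calendar day: build the full UPC-by-date index with `pd.MultiIndex.from_product` and `pd.date_range`, then reindex the grouped sums against it. The names `all_dates`, `full_index`, and `filled` are illustrative, and the end of the date range (here the last observed date) is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'UPC': ['0000000002554', '0001111041660', '0001111041660',
            '0004470036000', '0004470036000'],
    'date_expected': pd.to_datetime(['2019-05-14', '2019-05-14', '2019-05-16',
                                     '2019-05-14', '2019-05-16']),
    'quantity_picked': [4.0, 2.0, 2.0, 3.0, 1.0],
})

# Every calendar day in the observed range. The end date is an assumption:
# swap in pd.Timestamp.today().normalize() to run up to the current date.
all_dates = pd.date_range(df['date_expected'].min(),
                          df['date_expected'].max(), freq='D')

# Cartesian product of every UPC with every date.
full_index = pd.MultiIndex.from_product(
    [df['UPC'].unique(), all_dates], names=['UPC', 'date_expected'])

filled = (df.groupby(['UPC', 'date_expected'])['quantity_picked']
            .sum()
            .reindex(full_index, fill_value=0)   # missing UPC/date pairs get 0
            .reset_index())
print(filled)
```

On this sample the result has nine rows, including 2019-05-15 with quantity_picked 0.0 for every UPC, which matches the layout of the desired output in the question.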