Reshaping data from a crosstab using stack or melt

Date: 2019-07-03 17:32:47

Tags: python pandas dataframe

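I am working with a DataFrame that contains a UPC column, date_expected, and quantity_picked. The raw data has multiple rows per UPC per day (one row per order, and several orders in a day can contain the same UPC), but it does not list every date for every UPC, only the dates on which the quantity picked was greater than 0. Goal: build a DataFrame showing quantity_picked by UPC and then by date_expected, listing every date from 5/14/19 to the present, even where quantity_picked = 0 (the original data source contains no rows with quantity_picked = 0).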

MFC_order_daily['date_expected'] = pd.to_datetime(MFC_order_daily['date_expected'], format='%Y-%m-%d')
print('Daily Pick Data:')
print(MFC_order_daily)

The data comes in the following format:

Daily Pick Data:
                 UPC  quantity_picked date_expected
0      0001111041660              1.0    2019-05-14
1      0001111045045              1.0    2019-05-14
             ...              ...           ...
39694  0004470036000              6.0    2019-06-24
39695  0007225001116              1.0    2019-06-24

[39696 rows x 3 columns]

I tried organizing it with groupby and reset_index as shown below, but the resulting DataFrame is missing the dates where quantity_picked = 0:

tipd = MFC_order_daily.groupby(['UPC', 'date_expected']).sum().reset_index()
tipd = tipd[['UPC','date_expected','quantity_picked']]
print(tipd)
                 UPC date_expected  quantity_picked
0      0000000002554    2019-05-14              4.0
1      0001111041660    2019-05-14              2.0
2      0001111041660    2019-05-16              2.0
3      0004470036000    2019-05-14              3.0
4      0004470036000    2019-05-16              1.0

I then tried creating a crosstab to get the zero values and reshaping it with stack or melt. The crosstab was created successfully:

tipd2 = pd.crosstab([MFC_order_daily["UPC"]], MFC_order_daily["date_expected"])
print(tipd2)
date_expected  2019-05-14  2019-05-15  ...  2019-06-23  2019-06-24
UPC                                    ...                        
0000000002554           0           0  ...           0           0
0000000003082           0           1  ...           2           3
0000000003107           1           0  ...           2           2
                  ...         ...  ...         ...         ...
0360600051715           0           0  ...           0           0
0501072452748           0           0  ...           0           0
0880100551750           0           0  ...           0           0

[8302 rows x 42 columns]

Tried stack:

tipd2.stack('date_expected')
print('Stacked tipd2:')
print(tipd2)

The printed data is identical to the crosstab shown above: no change, no error.
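A likely explanation for the unchanged output: `DataFrame.stack` returns a new object and leaves `tipd2` untouched, so printing `tipd2` afterwards shows the same crosstab. The sketch below uses a small made-up frame (`orders`, `wide`, and `order_count` are hypothetical names, not from the original data) and keeps the returned value. Note also that `pd.crosstab` without a `values` argument counts rows, so the numbers are order counts rather than summed quantity_picked.

```python
import pandas as pd

# Toy order data standing in for MFC_order_daily (values are made up).
orders = pd.DataFrame({
    "UPC": ["0001", "0001", "0002"],
    "date_expected": pd.to_datetime(["2019-05-14", "2019-05-16", "2019-05-14"]),
})

# Wide table: one row per UPC, one column per date, cells are row counts.
wide = pd.crosstab(orders["UPC"], orders["date_expected"])

# stack() does not modify `wide`; keep the Series it returns.
stacked = wide.stack()

# Name the values and turn the (UPC, date_expected) index back into columns.
long_counts = stacked.rename("order_count").reset_index()
print(long_counts)
```

Zero cells survive the stack because they are real values, not NaN.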

Tried melt:

tipd2.melt(id_vars=['UPC', 'date_expected'])

Error produced:

KeyError: "The following 'id_vars' are not present in the DataFrame: ['UPC', 'date_expected']"
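The KeyError is consistent with how the crosstab is laid out: `UPC` is the index, and `date_expected` is only the name of the columns axis, so neither exists as a column that `id_vars` could refer to. One possible fix, sketched on made-up data (`orders`, `wide`, and `order_count` are hypothetical names), is to move `UPC` into a column with `reset_index` and melt the remaining date columns.

```python
import pandas as pd

# Toy order data standing in for MFC_order_daily (values are made up).
orders = pd.DataFrame({
    "UPC": ["0001", "0001", "0002"],
    "date_expected": pd.to_datetime(["2019-05-14", "2019-05-16", "2019-05-14"]),
})

wide = pd.crosstab(orders["UPC"], orders["date_expected"])

# Move UPC from the index into a regular column so melt can use it as id_vars;
# every remaining column (one per date) becomes a row in the long table.
melted = wide.reset_index().melt(
    id_vars="UPC",
    var_name="date_expected",
    value_name="order_count",
)
print(melted)
```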

Desired output:

                 UPC date_expected  quantity_picked
0      0000000002554    2019-05-14              4.0
1      0000000002554    2019-05-15              0.0
2      0000000002554    2019-05-16              0.0
3      0001111041660    2019-05-14              2.0
4      0001111041660    2019-05-15              0.0
5      0001111041660    2019-05-16              2.0
6      0004470036000    2019-05-14              3.0
7      0004470036000    2019-05-15              0.0
8      0004470036000    2019-05-16              1.0

That is, cycling through every date for each UPC, starting from 5/14/19.

1 Answer:

Answer 0: (score: 1)

IIUC, you can use pivot_table and stack:

import pandas as pd

# this is after aggregation by `groupby().sum()`
df = pd.DataFrame({'UPC': ['0000000002554', '0001111041660', '0001111041660', 
                           '0004470036000', '0004470036000'],
 'date_expected': ['2019-05-14',
  '2019-05-14',
  '2019-05-16',
  '2019-05-14',
  '2019-05-16'],
 'quantity_picked': [4.0, 2.0, 2.0, 3.0, 1.0]})


(df.pivot_table(index='UPC', 
          columns='date_expected', 
          values='quantity_picked',
          fill_value=0)
   .stack()
   .reset_index()
)

Output:

             UPC date_expected  0
0  0000000002554    2019-05-14  4
1  0000000002554    2019-05-16  0
2  0001111041660    2019-05-14  2
3  0001111041660    2019-05-16  2
4  0004470036000    2019-05-14  3
5  0004470036000    2019-05-16  1

If you also want to fill in the missing dates, you may want to look at reindex.
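A minimal sketch of the reindex idea, assuming the data has already been aggregated so that each (UPC, date_expected) pair appears once. The names `all_dates`, `full_index`, and `filled` are illustrative, and the date range here simply spans the earliest to the latest date in the sample; to run from 5/14/19 to today, the endpoints passed to `pd.date_range` could be swapped for fixed values.

```python
import pandas as pd

# Aggregated picks, same sample values as the frame above.
df = pd.DataFrame({
    "UPC": ["0000000002554", "0001111041660", "0001111041660",
            "0004470036000", "0004470036000"],
    "date_expected": pd.to_datetime(["2019-05-14", "2019-05-14", "2019-05-16",
                                     "2019-05-14", "2019-05-16"]),
    "quantity_picked": [4.0, 2.0, 2.0, 3.0, 1.0],
})

# Every calendar day in the range of interest (assumption: min..max of the data).
all_dates = pd.date_range(df["date_expected"].min(),
                          df["date_expected"].max(), freq="D")

# All UPC x date combinations, including those with no picks.
full_index = pd.MultiIndex.from_product(
    [df["UPC"].unique(), all_dates], names=["UPC", "date_expected"]
)

# Reindex onto the full grid; combinations missing from the data get 0.
filled = (
    df.set_index(["UPC", "date_expected"])["quantity_picked"]
      .reindex(full_index, fill_value=0.0)
      .reset_index()
)
print(filled)
```

Unlike the pivot/stack route, this also produces rows for dates on which no UPC was picked at all (2019-05-15 in the sample), since the grid is built from a calendar range rather than from the dates present in the data.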