Constructing time series from a MultiIndex DataFrame in pandas that satisfy certain conditions

Time: 2019-08-12 14:58:44

Tags: python pandas

Suppose I have the following log from a pizza shop:

import pandas as pd

csv = [
    ['2019-05-01', '2019-05-01 18:30', 'pepperoni', 'small'],
    ['2019-05-01', '2019-05-01 21:00', 'pineapple', 'big'],
    ['2019-05-01', '2019-05-01 22:30', 'pepperoni', 'big'],
    ['2019-05-02', '2019-05-02 19:00', 'pineapple', 'small'],
    ['2019-05-02', '2019-05-02 20:30', 'pineapple', 'big'],
    ['2019-05-02', '2019-05-02 23:00', 'pepperoni', 'small']]

df = pd.DataFrame(csv, columns=["Working day", "Time of order", "Pizza type", "Pizza size"])
df["Working day"] = (pd.to_datetime(df["Working day"]))
df["Time of order"] = (pd.to_datetime(df["Time of order"]))
df = df.set_index(['Working day','Time of order'])

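Now I have a MultiIndex DataFrame and I want to run some analysis on it. To do that, I would like to build time series over the first index level (Working day) by applying conditions to the second index level (Time of order) or to the other columns.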

For example, some desired outputs:

For each day, the order closest to 19:00:00

                                Pizza type Pizza size
Working day Time of order                            
2019-05-01  2019-05-01 18:30:00  pepperoni      small
2019-05-02  2019-05-02 19:00:00  pineapple      small

For each day, the first order at or after 19:00:00

                                Pizza type Pizza size
Working day Time of order                            
2019-05-01  2019-05-01 21:00:00  pineapple        big
2019-05-02  2019-05-02 19:00:00  pineapple      small

For each day, the latest order of a big pizza:

                                Pizza type Pizza size
Working day Time of order                            
2019-05-01  2019-05-01 22:30:00  pepperoni        big
2019-05-02  2019-05-02 20:30:00  pineapple        big

For each day, the order placed at 22:30:00

                                Pizza type Pizza size
Working day Time of order                            
2019-05-01  2019-05-01 22:30:00  pepperoni        big
2019-05-02  NaT                        NaN        NaN

And so on. How can I do this?

1 answer:

Answer 0 (score: 0)

Instead of using a MultiIndex, try computing the time difference directly on the Time of order column:

### Skipping the `df = df.set_index(['Working day','Time of order'])` step:

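# Signed difference in seconds from 19:00 on the same working day.
# pd.to_datetime('19:00') alone resolves to 19:00 on today's date, and
# .dt.seconds drops the sign, so build the target per row instead.
target = df['Working day'] + pd.Timedelta(hours=19)
df['time_difference'] = (df['Time of order'] - target).dt.total_seconds()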

The .dt accessor can be used to extract information from pandas datetime objects (https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dt-accessors).
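As a small illustration of the .dt accessor (the variable names here are made up for the example, not taken from the question), the sketch below pulls out the hour, strips the time of day, and shows why `.dt.total_seconds()` is usually safer than `.dt.seconds` when a difference can be negative:

```python
import pandas as pd

# Two illustrative timestamps (hypothetical sample values)
times = pd.Series(pd.to_datetime(["2019-05-01 18:30", "2019-05-02 23:00"]))

hours = times.dt.hour                 # integer hour of each timestamp
days = times.dt.normalize()           # same dates with the time set to midnight
since_midnight = (times - days).dt.total_seconds()

# For a negative timedelta, .dt.seconds is only the seconds *component*
# (days=-1, seconds=84600), while total_seconds() keeps the sign.
delta = pd.Series([pd.Timedelta(minutes=-30)])
component = delta.dt.seconds          # 84600
signed = delta.dt.total_seconds()     # -1800.0
```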

Once the difference has been computed, the new time_difference column can be used to answer some of the specific questions.
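One possible way to go from that column to the four outputs in the question is sketched below. It is not necessarily what the answer had in mind: it assumes time_difference holds signed seconds relative to 19:00 on the same working day, and the intermediate names (`target`, `closest`, `first_after`, `latest_big`, `exact`) are introduced only for this example.

```python
import datetime
import pandas as pd

rows = [
    ['2019-05-01', '2019-05-01 18:30', 'pepperoni', 'small'],
    ['2019-05-01', '2019-05-01 21:00', 'pineapple', 'big'],
    ['2019-05-01', '2019-05-01 22:30', 'pepperoni', 'big'],
    ['2019-05-02', '2019-05-02 19:00', 'pineapple', 'small'],
    ['2019-05-02', '2019-05-02 20:30', 'pineapple', 'big'],
    ['2019-05-02', '2019-05-02 23:00', 'pepperoni', 'small']]
df = pd.DataFrame(rows, columns=["Working day", "Time of order", "Pizza type", "Pizza size"])
df["Working day"] = pd.to_datetime(df["Working day"])
df["Time of order"] = pd.to_datetime(df["Time of order"])

# Assumption: signed seconds relative to 19:00 on the same working day
target = df["Working day"] + pd.Timedelta(hours=19)
df["time_difference"] = (df["Time of order"] - target).dt.total_seconds()

# 1) Per day, the order closest to 19:00 (smallest absolute difference)
closest = df.loc[df["time_difference"].abs().groupby(df["Working day"]).idxmin()]

# 2) Per day, the first order at or after 19:00
after = df[df["time_difference"] >= 0]
first_after = after.loc[after.groupby("Working day")["time_difference"].idxmin()]

# 3) Per day, the latest order of a big pizza
big = df[df["Pizza size"] == "big"]
latest_big = big.loc[big.groupby("Working day")["time_difference"].idxmax()]

# 4) Per day, the order placed at exactly 22:30; days without one become NaT/NaN.
#    This assumes at most one such order per day, since reindex needs unique labels.
at_2230 = df[df["Time of order"].dt.time == datetime.time(22, 30)]
all_days = pd.Index(df["Working day"].unique(), name="Working day")
exact = at_2230.set_index("Working day").reindex(all_days)
```

If the MultiIndex layout from the question is wanted afterwards, calling `set_index(["Working day", "Time of order"])` on the first three results should restore it. The fourth result has a missing Time of order for 2019-05-02, so it is left indexed by Working day alone here.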