Question

I have the following df:

import pandas as pd
foo = pd.DataFrame({'start_date':['2019-09-30', '2020-01-01', '2020-01-02', '2020-02-02'], 
                    'end_date': ['2019-10-13', '2020-01-30', '2020-01-03', '2020-03-03'], 
                    'index': [1, 1, 3, 4],
                    'quantity': [100, 200, 113, 3123]})

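I would like to convert this df from start and end dates into one row per date, with a single date column, and split the quantity evenly across the days in each span.

What I currently have, and which works, is: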

import datetime as dt
df = pd.DataFrame()
count = 0
foo['start_date'] = pd.to_datetime(foo['start_date'], errors='coerce')
foo['end_date'] = pd.to_datetime(foo['end_date'], errors='coerce')

for i, row in foo.iterrows():
    count = count + 1
    start = row['start_date']
    end = row['end_date']
    span = end - start + dt.timedelta(days=1)
        
    #loop through start and end dates. 
    for d in range(span.days):
        day = start + dt.timedelta(days=d)
        df = df.append({'index': row['index'],
                        'date': day,
                         'quantity': row['quantity'] / span.days,
                         'line_item': count},
                          ignore_index=True)

print(df)
    date  index  line_item    quantity
    0  2019-09-30    1.0        1.0    7.142857
    1  2019-10-01    1.0        1.0    7.142857
    2  2019-10-02    1.0        1.0    7.142857
    3  2019-10-03    1.0        1.0    7.142857
    4  2019-10-04    1.0        1.0    7.142857
    ..        ...    ...        ...         ...
    72 2020-02-28    4.0        4.0  100.741935
    73 2020-02-29    4.0        4.0  100.741935
    74 2020-03-01    4.0        4.0  100.741935
    75 2020-03-02    4.0        4.0  100.741935
    76 2020-03-03    4.0        4.0  100.741935

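As you can imagine, this quickly becomes very slow when the initial df has tens of thousands of rows. Because the start and end dates do not follow any particular pattern, I could not apply the approaches I have seen so far, such as explode or resample.

I have also tried some small tweaks, such as doing the division and the day count up front, but I did not see a noticeable reduction in the time required.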

Answer 1

This works:

df = foo.copy() # copy foo to new df
df['date'] = [pd.date_range(s, e, freq='d') for s, e in zip(pd.to_datetime(df['start_date']), pd.to_datetime(df['end_date']))] # create DatetimeIndex from start date and end date
df['quantity'] = df.apply(lambda x: x.quantity/len(x.date), axis=1) # calculate quantity per date 
df = df.explode('date').drop(['start_date', 'end_date'], axis=1) # explode by date

Output of df.head():

index	quantity	date
1	7.14286	2019-09-30 00:00:00
1	7.14286	2019-10-01 00:00:00
1	7.14286	2019-10-02 00:00:00
1	7.14286	2019-10-03 00:00:00
1	7.14286	2019-10-04 00:00:00
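
If the row-wise apply and the per-row date_range calls are still too slow on large inputs, the same expansion can be sketched with array operations only. This is a variant under assumptions, not part of the answer above: the names n_days, row, offset and out are made up for illustration, and the line_item column is rebuilt as the 1-based position of the source row, as in the question's loop.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'start_date': ['2019-09-30', '2020-01-01', '2020-01-02', '2020-02-02'],
                    'end_date': ['2019-10-13', '2020-01-30', '2020-01-03', '2020-03-03'],
                    'index': [1, 1, 3, 4],
                    'quantity': [100, 200, 113, 3123]})

start = pd.to_datetime(foo['start_date'])
end = pd.to_datetime(foo['end_date'])

# Inclusive number of days in each span (end date counts as a day)
n_days = ((end - start).dt.days + 1).to_numpy()

# Position of the source row for every output row: 0,0,...,1,1,...
row = np.repeat(np.arange(len(foo)), n_days)

# Day offset inside each span: 0, 1, ..., n_days - 1, restarting per source row
offset = np.arange(n_days.sum()) - np.repeat(n_days.cumsum() - n_days, n_days)

out = pd.DataFrame({
    'index': foo['index'].to_numpy()[row],
    'date': start.to_numpy()[row] + offset.astype('timedelta64[D]'),
    'quantity': (foo['quantity'].to_numpy() / n_days)[row],  # even split per day
    'line_item': row + 1,  # assumption: 1-based source row, like `count` in the question
})

print(out.head())
```

A quick sanity check is that out.groupby('line_item')['quantity'].sum() gives back the original quantities. Whether this is actually faster than the explode version depends on the data, so it is worth timing both on a realistic sample.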
