Pandas: duplicate rows and fill in the dates

Time: 2021-03-04 21:21:02

Tags: python pandas datetime

I am stuck on the last piece of my code.
I have a data frame with start_date, end_date and some more values.

import pandas as pd
from datetime import datetime

data = {'start_date':['2021-02-09','2021-02-12','2021-02-10','2021-02-09'], 
        'end_date':['2021-02-12','2021-02-13','2021-02-10','2021-02-11'],
        'name':['Fender','Gibson','PRS','Martin']}
source = pd.DataFrame(data)
source.start_date=pd.to_datetime(source.start_date)
source.end_date=pd.to_datetime(source.end_date)
print(source)

   start_date   end_date    name  
 0 2021-02-09 2021-02-12  Fender  
 1 2021-02-12 2021-02-13  Gibson  
 2 2021-02-10 2021-02-10     PRS  
 3 2021-02-09 2021-02-11  Martin  

The goal is a data frame with one row per day, so that each date range is split into its individual days:

         date    name
0  2021-02-09  Fender
1  2021-02-10  Fender
2  2021-02-11  Fender
3  2021-02-12  Fender
4  2021-02-12  Gibson
5  2021-02-13  Gibson
6  2021-02-10     PRS
7  2021-02-09  Martin
8  2021-02-10  Martin
9  2021-02-11  Martin

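I need to duplicate each row x times, where x is the number of days between the two dates, and create a new date column that holds the specific day. (Later I need to filter out Saturdays and Sundays.) This is what I have so far, and I think I am not far off.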

def splitspans(df):
    df['number'] = (df.end_date - df.start_date).dt.days + 1  # inclusive day count; .dt.days also works on pandas 2.x
    df = pd.DataFrame(df.values.repeat(df['number'], axis=0), columns=df.columns)
    df = df[['start_date','end_date','name']]
    return df

print(splitspans(source))

  start_date   end_date    name
0 2021-02-09 2021-02-12  Fender
1 2021-02-09 2021-02-12  Fender
2 2021-02-09 2021-02-12  Fender
3 2021-02-09 2021-02-12  Fender
4 2021-02-12 2021-02-13  Gibson
5 2021-02-12 2021-02-13  Gibson
6 2021-02-10 2021-02-10     PRS
7 2021-02-09 2021-02-11  Martin
8 2021-02-09 2021-02-11  Martin
9 2021-02-09 2021-02-11  Martin



2 answers:

Answer 0: (score: 2)

You can first build, for each row, a list of the days between start and end, and then explode it:

from datetime import timedelta

df = source.copy()  # work on a copy so source is left unchanged
df['days'] = (df['end_date'] - df['start_date']).dt.days
df['dates_between'] = df.apply(lambda row: [row['start_date'] + timedelta(days=d) for d in range(row['days']+1)], axis=1)
del df['days']
df.explode('dates_between')

which gives


    start_date  end_date    name    dates_between
0   2021-02-09  2021-02-12  Fender  2021-02-09
0   2021-02-09  2021-02-12  Fender  2021-02-10
0   2021-02-09  2021-02-12  Fender  2021-02-11
0   2021-02-09  2021-02-12  Fender  2021-02-12
1   2021-02-12  2021-02-13  Gibson  2021-02-12
1   2021-02-12  2021-02-13  Gibson  2021-02-13
2   2021-02-10  2021-02-10  PRS     2021-02-10
3   2021-02-09  2021-02-11  Martin  2021-02-09
3   2021-02-09  2021-02-11  Martin  2021-02-10
3   2021-02-09  2021-02-11  Martin  2021-02-11
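
To get from the exploded frame to the shape asked for in the question (only date and name, with weekends dropped), something like the following sketch might work. It is not part of the original answer: the names exploded, result and weekdays are made up for illustration, and it builds the per-row ranges with pd.date_range rather than the timedelta list above. The explicit pd.to_datetime call is there on the assumption that explode can leave the column with object dtype.

```python
import pandas as pd

# Same sample data as in the question
source = pd.DataFrame({
    "start_date": pd.to_datetime(["2021-02-09", "2021-02-12", "2021-02-10", "2021-02-09"]),
    "end_date": pd.to_datetime(["2021-02-12", "2021-02-13", "2021-02-10", "2021-02-11"]),
    "name": ["Fender", "Gibson", "PRS", "Martin"],
})

# One list of days per row, then one row per day
exploded = source.assign(
    date=[pd.date_range(s, e) for s, e in zip(source["start_date"], source["end_date"])]
).explode("date", ignore_index=True)

# Make sure the exploded column is a real datetime column again
exploded["date"] = pd.to_datetime(exploded["date"])

# Keep only the columns from the desired output
result = exploded[["date", "name"]]

# dayofweek: Monday=0 ... Sunday=6, so values below 5 are weekdays
weekdays = result[result["date"].dt.dayofweek < 5].reset_index(drop=True)
print(weekdays)
```

With the sample data, the only row removed by the weekday filter should be Gibson on 2021-02-13, which is a Saturday.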

Answer 1: (score: 1)

@piterbarg's solution was the inspiration for this one; I used pandas date_range:

(
    source.assign(dates=[ pd.date_range(start, end) 
                          for start, end 
                          in zip(source.start_date, source.end_date)]
                 )
          .explode("dates") # you can pass ignore_index=True here
)

  start_date   end_date    name      dates
0 2021-02-09 2021-02-12  Fender 2021-02-09
0 2021-02-09 2021-02-12  Fender 2021-02-10
0 2021-02-09 2021-02-12  Fender 2021-02-11
0 2021-02-09 2021-02-12  Fender 2021-02-12
1 2021-02-12 2021-02-13  Gibson 2021-02-12
1 2021-02-12 2021-02-13  Gibson 2021-02-13
2 2021-02-10 2021-02-10     PRS 2021-02-10
3 2021-02-09 2021-02-11  Martin 2021-02-09
3 2021-02-09 2021-02-11  Martin 2021-02-10
3 2021-02-09 2021-02-11  Martin 2021-02-11
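
Since the question also mentions dropping Saturdays and Sundays later, a possible variation is to swap pd.date_range for pd.bdate_range, which only generates business days (Monday to Friday by default). This is a sketch rather than part of the original answer; the name business is made up for illustration. One assumption to be aware of: a range that falls entirely on a weekend yields an empty index, which explode turns into a missing value, so the sketch drops those rows.

```python
import pandas as pd

# Same sample data as in the question
source = pd.DataFrame({
    "start_date": pd.to_datetime(["2021-02-09", "2021-02-12", "2021-02-10", "2021-02-09"]),
    "end_date": pd.to_datetime(["2021-02-12", "2021-02-13", "2021-02-10", "2021-02-11"]),
    "name": ["Fender", "Gibson", "PRS", "Martin"],
})

business = (
    source.assign(
        # bdate_range skips Saturdays and Sundays within each range
        date=[pd.bdate_range(s, e) for s, e in zip(source["start_date"], source["end_date"])]
    )
    .explode("date", ignore_index=True)
    # Weekend-only ranges would explode to a missing value; drop them
    .dropna(subset=["date"])
)

# Restore a datetime dtype and keep only the columns from the desired output
business["date"] = pd.to_datetime(business["date"])
business = business[["date", "name"]].reset_index(drop=True)
print(business)
```

Whether this is preferable to filtering afterwards depends on whether the weekend rows are needed at any intermediate step.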