Pandas time series window labeling

Time: 2017-10-05 16:10:53

Tags: python pandas time-series

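I currently have a process for windowing time series data, but I'm wondering whether there is a vectorized, in-place approach, for performance/resource reasons.

I have two lists holding the start and end dates of 30-day windows: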

start_dts = [2014-01-01,...]
end_dts = [2014-01-30,...]

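My DataFrame contains a field named 'transaction_dt'.

What I'm trying to accomplish is a way to add two new columns ('start_dt' and 'end_dt') to each row, filled in when transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and done in place, if possible.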

Edit

As requested, here is some sample data in my format:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25

2 Answers:

Answer 0: (score: 0)

IIUC

Using IntervalIndex

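# Index the window table by closed intervals [Start, End]
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
# Select the columns explicitly so the values line up regardless of df2's column order
df[['End','Start']]=df2.loc[df['transaction_dt'],['End','Start']].values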


df
Out[457]: 
  transaction_dt        End      Start
0     2017-01-02 2017-01-31 2017-01-01
1     2017-03-02 2017-03-31 2017-03-01
2     2017-04-02 2017-04-30 2017-04-01
3     2017-05-02 2017-05-31 2017-05-01

Data input:

df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
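
The `.loc` lookup above raises a `KeyError` as soon as one transaction date falls outside every window. A minimal sketch of a variant that tolerates such rows, assuming the windows do not overlap, uses `IntervalIndex.get_indexer`, which returns -1 for unmatched values. The names `windows`, `tx`, `pos`, and `matched` are illustrative and not part of the answer above.

```python
import pandas as pd

# Hypothetical window table and transactions (illustrative names and data).
windows = pd.DataFrame({
    "Start": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-03-01"]),
    "End": pd.to_datetime(["2017-01-31", "2017-02-28", "2017-03-31"]),
})
tx = pd.DataFrame({
    "transaction_dt": pd.to_datetime(["2017-01-02", "2017-03-02", "2017-06-15"]),
})

# Closed intervals [Start, End]; get_indexer requires them to be non-overlapping.
intervals = pd.IntervalIndex.from_arrays(windows["Start"], windows["End"], closed="both")

# Position of the containing window for each date, or -1 when no window contains it.
pos = intervals.get_indexer(tx["transaction_dt"])
matched = pos >= 0

# Take the window bounds by position, then blank out (NaT) the unmatched rows.
# A position of -1 temporarily picks the last window; .where() masks those rows.
for col in ["Start", "End"]:
    picked = pd.Series(windows[col].to_numpy()[pos], index=tx.index)
    tx[col] = picked.where(matched)

print(tx)
```

Rows with no containing window end up with `NaT` in both new columns instead of aborting the whole assignment.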

Answer 1: (score: 0)

If you want the start and end, we can use the approach from "Extracting the first day of month of a datetime type column in pandas":

import io
import pandas as pd
import datetime

string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""

df = pd.read_csv(io.StringIO(string))

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])

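# MonthEnd(0) rolls forward to the month end and leaves a date that is
# already the last day of the month unchanged
month_end = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(0)
# Stepping back one MonthBegin from the month end gives the first day of that
# month, which also works for transactions dated on the 1st
df["start"] = month_end - pd.offsets.MonthBegin(1)
df["end"] = month_end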

df

Returns

   customer_id transaction_dt product  price  units      start        end
0            1     2004-01-02  thing1     25     47 2004-01-01 2004-01-31
1            1     2004-01-17  thing2    150      8 2004-01-01 2004-01-31
2            2     2004-01-29  thing2    150     25 2004-01-01 2004-01-31
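
As a cross-check for calendar-month windows, here is a minimal sketch that goes through monthly periods instead of offsets. It is an alternative under the assumption that the windows are calendar months, not the method used in this answer, and the `demo` frame is illustrative. Dates on the first or last day of a month are included because those are the cases where offset arithmetic is easiest to get wrong.

```python
import pandas as pd

# Illustrative dates, including the first and last day of a month.
demo = pd.DataFrame({
    "transaction_dt": pd.to_datetime(["2004-01-01", "2004-01-17", "2004-01-31"]),
})

# Convert each timestamp to its calendar-month period.
month = demo["transaction_dt"].dt.to_period("M")

# start_time is midnight on the first day of the month.
demo["start"] = month.dt.start_time
# end_time is the last instant of the month, so normalize it back to midnight.
demo["end"] = month.dt.end_time.dt.normalize()

print(demo)
```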

New approach

import io
import pandas as pd
import datetime

string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""

df = pd.read_csv(io.StringIO(string))

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])

# Get all timestamps that are necessary
# This assumes dates are sorted 
# if not we should change [0] -> min_dt and [-1] --> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1]+datetime.timedelta(days=30))

# We store all ranges here
ranges = list(zip(timestamps,timestamps[1:]))

# Loop through all values and add to column start and end
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if start <= value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, drop the ranges before it.
            # This is only valid if the dates are sorted; remove it otherwise.
            # It should speed things up for large datasets.
            del ranges[:i]
            # Stop scanning: the match is found and the list was just modified
            break
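
The nested loop above can probably be avoided when every window has the same fixed width. The following is a minimal sketch under that assumption: windows are exactly 30 days wide and the first one starts at a known `origin`. The names `demo`, `origin`, `width`, and `window_no` are illustrative. Unlike the loop, it treats windows as half-open, so a date that falls exactly on a boundary belongs to the later window.

```python
import pandas as pd

demo = pd.DataFrame({
    "transaction_dt": pd.to_datetime(["2004-01-02", "2004-01-17", "2004-06-29"]),
})

# Assumption: fixed 30-day windows, the first one starting on `origin`.
origin = pd.Timestamp("2004-01-01")
width = pd.Timedelta(days=30)

# Number of whole windows between the origin and each transaction date.
window_no = (demo["transaction_dt"] - origin) // width

# Half-open windows [start, end): a boundary date starts a new window.
demo["start"] = origin + window_no * width
demo["end"] = demo["start"] + width

print(demo)
```

This computes both columns with whole-column arithmetic, so it does not depend on the rows being sorted.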