Question

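I currently have a process for windowing time series data, but I am wondering whether there is a vectorized, in-place approach, for performance/resource reasons.

I have two lists holding the start and end dates of 30-day windows: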

start_dts = [2014-01-01, ...]
end_dts = [2014-01-30, ...]

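My dataframe contains a field named 'transaction_dt'.

What I am trying to accomplish is a way to add two new columns ('start_dt' and 'end_dt') to each row, filled in when transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally this would be vectorized and done in place, if possible.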

Edit

As requested, here is some sample data in my format:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25

Answer 1

IIUC

Using IntervalIndex

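# Index the windows by interval so that .loc can look up the window containing each date
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
# Select the columns explicitly so the result does not depend on df2's column order
df[['End','Start']]=df2.loc[df['transaction_dt'],['End','Start']].values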


df
Out[457]: 
  transaction_dt        End      Start
0     2017-01-02 2017-01-31 2017-01-01
1     2017-03-02 2017-03-31 2017-03-01
2     2017-04-02 2017-04-30 2017-04-01
3     2017-05-02 2017-05-31 2017-05-01

Data input:

df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
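
One caveat with the lookup above: as far as I know, `df2.loc[...]` raises a `KeyError` when a transaction date falls outside every window. A minimal sketch of a variant that tolerates such rows uses `IntervalIndex.get_indexer`, which returns -1 for dates with no containing window. The names `windows`, `tx`, `pos` and `matched` are illustrative, not from the answer, and the windows are assumed not to overlap.

```python
import pandas as pd

# Hypothetical sample data: three month-long windows and three transactions,
# the last of which falls outside every window.
windows = pd.DataFrame({
    "Start": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-03-01"]),
    "End": pd.to_datetime(["2017-01-31", "2017-02-28", "2017-03-31"]),
})
tx = pd.DataFrame({
    "transaction_dt": pd.to_datetime(["2017-01-02", "2017-03-02", "2018-07-15"])
})

idx = pd.IntervalIndex.from_arrays(windows["Start"], windows["End"], closed="both")

# Position of the window containing each date, or -1 when there is none
pos = idx.get_indexer(tx["transaction_dt"])
matched = pos >= 0

# Take the window bounds by position, then blank out unmatched rows (NaT)
for col in ["Start", "End"]:
    values = windows[col].to_numpy()[pos]
    tx[col] = pd.Series(values, index=tx.index).where(matched)
```

Rows without a window end up with `NaT` in both columns, so they can be filtered or reported afterwards.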

Answer 2

If you want the start and end, we can use the approach from "Extracting the first day of month of a datetime type column in pandas":

import io
import pandas as pd
import datetime

string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""

df = pd.read_csv(io.StringIO(string))

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])

df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)

df

Returns

   customer_id transaction_dt product  price  units      start        end
0            1     2004-01-02  thing1     25     47 2004-01-01 2004-01-31
1            1     2004-01-17  thing2    150      8 2004-01-01 2004-01-31
2            2     2004-01-29  thing2    150     25 2004-01-01 2004-01-31
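
Note that the offset arithmetic above has edge cases: subtracting `MonthBegin(1)` from a date that is already the first of a month rolls back to the previous month, and adding `MonthEnd(1)` to a date that is already a month end rolls forward to the next one. A possible alternative, assuming calendar-month windows are what is wanted, is to go through monthly periods. The sample dates below are made up to exercise those edge cases.

```python
import pandas as pd

# Hypothetical dates, including a first-of-month and an end-of-month value
df = pd.DataFrame({
    "transaction_dt": pd.to_datetime(["2004-01-02", "2004-02-01", "2004-02-29"])
})

# Map every timestamp to its calendar month
month = df["transaction_dt"].dt.to_period("M")

df["start"] = month.dt.start_time             # first day of that month
df["end"] = month.dt.end_time.dt.normalize()  # last day of that month, at midnight
```

Both columns are computed for the whole column at once, with no Python-level loop.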

New approach:

import io
import pandas as pd
import datetime

string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""

df = pd.read_csv(io.StringIO(string))

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])

# Get all timestamps that are necessary
# This assumes dates are sorted 
# if not we should change [0] -> min_dt and [-1] --> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1]+datetime.timedelta(days=30))

# We store all ranges here
ranges = list(zip(timestamps,timestamps[1:]))

# Loop through all values and add to column start and end
for ind,value in enumerate(df["transaction_dt"]):
    for i,(start,end) in enumerate(ranges):
        if (value >= start and value <= end):
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # Dates are assumed sorted, so ranges before the match
            # can never match a later row: drop them to speed things
            # up for large datasets.
            # Remove this pruning if dates are not sorted.
            del ranges[:i]
            # Stop scanning: the list was just modified and a match was found
            break
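
Since the question asks for a vectorized version, here is a rough sketch of the same fixed 30-day windowing done with integer arithmetic instead of nested loops. It is not the answer's code: `origin`, `window` and `k` are illustrative names, the origin is assumed to be the first day of the month of the earliest transaction, and windows are treated as half-open [start, start + 30 days), so a date exactly on a boundary lands in the later window (the loop above would assign it to the earlier one).

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_dt": pd.to_datetime(["2004-01-02", "2004-01-17", "2004-06-29"])
})

window = pd.Timedelta(days=30)

# Assumed origin: first day of the month of the earliest transaction
origin = df["transaction_dt"].min().to_period("M").start_time

# Integer index of the 30-day window each transaction falls into
k = (df["transaction_dt"].dt.floor("D") - origin) // window

df["start"] = origin + k * window
df["end"] = df["start"] + window
```

This does not require the dates to be sorted, and the cost grows linearly with the number of rows rather than with rows times windows.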

Tags: pandas, time series, window