Pandas cumulative time-series ranges in a DataFrame

Time: 2016-08-21 15:32:45

Tags: python datetime pandas time-series

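I want to build "expanding" date ranges based on the values in the start time and end time columns.

If any part of a record overlaps a previous record, I want to return a start time that is the minimum of the two records' start times, and an end time that is the maximum of the two records' end times.

These are grouped by order ID:
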
Order starttime             endtime                 RollingStart            RollingEnd
1   2015-07-01 10:24:43.047 2015-07-01 10:24:43.150 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150
1   2015-07-01 10:24:43.137 2015-07-01 10:24:43.200 2015-07-01 10:24:43.047 2015-07-01 10:24:43.200
1   2015-07-01 10:24:43.197 2015-07-01 10:24:57.257 2015-07-01 10:24:43.047 2015-07-01 10:24:57.257
1   2015-07-01 10:24:57.465 2015-07-01 10:25:13.470 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470
1   2015-07-01 10:24:57.730 2015-07-01 10:25:13.485 2015-07-01 10:24:57.465 2015-07-01 10:25:13.485
2   2015-07-01 10:48:57.465 2015-07-01 10:48:13.485 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485

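So in the example above, order 1 has an initial range from 2015-07-01 10:24:43.047 to 2015-07-01 10:24:57.257, and then another range from 2015-07-01 10:24:57.465 to 2015-07-01 10:25:13.485.

Note that although the start times are ordered, the end times are not necessarily ordered, because of the nature of the data (there are short events and long events).

Finally, I only want the last record for each order ID and RollingStart combination (so in this case, the last two records).

I tried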

df['RollingStart'] = np.where((df['endtime'] >= df['RollingStart'].shift()) & (df['RollingEnd'].shift()>= df['starttime']), min(df['starttime'],df['RollingStart']),df['starttime'])

(This obviously does not take the order ID into account.)

But the error I get is

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any ideas would be greatly appreciated.

Code to reproduce is below:

from io import StringIO

import numpy as np
import pandas as pd

text = """Order   starttime               endtime
1       2015-07-01 10:24:43.047  2015-07-01 10:24:43.150
1       2015-07-01 10:24:43.137  2015-07-01 10:24:43.200
1       2015-07-01 10:24:43.197  2015-07-01 10:24:57.257
1       2015-07-01 10:24:57.465  2015-07-01 10:25:13.470
1       2015-07-01 10:24:57.730  2015-07-01 10:25:13.485
2       2015-07-01 10:48:57.465  2015-07-01 10:48:13.485"""

df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[1, 2])

# Seed the rolling columns with each record's own start and end time
df['RollingStart'] = df['starttime']
df['RollingEnd'] = df['endtime']
# This is the line that raises the ValueError
df['RollingStart'] = np.where((df['endtime'] >= df['RollingStart'].shift()) & (df['RollingEnd'].shift() >= df['starttime']), min(df['starttime'], df['RollingStart']), df['starttime'])

The error is:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 731, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks

2 answers:

Answer 0: (score: 1)

It looks like you are trying to return values based on values that have not been set yet:

df['start'] =...conditions... df['start'].shift()

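It seems to me that you are setting a condition on a column that pandas does not know anything about yet.

If you just want to set the "start" value to the latest time across those columns, try building the statement with an or condition, or create a temporary array and take its max: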

df['start'] = np.maximum(df['enddatetime'], df['startdatetime'])

If the above is off the mark, do you have code that reproduces this df, so I can check whether I get the same error?
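
As a small illustration of why the traceback ends in `__nonzero__`: Python's built-in `min` and `max` compare their two arguments with `<` and then need a single True/False, which a Series cannot provide. The sketch below uses two made-up Series (`a` and `b` are illustrative names, not columns from the question) and shows the element-wise NumPy alternatives.

```python
import numpy as np
import pandas as pd

# Two small illustrative Series of timestamps (not the question's columns)
a = pd.Series(pd.to_datetime(["2015-07-01 10:24:43.047", "2015-07-01 10:24:57.465"]))
b = pd.Series(pd.to_datetime(["2015-07-01 10:24:43.137", "2015-07-01 10:24:43.197"]))

# The built-in min() evaluates `b < a` and then asks for its truth value,
# which is ambiguous for a Series and raises ValueError
try:
    min(a, b)
    builtin_failed = False
except ValueError:
    builtin_failed = True

# Element-wise alternatives: compare row by row and return a Series
elementwise_min = np.minimum(a, b)
elementwise_max = np.maximum(a, b)

print(builtin_failed)
print(elementwise_min)
print(elementwise_max)
```

Swapping `min(...)` for `np.minimum(...)` removes the exception, but a single vectorized `np.where` still cannot feed its own earlier results into later rows, so it does not produce a true rolling range on its own.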

Answer 1: (score: 0)

Try this:

Version 1

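NaT = pd.NaT
# Flag rows whose successor starts after this row ends (a gap) with NaT, all others with 'drop'
df['Rolling2']     = np.where(df['starttime'].shift(-1) > df['endtime'], NaT,'drop')
# Shift down so each row carries the flag computed for the previous row
df['Rolling2']     = df['Rolling2'].shift(1)
# Rows flagged 'drop' overlap the previous row: blank their start, then forward-fill it
df['RollingStart'] = np.where(df['Rolling2']  =='drop',None,df['starttime'])
df['RollingStart'] = pd.to_datetime(df['RollingStart']).ffill()
df['RollingEnd']   = df['endtime']
del df['Rolling2']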

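Version 2

df['RollingStart'] = df['starttime']
df['RollingEnd']   = df['endtime']
# Where the previous row's end reaches this row's start (overlap), blank the start
df['RollingStart'] = np.where(df['RollingEnd'].shift()>= df['starttime'] ,pd.NaT , df['RollingStart'])
# Forward-fill so overlapping rows inherit the start of their range
df['RollingStart'] = pd.to_datetime(df['RollingStart']).ffill()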


  Order               starttime                 endtime            RollingStart              RollingEnd
0      1 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150
1      1 2015-07-01 10:24:43.137 2015-07-01 10:24:43.200 2015-07-01 10:24:43.047 2015-07-01 10:24:43.200
2      1 2015-07-01 10:24:43.197 2015-07-01 10:24:57.257 2015-07-01 10:24:43.047 2015-07-01 10:24:57.257
3      1 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470
4      1 2015-07-01 10:24:57.730 2015-07-01 10:25:13.485 2015-07-01 10:24:57.465 2015-07-01 10:25:13.485
5      2 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485
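
Both versions above compare each row only with the row immediately before it, ignore the Order column, and leave RollingEnd equal to endtime. One possible way to cover those cases is sketched here: it tracks the running maximum end time within each order and starts a new range whenever a record begins after everything seen so far. The helper name `add_rolling_ranges` and the intermediate names `prev_end` and `range_id` are illustrative choices, not part of the question or of pandas, and the sketch assumes a reasonably recent pandas in which grouped `cummax` accepts datetime columns.

```python
import io

import pandas as pd

text = """Order   starttime               endtime
1       2015-07-01 10:24:43.047  2015-07-01 10:24:43.150
1       2015-07-01 10:24:43.137  2015-07-01 10:24:43.200
1       2015-07-01 10:24:43.197  2015-07-01 10:24:57.257
1       2015-07-01 10:24:57.465  2015-07-01 10:25:13.470
1       2015-07-01 10:24:57.730  2015-07-01 10:25:13.485
2       2015-07-01 10:48:57.465  2015-07-01 10:48:13.485"""

df = pd.read_csv(io.StringIO(text), sep=r"\s{2,}", engine="python",
                 parse_dates=["starttime", "endtime"])


def add_rolling_ranges(frame):
    """Hypothetical helper: add RollingStart/RollingEnd per order."""
    out = frame.sort_values(["Order", "starttime"]).reset_index(drop=True)

    # Latest end time seen so far within each order; this copes with a long
    # event that outlasts the shorter events that start after it
    running_end = out.groupby("Order")["endtime"].cummax()

    # The same running maximum as of the previous row of the same order
    # (NaT on the first row of every order)
    prev_end = running_end.groupby(out["Order"]).shift()

    # A new range starts on the first row of an order, or when a record
    # starts after every end time seen so far in that order
    new_range = prev_end.isna() | (out["starttime"] > prev_end)
    range_id = new_range.cumsum().rename("range_id")

    # Starts are sorted, so the minimum start of a range is its first start
    out["RollingStart"] = out.groupby(range_id)["starttime"].transform("min")
    out["RollingEnd"] = out.groupby(range_id)["endtime"].cummax()
    return out


result = add_rolling_ranges(df)

# Keep only the last record of each (Order, RollingStart) combination
last_per_range = result.groupby(["Order", "RollingStart"]).tail(1)

print(result)
print(last_per_range)
```

On the sample data this should reproduce the RollingStart and RollingEnd columns from the question. The comparison `starttime > prev_end` treats a record that starts exactly when the previous range ends as part of that range, matching the `>=` used in the question's own condition.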