Flagging rows of a pandas DataFrame whose date range contains a datetime from a list

Date: 2019-07-31 08:28:50

Tags: python pandas

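I have looked around (e.g. Python - Locating the closest timestamp) but could not find anything that fits.

I have a list of datetimes, and a DataFrame of 10k+ rows with start and end times (stored as datetimes).

The DataFrame effectively lists the parameters for each run of an instrument.

The list describes the times at which alarm events occurred.

Every datetime in the list falls within one row of the DataFrame (i.e. between that row's start time and end time). Is there an easy way to locate the rows whose time range contains an alarm time? (Apologies for the poor wording!)

For example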

for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'

(This doesn't work, but it shows my approach)

Example dataset

import pandas as pd

# making list of datetimes for the alarms (the strings are day-first: dd/mm/yy)
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'], dayfirst=True)
alarms = list(df.Alarms.unique())

# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})

Here, the flags would go against rows 4, 13 and 21 (well, indices).

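2 Answers:

Answer 0: (score: 4)

You can use pandas.IntervalIndex here. Once the intervals are set as the index, df.loc with a list of timestamps selects the rows whose interval contains each timestamp. Note that IntervalIndex.from_arrays builds right-closed intervals by default (closed='right').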

# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)

# Update using loc
df.loc[alarms, 'flag'] = 'alarm'

# Finally, reset_index
df = df.reset_index(drop=True)

[out]

            start_date            end_Date   flag
0  2019-07-18 00:00:00 2019-07-18 03:00:00    NaN
1  2019-07-18 03:00:00 2019-07-18 06:00:00    NaN
2  2019-07-18 06:00:00 2019-07-18 09:00:00    NaN
3  2019-07-18 09:00:00 2019-07-18 12:00:00    NaN
4  2019-07-18 12:00:00 2019-07-18 15:00:00  alarm
5  2019-07-18 15:00:00 2019-07-18 18:00:00    NaN
6  2019-07-18 18:00:00 2019-07-18 21:00:00    NaN
7  2019-07-18 21:00:00 2019-07-19 00:00:00    NaN
8  2019-07-19 00:00:00 2019-07-19 03:00:00    NaN
9  2019-07-19 03:00:00 2019-07-19 06:00:00    NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00    NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00    NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00    NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00  alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00    NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00    NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00    NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00    NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00    NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00    NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00    NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00  alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00    NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00    NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00    NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00    NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00    NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00    NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00    NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00    NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00    NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00    NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00    NaN
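
Because the runs in the sample data sit back to back, an alarm that lands exactly on a boundary belongs to one row or the other depending on which side of the interval is closed. The sketch below is only an illustration of that point, not part of the answer above: the three-row `runs` table, the boundary alarm time and the helper `flag_rows` are all made up for the example. It uses `IntervalIndex.contains`, so the DataFrame's own index is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the runs table: three back-to-back 3-hour runs.
runs = pd.DataFrame({
    "start_date": pd.to_datetime(["2019-07-18 00:00", "2019-07-18 03:00", "2019-07-18 06:00"]),
    "end_Date": pd.to_datetime(["2019-07-18 03:00", "2019-07-18 06:00", "2019-07-18 09:00"]),
})

# One alarm exactly on a run boundary, one in the middle of a run (made-up values).
alarms = [pd.Timestamp("2019-07-18 03:00:00"), pd.Timestamp("2019-07-18 07:30:00")]


def flag_rows(frame, times, closed="right"):
    """Return a boolean Series marking rows whose interval contains any of `times`."""
    intervals = pd.IntervalIndex.from_arrays(
        frame["start_date"], frame["end_Date"], closed=closed
    )
    hit = np.zeros(len(frame), dtype=bool)
    for t in times:
        # contains() checks one scalar against every interval at once.
        hit |= np.asarray(intervals.contains(t))
    return pd.Series(hit, index=frame.index)


right_closed = flag_rows(runs, alarms, closed="right")  # (start, end]
left_closed = flag_rows(runs, alarms, closed="left")    # [start, end)

print(right_closed.tolist())  # [True, False, True]
print(left_closed.tolist())   # [False, True, True]
```

With `closed="right"` the 03:00 alarm is attributed to the run that ends at 03:00; with `closed="left"` it moves to the run that starts at 03:00. Which one is appropriate depends on how the instrument logs its run boundaries.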

Answer 1: (score: 1)

You named your columns start_date and end_Date, but your loop uses start_time and end_time.

Try this:

import pandas as pd

# list of datetimes for the alarms (the strings are day-first: dd/mm/yy)
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'], dayfirst=True)
alarms = list(df.Alarms.unique())

# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})

# flag every run whose (start, end) range strictly contains an alarm time
for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df.loc[df['Flag'] == 'Alarm', 'Flag'])

Output:

4     Alarm
13    Alarm
21    Alarm
Name: Flag, dtype: object
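
If the Python-level loop over alarms becomes a bottleneck, the same strict comparison can be done in one step with NumPy broadcasting. This is a sketch rather than part of the answer above: it assumes the row-by-alarm boolean matrix fits in memory, which is fine for 10k rows and a modest number of alarms, and the names `starts`, `ends`, `alarm_values`, `inside` and `flagged` are introduced only for illustration.

```python
import pandas as pd

# Same sample data as above; an explicit format avoids day/month ambiguity.
alarms = pd.to_datetime(
    ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"],
    format="%d/%m/%y %H:%M:%S",
)

n = 33
df = pd.DataFrame({
    "start_date": pd.date_range("2019-07-18", "2019-07-22", periods=n),
    "end_Date": pd.date_range("2019-07-18 03:00:00", "2019-07-22 03:00:00", periods=n),
})

# A (n_rows, 1) column compared with a (n_alarms,) vector broadcasts
# to a (n_rows, n_alarms) boolean matrix.
starts = df["start_date"].to_numpy()[:, None]
ends = df["end_Date"].to_numpy()[:, None]
alarm_values = alarms.to_numpy()

inside = (starts < alarm_values) & (ends > alarm_values)

# A row is flagged if at least one alarm falls strictly inside its range.
df.loc[inside.any(axis=1), "Flag"] = "Alarm"

flagged = df.index[df["Flag"] == "Alarm"].tolist()
print(flagged)  # [4, 13, 21]
```

The flagged indices match the loop's output. The trade-off is memory: the intermediate matrix holds one boolean per (row, alarm) pair.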