I have the following dataframe in pandas:
code tank date time no_operation_flag
123 1 01-01-2019 00:00:00 1
123 1 01-01-2019 00:30:00 1
123 1 01-01-2019 01:00:00 0
123 1 01-01-2019 01:30:00 1
123 1 01-01-2019 02:00:00 1
123 1 01-01-2019 02:30:00 1
123 1 01-01-2019 03:00:00 1
123 1 01-01-2019 03:30:00 1
123 1 01-01-2019 04:00:00 1
123 1 01-01-2019 05:00:00 1
123 1 01-01-2019 14:00:00 1
123 1 01-01-2019 14:30:00 1
123 1 01-01-2019 15:00:00 1
123 1 01-01-2019 15:30:00 1
123 1 01-01-2019 16:00:00 1
123 1 01-01-2019 16:30:00 1
123 2 02-01-2019 00:00:00 1
123 2 02-01-2019 00:30:00 0
123 2 02-01-2019 01:00:00 0
123 2 02-01-2019 01:30:00 0
123 2 02-01-2019 02:00:00 1
123 2 02-01-2019 02:30:00 1
123 2 02-01-2019 03:00:00 1
123 2 03-01-2019 03:30:00 1
123 2 03-01-2019 04:00:00 1
123 1 03-01-2019 14:00:00 1
123 2 03-01-2019 15:00:00 1
123 2 03-01-2019 00:30:00 1
123 2 04-01-2019 11:00:00 1
123 2 04-01-2019 11:30:00 0
123 2 04-01-2019 12:00:00 1
123 2 04-01-2019 13:30:00 1
123 2 05-01-2019 03:00:00 1
123 2 05-01-2019 03:30:00 1
123 2 05-01-2019 04:00:00 1
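What I want to do is flag consecutive 1s in no_operation_flag that occur more than 5 times in a row at the tank and day level, but the times must also be consecutive (half an hour apart). The dataframe is already sorted by tank, date, and time.
My desired dataframe is: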
code tank date time no_operation_flag final_flag
123 1 01-01-2019 00:00:00 1 0
123 1 01-01-2019 00:30:00 1 0
123 1 01-01-2019 01:00:00 0 0
123 1 01-01-2019 01:30:00 1 1
123 1 01-01-2019 02:00:00 1 1
123 1 01-01-2019 02:30:00 1 1
123 1 01-01-2019 03:00:00 1 1
123 1 01-01-2019 03:30:00 1 1
123 1 01-01-2019 04:00:00 1 1
123 1 01-01-2019 05:00:00 1 0
123 1 01-01-2019 14:00:00 1 1
123 1 01-01-2019 14:30:00 1 1
123 1 01-01-2019 15:00:00 1 1
123 1 01-01-2019 15:30:00 1 1
123 1 01-01-2019 16:00:00 1 1
123 1 01-01-2019 16:30:00 1 1
123 2 02-01-2019 00:00:00 1 0
123 2 02-01-2019 00:30:00 0 0
123 2 02-01-2019 01:00:00 0 0
123 2 02-01-2019 01:30:00 0 0
123 2 02-01-2019 02:00:00 1 0
123 2 02-01-2019 02:30:00 1 0
123 2 02-01-2019 03:00:00 1 0
123 2 03-01-2019 03:30:00 1 0
123 2 03-01-2019 04:00:00 1 0
123 1 03-01-2019 14:00:00 1 0
123 2 03-01-2019 15:00:00 1 0
123 2 03-01-2019 00:30:00 1 0
123 2 04-01-2019 11:00:00 1 0
123 2 04-01-2019 11:30:00 0 0
123 2 04-01-2019 12:00:00 1 0
123 2 04-01-2019 13:30:00 1 0
123 2 05-01-2019 03:00:00 1 0
123 2 05-01-2019 03:30:00 1 0
123 2 05-01-2019 04:00:00 1 0
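How can I do this in pandas?
Answer 0 (score: 2)
You can use a solution like this, with one addition: build a new helper DataFrame that holds consecutive datetimes per group, with all missing datetimes added, then use merge at the end to add the new column: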
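import pandas as pd

# Combine date and time into a single datetime column
df['datetimes'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str))
# Helper DataFrame: one row per 30-minute slot within each group (missing slots become NaN)
df1 = (df.set_index('datetimes')
         .groupby(['code','tank', 'date'])['no_operation_flag']
         .resample('30min')
         .first()
         .reset_index())
# Label runs of equal values; a NaN slot or a group change starts a new run
shifted1 = df1.groupby(['code','tank', 'date'])['no_operation_flag'].shift()
g1 = df1['no_operation_flag'].ne(shifted1).cumsum()
# Flag runs of 1s that are longer than 5 rows
mask1 = g1.map(g1.value_counts()).gt(5) & df1['no_operation_flag'].eq(1)
df1['final_flag'] = mask1.astype(int)
#print (df1.head(40))
# Bring final_flag back onto the original rows
df = df.merge(df1[['code','tank','datetimes','final_flag']]).drop('datetimes', axis=1)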
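The run-length step in the code above may be easier to follow in isolation. Here is a minimal sketch on a made-up Series (the values and variable names are illustrative, not taken from the question's data), assuming the rows are already evenly spaced:

```python
import pandas as pd

# Illustrative flags: a run of two 1s, a 0, a run of six 1s, a 0
flags = pd.Series([1, 1, 0, 1, 1, 1, 1, 1, 1, 0])

# A new run starts wherever the value differs from the previous row
run_id = flags.ne(flags.shift()).cumsum()

# Length of the run each row belongs to
run_len = run_id.map(run_id.value_counts())

# Keep only runs of 1s that are longer than 5 rows
final_flag = (run_len.gt(5) & flags.eq(1)).astype(int)
print(final_flag.tolist())
```

Only the six-row run of 1s ends up flagged; the two-row run and the zeros stay at 0.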
Answer 1 (score: 2)
Use:
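df['final_flag'] = ( df.groupby([df['no_operation_flag'].ne(1).cumsum(),  # new block after every non-1 row
                                 'tank',
                                 'date',
                                 pd.to_datetime(df['time'].astype(str))
                                   .diff()
                                   .ne(pd.Timedelta(minutes = 30))
                                   .cumsum(),                              # new block at every gap other than 30 minutes
                                 'no_operation_flag'])['no_operation_flag']
                       .transform('size')
                       .gt(5)                                              # keep blocks with more than 5 rows
                       .astype('uint8') )                                  # Series.view is deprecated in recent pandas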
print(df)
Output:
code tank date time no_operation_flag final_flag
0 123 1 01-01-2019 00:00:00 1 0
1 123 1 01-01-2019 00:30:00 1 0
2 123 1 01-01-2019 01:00:00 0 0
3 123 1 01-01-2019 01:30:00 1 1
4 123 1 01-01-2019 02:00:00 1 1
5 123 1 01-01-2019 02:30:00 1 1
6 123 1 01-01-2019 03:00:00 1 1
7 123 1 01-01-2019 03:30:00 1 1
8 123 1 01-01-2019 04:00:00 1 1
9 123 1 01-01-2019 05:00:00 1 0
10 123 1 01-01-2019 14:00:00 1 1
11 123 1 01-01-2019 14:30:00 1 1
12 123 1 01-01-2019 15:00:00 1 1
13 123 1 01-01-2019 15:30:00 1 1
14 123 1 01-01-2019 16:00:00 1 1
15 123 1 01-01-2019 16:30:00 1 1
16 123 2 02-01-2019 00:00:00 1 0
17 123 2 02-01-2019 00:30:00 0 0
18 123 2 02-01-2019 01:00:00 0 0
19 123 2 02-01-2019 01:30:00 0 0
20 123 2 02-01-2019 02:00:00 1 0
21 123 2 02-01-2019 02:30:00 1 0
22 123 2 02-01-2019 03:00:00 1 0
23 123 2 03-01-2019 03:30:00 1 0
24 123 2 03-01-2019 04:00:00 1 0
25 123 1 03-01-2019 14:00:00 1 0
26 123 2 03-01-2019 15:00:00 1 0
27 123 2 03-01-2019 00:30:00 1 0
28 123 2 04-01-2019 11:00:00 1 0
29 123 2 04-01-2019 11:30:00 0 0
30 123 2 04-01-2019 12:00:00 1 0
31 123 2 04-01-2019 13:30:00 1 0
32 123 2 05-01-2019 03:00:00 1 0
33 123 2 05-01-2019 03:30:00 1 0
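The time-gap key used in the groupby above can be looked at on its own. The following is a small sketch on a handful of illustrative times (the variable names are assumptions for this example), showing how every gap other than 30 minutes starts a new block:

```python
import pandas as pd

# Illustrative times: two half-hour neighbours, a one-hour gap, a long gap, then neighbours again
times = pd.Series(["03:30:00", "04:00:00", "05:00:00", "14:00:00", "14:30:00"])

# True wherever the step from the previous row is not exactly 30 minutes
# (the first row has no previous row, so its difference is NaT and it also counts as a break)
gap_break = pd.to_datetime(times, format="%H:%M:%S").diff().ne(pd.Timedelta(minutes=30))

# The running sum of the breaks gives one id per block of consecutive half hours
block_id = gap_break.cumsum()
print(block_id.tolist())
```

Rows that share a block id are consecutive half-hour slots, which is why the 05:00:00 row in the output above is not merged with its neighbours.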
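Answer 2 (score: 0)
Maybe it can be done in one go, but a two-step approach is simpler. First, select the tanks one by one, then look for the sequences of five 1s.
This other question has already addressed searching for a pattern in a column.
If you want to look at it another way, you can use rolling to sum the 1s, or use an "all values are True" condition to find the matching elements.
You can also mask a column, but that only gives you the values inside the mask. That answers a different question: "which tanks were not operating, and where, at a given time".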
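The rolling-sum idea could be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name flag_long_runs and the parameter min_len are made up for this example, and it assumes the Series holds one tank and one day with rows already spaced exactly half an hour apart.

```python
import pandas as pd

def flag_long_runs(flags: pd.Series, min_len: int = 6) -> pd.Series:
    """Mark every row that lies inside a window of min_len consecutive 1s."""
    # 1 at each row where the last min_len values (including this one) are all 1
    window_end = flags.rolling(min_len).sum().eq(min_len).astype(int)
    # A row is inside a full window if one ends at this row or within the next min_len - 1 rows;
    # reversing the Series turns that forward-looking check into an ordinary rolling max
    member = window_end[::-1].rolling(min_len, min_periods=1).max()[::-1]
    return member.astype(int)

flags = pd.Series([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0])
result = flag_long_runs(flags)
print(result.tolist())
```

With min_len=6 this corresponds to "more than 5 in a row"; to respect tanks and days it would have to be applied per group, for example through groupby on tank and date.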
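Answer 3 (score: 0)
I think this is a rather old-fashioned and slightly dirty approach, but it is easy to understand.
Loop over the rows and check whether the row 4 positions later is exactly 2 hours ahead.
If so, check whether the five corresponding values of df['no_operation_flag'] are all 1.
If so, put 1 in the five corresponding values of df['final_flag'].
import pandas as pd

# make col with zero
df['final_flag'] = 0
# parse every row's timestamp once, outside the loop
ts = pd.to_datetime(df['date'] + ' ' + df['time'])
flag_col = df.columns.get_loc('final_flag')
# note: this flags windows of 5 rows; "more than 5" would need a 6-row window spanning 2.5 hours
for i in range(len(df)-4):
    j = i + 4
    # timedelta is 2 hours?
    if ts.iloc[j] - ts.iloc[i] == pd.Timedelta(hours=2):
        # all of no_operation_flag == 1?
        if (df['no_operation_flag'].iloc[i:j+1] == 1).all():
            # positional assignment on df itself avoids chained indexing
            df.iloc[i:j+1, flag_col] = 1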