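Pandas rolling window: developing rules based on window values

Time: 2020-10-03 20:15:41

Tags: python pandas dataframe rolling-computation

I am working on a neonatal project. Long story short, newborns are assigned a score based on their symptoms at a given point in time, and depending on how the scores change over time we decide whether to increase the medication dose, keep it unchanged, or wean. We denote these three states as +1 (increase), 0 (hold) and -1 (wean). The rules for deciding what to do are:

  • Increase the dose if the sum of three consecutive scores is >= 24 or a single score is >= 12.
  • Hold the dose if neither the rule for increasing nor the rule for reducing the dose is met.
  • Reduce the dose if no increase has been needed for at least 48 hours, the sum of the 3 most recent scores is < 18, and no single score is greater than 8.

With help from people here, we have code for increasing and holding the dose. However, I am struggling to write the rule that determines when to reduce the dose. Here is a sample of the code we have: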

import pandas as pd

df = pd.DataFrame({
   'baby': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
   'dateandtime':  ['8/2/2009  5:00:00 PM', '7/19/2009  5:00:00 PM', '7/19/2009  5:00:00 PM', '7/17/2009  6:00:00 AM','7/17/2009  12:01:00 AM', '7/14/2009  12:01:00 AM', '7/19/2009  5:00:00 AM', '7/16/2009  9:00:00 PM','7/19/2009  9:00:00 AM', '7/14/2009  6:00:00 PM', '7/15/2009  3:04:00 PM', '7/20/2009  5:00:00 PM','7/16/2009  12:01:00 AM', '7/18/2009  1:00:00 PM', '7/16/2009  6:00:00 AM', '7/13/2009  9:00:00 PM','7/19/2009  1:00:00 AM','7/15/2009  12:04:00 AM'],
   'score':  [6, 3, 3, 5, 10, 14, 5, 4, 11, 4, 4, 6, 7, 4, 6, 12, 6, 6]
    })

df.dateandtime = pd.to_datetime(df['dateandtime']) # change column type for ease of indexing
df = df.set_index('dateandtime')
df.sort_index(inplace = True)
df = df[~df.index.duplicated()] #Remove any duplicated rows

#Calculate conditions
df['sum_3_scores'] = df.groupby('baby')['score'].rolling(3).sum().reset_index(0,drop=True)
df['max_1_score'] = df.groupby('baby')['score'].rolling(1).max().reset_index(0,drop=True)

#you don't need to calculate the 24hr mean: because the 48hr max is 8, the 24hr mean will also be < 8
#df['mean_24hr_score'] = df.groupby('baby')['score'].rolling('24h').mean().reset_index(0,drop=True)

#scoring logic
def score(data):
    if data['sum_3_scores'] >= 24 or data['max_1_score'] >= 12:
        return 1
    return 0

df['rule'] = df.apply(score, axis = 1)

df.reset_index().set_index(['baby','dateandtime']).sort_index()
print(df)

This produces a nice data frame with what I want (apart from the rule for reducing the dose):

                    baby  score  sum_3_scores  max_1_score  rule
dateandtime                                                     
2009-07-13 21:00:00    B     12           NaN         12.0     1
2009-07-14 00:01:00    A     14           NaN         14.0     1
2009-07-14 18:00:00    B      4           NaN          4.0     0
2009-07-15 00:04:00    B      6          22.0          6.0     0
2009-07-15 15:04:00    B      4          14.0          4.0     0
2009-07-16 00:01:00    B      7          17.0          7.0     0
2009-07-16 06:00:00    B      6          17.0          6.0     0
2009-07-16 21:00:00    A      4           NaN          4.0     0
2009-07-17 00:01:00    A     10          28.0         10.0     1
2009-07-17 06:00:00    A      5          19.0          5.0     0
2009-07-18 13:00:00    B      4          17.0          4.0     0
2009-07-19 01:00:00    B      6          16.0          6.0     0
2009-07-19 05:00:00    A      5          20.0          5.0     0
2009-07-19 09:00:00    A     11          21.0         11.0     0
2009-07-19 17:00:00    A      3          19.0          3.0     0
2009-07-20 17:00:00    B      6          16.0          6.0     0
2009-08-02 17:00:00    A      6          20.0          6.0     0

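What is a simple way to program the dose-reduction rule? I know I can use df.groupby('baby')['score'].rolling('48h') to get a 48h window, but it is not clear to me how to check the sum of only the 3 most recent scores within that window.

1 answer:

Answer 0 (score: 0)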

Your setup:

import pandas as pd

df = pd.DataFrame({
    'baby': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'dateandtime': ['8/2/2009 5:00:00 PM', '7/19/2009 5:00:00 PM', '7/19/2009 5:00:00 PM', '7/17/2009 6:00:00 AM',
                    '7/17/2009 12:01:00 AM', '7/14/2009 12:01:00 AM', '7/19/2009 5:00:00 AM', '7/16/2009 9:00:00 PM',
                    '7/19/2009 9:00:00 AM', '7/14/2009 6:00:00 PM', '7/15/2009 3:04:00 PM', '7/20/2009 5:00:00 PM',
                    '7/16/2009 12:01:00 AM', '7/18/2009 1:00:00 PM', '7/16/2009 6:00:00 AM', '7/13/2009 9:00:00 PM',
                    '7/19/2009 1:00:00 AM', '7/15/2009 12:04:00 AM'],
    'score': [6, 3, 3, 5, 10, 14, 5, 4, 11, 4, 4, 6, 7, 4, 6, 12, 6, 6]
})

df.dateandtime = pd.to_datetime(df['dateandtime'])  # change column type for ease of indexing
df = df.set_index('dateandtime')
df = df[~df.index.duplicated()]  # remove any duplicated rows

I will use .groupby() three times. To check max_last3, sum_last3 and last48h_any_critical by hand, I suggest sorting by baby and dateandtime:

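# this helps
df = df.sort_values(by=['baby', 'dateandtime'])
# this is okay too (use it instead of the line above, not in addition)
# df.sort_index(inplace=True)

To get the sum of the last 3 values, first group by baby, then take a rolling window of size 3, then sum each window. Important: if the first two values are, say, 12 and 13, their sum is already >= 24, but no window of size 3 can be built yet, so the value will be NaN, and (NaN >= 24) == False. To allow incomplete windows to be built, use min_periods=1.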
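The effect of min_periods can be seen on a toy series. The following is a minimal sketch with made-up scores (12, 13, 2, 3) rather than the question's data:

```python
import pandas as pd

# Toy scores for a single baby (made-up values, not from the question's data).
scores = pd.Series([12, 13, 2, 3])

# Default: a window needs 3 observations, so the first two results are NaN.
strict = scores.rolling(3).sum()

# min_periods=1: partial windows at the start are summed as well.
partial = scores.rolling(3, min_periods=1).sum()

print(strict.tolist())   # [nan, nan, 27.0, 18.0]
print(partial.tolist())  # [12.0, 25.0, 27.0, 18.0]

# NaN compares as False, so the strict version misses the 12 + 13 = 25 case.
print((strict >= 24).tolist())   # [False, False, True, False]
print((partial >= 24).tolist())  # [False, True, True, False]
```

With min_periods=1 the second row already sums to 25 and is flagged, whereas the strict window only flags the third row.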

sum_last3 = df.groupby('baby')['score'].rolling(3, min_periods=1).sum()
df['sum_last3'] = sum_last3.reset_index(level=0, drop=True)

df['sum_last3_critical'] = df['sum_last3'] >= 24
df['sum_last3_good'] = df['sum_last3'] < 18

I am still not sure whether you want to look at all scores, the last 3 scores, or only the last score. This implementation detects values >= 12 among the last 3 scores. Alternative solutions are at the end.

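max_last3 = df.groupby('baby')['score'].rolling(3, min_periods=1).max()
df['max_last3'] = max_last3.reset_index(level=0, drop=True)

df['max_last3_ciritical'] = df['max_last3'] >= 12
# the question's rule reads "no score greater than 8"; use <= 8 if a score of exactly 8 should still count as good
df['max_last3_good'] = df['max_last3'] < 8

Now you can build a critical column that indicates whether the dose must be increased: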

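df['critical'] = df['sum_last3_critical'] | df['max_last3_ciritical']

Now take a 48-hour time window and compute the maximum of the critical column (1.0 if True, 0.0 if False). Ideally you would use .any(), but it does not exist for this rolling GroupBy object. Since .max() returns a numeric value, it is converted back to a boolean afterwards.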
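A minimal sketch of this max-then-cast trick on a single toy series, assuming made-up timestamps and flags. Time-based windows in pandas are closed on the right by default, so each window covers the 48 hours up to and including the current row, and a row exactly 48 hours old falls outside it:

```python
import pandas as pd

# Toy flags for one baby (made-up timestamps and values).
idx = pd.to_datetime([
    '2009-07-14 00:00', '2009-07-15 00:00',
    '2009-07-16 12:00', '2009-07-18 00:00',
])
critical = pd.Series([True, False, False, False], index=idx)

# Rolling works on numbers, so cast the flags to 0/1 first, take the maximum
# over the trailing 48 hours, and turn the result back into booleans.
any_critical_48h = critical.astype(int).rolling('48h').max().astype(bool)

print(any_critical_48h.tolist())  # [True, True, False, False]
```

The second row is still flagged because the critical row 24 hours earlier lies inside its window; by the third row that critical row is 60 hours old and no longer counts.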

last48h_any_critical = df.groupby('baby').rolling('48h')['critical'].max().astype('bool')
df['last48h_good'] = ~last48h_any_critical.reset_index(level=0, drop=True)

Now you can flag the rows where the baby has been in good condition and the dose should be reduced:

df['good'] = df['last48h_good'] & df['sum_last3_good'] & df['max_last3_good']

To get the action value, simply subtract the good column from the critical column:

df['action'] = df['critical'].astype(int) - df['good'].astype(int)

The resulting DataFrame looks like this:

                     baby  score  sum_last3  sum_last3_critical  sum_last3_good  max_last3  max_last3_ciritical  max_last3_good  critical  last48h_good   good  action
dateandtime
2009-07-14 00:01:00     A     14       14.0               False            True       14.0                 True           False      True         False  False       1
2009-07-16 21:00:00     A      4       18.0               False           False       14.0                 True           False      True         False  False       1
2009-07-17 00:01:00     A     10       28.0                True           False       14.0                 True           False      True         False  False       1
2009-07-17 06:00:00     A      5       19.0               False           False       10.0                False           False     False         False  False       0
2009-07-19 05:00:00     A      5       20.0               False           False       10.0                False           False     False          True  False       0
2009-07-19 09:00:00     A     11       21.0               False           False       11.0                False           False     False          True  False       0
2009-07-19 17:00:00     A      3       19.0               False           False       11.0                False           False     False          True  False       0
2009-08-02 17:00:00     A      6       20.0               False           False       11.0                False           False     False          True  False       0
2009-07-13 21:00:00     B     12       12.0               False            True       12.0                 True           False      True         False  False       1
2009-07-14 18:00:00     B      4       16.0               False            True       12.0                 True           False      True         False  False       1
2009-07-15 00:04:00     B      6       22.0               False           False       12.0                 True           False      True         False  False       1
2009-07-15 15:04:00     B      4       14.0               False            True        6.0                False            True     False         False  False       0
2009-07-16 00:01:00     B      7       17.0               False            True        7.0                False            True     False         False  False       0
2009-07-16 06:00:00     B      6       17.0               False            True        7.0                False            True     False         False  False       0
2009-07-18 13:00:00     B      4       17.0               False            True        7.0                False            True     False          True   True      -1
2009-07-19 01:00:00     B      6       16.0               False            True        6.0                False            True     False          True   True      -1
2009-07-20 17:00:00     B      6       16.0               False            True        6.0                False            True     False          True   True      -1

Alternative options

If you want to look at all previous values rather than only the last three, use expanding instead.
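To make the difference concrete, here is a small sketch comparing a rolling maximum over the last 3 values with an expanding maximum over everything seen so far. The scores are made up for illustration:

```python
import pandas as pd

scores = pd.Series([12, 4, 6, 4, 7])  # made-up scores for one baby

# Maximum over the last 3 observations only.
last3_max = scores.rolling(3, min_periods=1).max()

# Maximum over all observations up to and including the current one.
alltime_max = scores.expanding().max()

print(last3_max.tolist())    # [12.0, 12.0, 12.0, 6.0, 7.0]
print(alltime_max.tolist())  # [12.0, 12.0, 12.0, 12.0, 12.0]
```

One consequence worth keeping in mind: with the expanding maximum, a single early score of 8 or more keeps the maximum at or above 8 for good, so a "maximum < 8" check can never pass again for that baby.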

# ideally rename max_last3 to something like max_alltime
max_last3 = df.groupby('baby')['score'].expanding().max()
df['max_last3'] = max_last3.reset_index(level=0, drop=True)

df['max_last3_ciritical'] = df['max_last3'] >= 12
df['max_last3_good'] = df['max_last3'] < 8

If you only want to look at the last value, you can compare directly against score.
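A minimal sketch of such a direct comparison, assuming the same thresholds as above (>= 12 critical, < 8 good). The column names score_critical and score_good are illustrative and not taken from the original answer, and the data is made up:

```python
import pandas as pd

# Made-up scores; with the answer's data frame this would be df['score'].
df = pd.DataFrame({'score': [12, 4, 9, 6]})

# Hypothetical column names, following the *_critical / *_good naming pattern.
df['score_critical'] = df['score'] >= 12
df['score_good'] = df['score'] < 8

print(df)
```

No grouping or rolling is needed in this variant, because each row is judged on its own score only.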