我有一个具有5000万条记录的熊猫数据框,而我想做的是根据条件进行回填。如我们所见,名称800A和Barber的时间戳对齐,因此我假设数据属于同一名称,并且在记录数据时只是一个错误。米娅的名字也一样。
这只是示例数据。
我的数据框看起来像这样。
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
现在,我尝试回填记录编号(rec_num)列,直到它在x_inc和vs_inc列中都映射布尔条件False False为止。
实际输出:
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE 7192783
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE 7192783
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
预期输出:
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
我正在使用df['rec_num'].fillna(method='bfill')
,但是它完全填满了,这不是我的理想解决方案。如果能对这个问题有任何建议(或者有更好的方法),我将不胜感激。预先感谢。
答案 0 :(得分:2)
使用布尔掩码和np.where()
,您可以使用此功能:
cond=(df.x_inc == False) & (df.vs_inc == False) #creates a boolean mask where both columns are false
df['new_rec']=np.where(~cond,df.rec_num.bfill(),df.rec_num) #does a backfill on where condition is not met
print(df)
注意: 您可以将值重新分配给名为rec_num
的旧列,而无需创建新列。我补充说,以便您可以进行比较。这也是自向量化以来最快的方法
datetime name dischargeDate HR Sp x_inc vs_inc rec_num new_rec
0 2019-05-01 18:04:50 Zawisza 2005-01-14 18:05:00 119.0 98.0 False True 6458445.0 6458445.0
1 2019-05-01 18:04:55 Zawisza 2005-01-14 18:05:00 120.0 97.0 False True 6458445.0 6458445.0
2 2019-05-01 18:05:00 Zawisza 2005-01-14 18:05:00 NaN NaN False False NaN NaN
3 2029-01-01 17:58:45 800A 2029-01-14 17:59:10 NaN NaN False False NaN NaN
4 2029-01-01 17:58:50 800A 2029-01-14 17:59:10 139.0 NaN False True NaN 7192783.0
5 2029-01-01 17:58:55 800A 2029-01-14 17:59:10 138.0 NaN False True NaN 7192783.0
...........................................................
...........................................................
....................................................
.....................................
答案 1 :(得分:1)
您可以使用应用。
创建函数:
def foo(x):
if not bool(x['epic_include']) and not bool(x['vs_include']):
return None
else:
if not pd.isna(x['twist_mrn']):
return x['twist_mrn']
else:
return df['twist_mrn'].iloc[df.iloc[x.name:]['twist_mrn'].first_valid_index()]
因此,应用:
df['twist_mrn'] = df.apply(foo, axis=1)
输出:
datetime patient_name dischargeDate HR SpO2 epic_include vs_include twist_mrn
0 2014-01-05 18:04:50 Zawisza 2014-01-05 18:05:00 119.0 98.0 False True 4654843.0
1 2014-01-05 18:04:55 Zawisza 2014-01-05 18:05:00 120.0 97.0 False True 4654843.0
2 2014-01-05 18:05:00 Zawisza 2014-01-05 18:05:00 NaN NaN False False NaN
3 2014-01-29 17:58:45 800A 2014-01-29 17:59:10 NaN NaN False False NaN
4 2014-01-29 17:58:50 800A 2014-01-29 17:59:10 139.0 NaN False True 4719848.0
5 2014-01-29 17:58:55 800A 2014-01-29 17:59:10 138.0 NaN False True 4719848.0
6 2014-01-29 17:59:00 800A 2014-01-29 17:59:10 138.0 96.0 False True 4719848.0
7 2014-01-29 17:59:05 800A 2014-01-29 17:59:10 138.0 96.0 False True 4719848.0
8 2014-01-29 17:59:10 800A 2014-01-29 17:59:10 138.0 96.0 False True 4719848.0
9 2014-01-29 17:59:15 Barber 2014-01-29 18:17:15 138.0 96.0 False True 4719848.0
10 2014-01-29 17:59:20 Barber 2014-01-29 18:17:15 138.0 96.0 False True 4719848.0
11 2014-01-29 17:59:25 Barber 2014-01-29 18:17:15 138.0 95.0 False True 4719848.0
12 2015-03-04 21:19:45 800A 2015-03-05 01:00:15 NaN NaN False False NaN
13 2015-03-05 00:53:10 800A 2015-03-05 01:00:15 NaN NaN False False NaN
14 2015-03-05 00:55:40 800A 2015-03-05 01:00:15 NaN 95.0 False True 4163407.0
15 2015-03-05 00:55:45 800A 2015-03-05 01:00:15 NaN 95.0 False True 4163407.0
16 2015-03-05 00:55:50 800A 2015-03-05 01:00:15 NaN 94.0 False True 4163407.0
17 2015-03-05 00:55:55 800A 2015-03-05 01:00:15 81.0 93.0 False True 4163407.0