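To make meaningful comparisons across regions, I want to normalize confirmed COVID-19 cases by each country's outbreak start date. For any territory, the first day on which it reaches or exceeds 10 confirmed cases is treated as "day 0 of the outbreak".
Example DataFrame: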
[in]
import pandas as pd
confirmed_cases = {'Date':['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20'], 'Australia':[0, 0, 0, 30, 50], 'Albania':[0, 20, 25, 30, 50], 'Algeria':[25, 40, 50, 50, 70]}
df = pd.DataFrame(confirmed_cases)
df
[out]
      Date  Australia  Albania  Algeria
0  1/22/20          0        0       25
1  1/23/20          0       20       40
2  1/24/20          0       25       50
3  1/25/20         30       30       50
4  1/26/20         50       50       70
Desired result:
   Day Since Outbreak  Australia  Albania  Algeria
0                   0         30       20       25
1                   1         50       25       40
2                   2        NaN       30       50
3                   3        NaN       50       50
4                   4        NaN      NaN       70
Is there a way to do this in a few simple lines of Python / pandas code?
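Answer 0 (score: 4)
For each country, find the index of the first value to cross the threshold (10), then shift that column up by that many rows. Note that this relies on the default integer index, and that idxmax returns the first index when no value crosses the threshold, so such a column is left unshifted.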
# >= rather than >, since day 0 is the first day with at least 10 cases
df2 = df[['Australia', 'Albania', 'Algeria']].apply(lambda x: x.shift(-(x >= 10).idxmax()))
# df2
   Australia  Albania  Algeria
0       30.0     20.0       25
1       50.0     25.0       40
2        NaN     30.0       50
3        NaN     50.0       50
4        NaN      NaN       70
Reset the index to get the "Day Since Outbreak" column:
df2.reset_index().rename(columns={'index': 'Day Since Outbreak'})
   Day Since Outbreak  Australia  Albania  Algeria
0                   0       30.0     20.0       25
1                   1       50.0     25.0       40
2                   2        NaN     30.0       50
3                   3        NaN     50.0       50
4                   4        NaN      NaN       70
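The approach above can be wrapped in a small helper. What follows is only a sketch under some assumptions: the name align_to_outbreak and its threshold parameter are made up for illustration, the threshold is treated as inclusive (at least 10 cases, as the question states), and a column that never reaches the threshold is returned as all NaN instead of being left unshifted.

```python
import pandas as pd


def align_to_outbreak(cases: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Shift each column up so that row 0 is its first value >= threshold.

    Hypothetical helper, not part of pandas.
    """
    aligned = {}
    for country, series in cases.items():
        # Work on a positional 0..n-1 index so idxmax returns a row offset.
        values = series.reset_index(drop=True)
        reached = values.ge(threshold)
        if reached.any():
            # idxmax on a boolean Series gives the label of the first True.
            aligned[country] = values.shift(-int(reached.idxmax()))
        else:
            # The threshold is never reached, so there is no day 0 at all.
            aligned[country] = pd.Series(float("nan"), index=values.index)
    result = pd.DataFrame(aligned)
    result.index.name = "Day Since Outbreak"
    return result


confirmed_cases = {
    "Date": ["1/22/20", "1/23/20", "1/24/20", "1/25/20", "1/26/20"],
    "Australia": [0, 0, 0, 30, 50],
    "Albania": [0, 20, 25, 30, 50],
    "Algeria": [25, 40, 50, 50, 70],
}
df = pd.DataFrame(confirmed_cases)
aligned = align_to_outbreak(df.drop(columns="Date"))
print(aligned)
```

Calling aligned.reset_index() would turn the named index into an ordinary "Day Since Outbreak" column, as in the desired result.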
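Answer 1 (score: 1)
Based on the initial run of values below 10, determine how many rows each column needs to be shifted, then shift them. cummin ensures that an intermittent value below 10 later in the series is not counted toward the shift.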
df = df.drop(columns='Date')  # Won't need it
s = df.lt(10).cummin().sum()  # length of each column's leading run of values < 10
for col, shift in s.items():  # iteritems() was removed in pandas 2.0
    df[col] = df[col].shift(-shift)
df['Days Since'] = range(len(df))  # Duplicative with index...
   Australia  Albania  Algeria  Days Since
0       30.0     20.0       25           0
1       50.0     25.0       40           1
2        NaN     30.0       50           2
3        NaN     50.0       50           3
4        NaN      NaN       70           4
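To see why cummin matters here, consider a small made-up series in which the count dips below 10 again after the outbreak has started. The values are hypothetical and chosen only to illustrate the difference. Without cummin, the later dip would be counted too and the column would be shifted one row too far.

```python
import pandas as pd

# Hypothetical series: below 10, then above, then a dip below 10, then above.
cases = pd.Series([0, 12, 8, 15])

below = cases.lt(10)            # True, False, True, False
naive_shift = int(below.sum())  # counts every value below 10, including the dip

# cummin turns False permanently once the first False is seen,
# so only the leading run of True values survives.
leading = below.cummin()        # True, False, False, False
shift = int(leading.sum())      # length of the leading run only

aligned = cases.shift(-shift)   # day 0 is now the first value >= 10
print(naive_shift, shift)
print(aligned)
```

A side effect of this approach is that a column which never reaches 10 gets a shift equal to its full length, so it ends up entirely NaN.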