I have a DataFrame to process (similar to the one below).
| user | weight | height | timestamp |
| ------ | ----- |------ |---------------------------------|
| user 1 | NA | 180 | 2017-11-09 17:21:49.025259+00:00|
| user 1 | 90 | NA | 2017-11-09 17:21:49.025259+00:00|
| user 1 | 89 | NA | 2017-11-10 17:21:49.025259+00:00|
| user 1 | 88 | NA | 2017-11-11 17:21:49.025259+00:00|
| user 2 | NA | 175 | 2017-11-09 17:21:49.025259+00:00|
| user 2 | 95 | NA | 2017-11-09 17:21:49.025259+00:00|
| user 3 | NA | 170 | 2017-12-09 17:21:49.025259+00:00|
| user 4 | 70 | NA | 2017-08-09 17:21:49.025259+00:00|
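My end goal is to compute each user's BMI from the most recently timestamped weight and the most recently timestamped height. Some users have only a single measurement; many have several weights, several heights, or both. Since the data is in kg and cm, BMI is
BMI = (weight (kg) x 10,000) / (height (cm))^2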
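As a quick sanity check on the formula, here is a minimal sketch. The function name `bmi_from_kg_cm` is made up for illustration; the inputs are the latest weight and height for user 1 and user 2 from the table above.

```python
def bmi_from_kg_cm(weight_kg: float, height_cm: float) -> float:
    """BMI from weight in kilograms and height in centimetres."""
    # The factor 10,000 converts cm^2 to m^2 (100 cm * 100 cm).
    return weight_kg * 10_000 / height_cm ** 2


# user 1: latest weight 88 kg, latest height 180 cm
bmi_user_1 = bmi_from_kg_cm(88, 180)  # about 27.16
# user 2: latest weight 95 kg, latest height 175 cm
bmi_user_2 = bmi_from_kg_cm(95, 175)  # about 31.02
print(bmi_user_1, bmi_user_2)
```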
I would like to get back something like the following (a dictionary would be enough).
| user | BMI |
| ------ | ------------------|
| user 1 | 27.1604938272 |
| user 2 | 31.0204081633 |
| user 3 | NA |
| user 4 | NA |
Answer 0 (score: 1)
Use:
import pandas as pd
# convert the timestamp strings to datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])
# sort the rows chronologically
df = df.sort_values('timestamp')
# for each user, take the last non-NaN value in every column
df1 = df.groupby('user').last()
# apply the BMI formula (weight in kg, height in cm)
BMI = (df1['weight'] * 10000 / df1['height']**2).reset_index(name='BMI')
print(BMI)
user BMI
0 user 1 27.160494
1 user 2 31.020408
2 user 3 NaN
3 user 4 NaN
# the same result as a dictionary keyed by user
d = (df1['weight'] * 10000 / df1['height']**2).to_dict()
print(d)
{'user 1': 27.160493827160494,
'user 2': 31.020408163265305,
'user 3': nan,
'user 4': nan}
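For anyone who wants to reproduce this end to end, the following is a self-contained sketch of the same steps. It assumes the `NA` cells in the question are real missing values (`NaN`), not the string `'NA'`; the variable names `rows`, `latest`, and `result` are illustrative. A stable sort (`kind="mergesort"`) is used here as a precaution so that rows sharing a timestamp keep their original order; the answer above uses the default sort.

```python
import numpy as np
import pandas as pd

ts = "17:21:49.025259+00:00"
# Sample data copied from the question; np.nan stands in for NA.
rows = [
    ("user 1", np.nan, 180, f"2017-11-09 {ts}"),
    ("user 1", 90, np.nan, f"2017-11-09 {ts}"),
    ("user 1", 89, np.nan, f"2017-11-10 {ts}"),
    ("user 1", 88, np.nan, f"2017-11-11 {ts}"),
    ("user 2", np.nan, 175, f"2017-11-09 {ts}"),
    ("user 2", 95, np.nan, f"2017-11-09 {ts}"),
    ("user 3", np.nan, 170, f"2017-12-09 {ts}"),
    ("user 4", 70, np.nan, f"2017-08-09 {ts}"),
]
df = pd.DataFrame(rows, columns=["user", "weight", "height", "timestamp"])

# Parse the timestamps, then order the rows chronologically.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.sort_values("timestamp", kind="mergesort")

# GroupBy.last() returns, per user and per column, the last non-null value,
# so the latest weight and latest height may come from different rows.
latest = df.groupby("user").last()

# BMI with weight in kg and height in cm; NaN propagates when either is missing.
result = (latest["weight"] * 10000 / latest["height"] ** 2).to_dict()
print(result)
```

Users with no recorded weight or no recorded height (user 3 and user 4 here) end up with `nan`, which matches the `NA` rows in the expected output.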