Question

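I was asked to analyze a database from a medical records application. A bunch of the records look like this:

So I have to resume more than 3 million records from 2011 to 2014. I know the patient ID repeats across rows, since a patient may have visited the doctor several times. How can I group the records, or resume them, by patient?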

Answer 1

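I don't know what you mean by "resume", but it looks like all you want to do is sort and display the data in a nicer way. You can visually group (= order) the records "px- and fecha-wise" like this: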

df.set_index(['px', 'fecha'], inplace=True)
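As a small illustration of the indexing idea, here is a sketch on made-up data. Only the column names px and fecha come from the question; the values, the diagnostico column, and the extra sort_index call are assumptions added for the example.

```python
import pandas as pd

# Hypothetical records; only the column names px and fecha come from the question.
df = pd.DataFrame({
    "px": [2, 1, 2, 1],
    "fecha": pd.to_datetime(["2012-03-01", "2011-05-10", "2011-01-15", "2013-07-20"]),
    "diagnostico": ["B", "A", "C", "D"],  # made-up extra column
})

# Move patient and date into the index, then sort so that each patient's
# visits sit together in chronological order.
df.set_index(["px", "fecha"], inplace=True)
df.sort_index(inplace=True)

print(df)
```

Setting the index alone does not reorder the rows, which is why the sketch also sorts the index.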

Edit

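When you group data by some common attribute, you have to decide what kind of aggregation to apply to the data in the other columns. Simply put, once you perform a groupby, each remaining column has room for only one value per "pacient_id", so you have to use an aggregate function (for example sum, mean, min, max, count, ...) that returns the desired representative value of the grouped data.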
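To make the "one value per group" point concrete, here is a minimal sketch with made-up numbers (the visits frame and its values are assumptions for illustration, not the asker's data):

```python
import pandas as pd

# Hypothetical visits: pacient 1 came twice, pacient 2 once.
visits = pd.DataFrame({
    "pacient_id": [1, 1, 2],
    "bill": [100, 50, 70],
})

grouped = visits.groupby("pacient_id")["bill"]

# Each aggregate collapses the rows of a pacient into a single value.
total = grouped.sum()       # total billed per pacient
largest = grouped.max()     # largest single bill per pacient
n_visits = grouped.count()  # number of visits per pacient

print(total, largest, n_visits, sep="\n\n")
```

Which aggregate is appropriate depends on the column: a count makes sense for visits, while a sum or mean is more natural for a bill.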

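Since your data is locked in an image, it is hard to work with, and I can't tell what you mean by "Age" because that column is not visible. Still, I hope you can achieve what you want by looking at the following example, which uses dummy data: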

import random
from datetime import timedelta

import numpy as np
import pandas as pd


def random_datetime_list_generator(start_date, end_date, n):
    # yield n random datetimes between start_date and end_date
    span_seconds = int((end_date - start_date).total_seconds())
    return (start_date + timedelta(seconds=random.randint(0, span_seconds))
            for _ in range(n))


# create a random dataframe with 4 sample columns and 50000 rows
rows = 50000
pacient_id = np.random.randint(100, 200, rows)
dates = random_datetime_list_generator(pd.to_datetime("2011-01-01"),
                                       pd.to_datetime("2014-12-31"), rows)
age = np.random.randint(10, 80, rows)
bill = np.random.randint(1, 1000, rows)

df = pd.DataFrame({"pacient_id": pacient_id, "visited": list(dates),
                   "age": age, "bill": bill})
print(df.head())

# 1. Only compute statistics on the last visit of each pacient
stats = df.groupby("pacient_id", as_index=False)["visited"].max()
stats.columns = ["pacient_id", "last_visited"]
print(stats)

# 2. Perform slightly more complex statistics per pacient by specifying
#    the desired aggregate function for each column
custom_aggregation = {"visited": ["min", "max"], "age": ["mean"], "bill": ["mean"]}

# perform a group by with custom aggregation
stats = df.groupby("pacient_id").agg(custom_aggregation)

# rename the aggregated columns (the nested-dict renaming accepted by
# older pandas versions is no longer supported since pandas 1.0)
stats.columns = pd.MultiIndex.from_tuples([
    ("visited", "first visit"), ("visited", "last visit"),
    ("age", "mean"), ("bill", "average bill"),
])

# round floats
stats = stats.round(1)
print(stats)

The original dummy dataframe looks like this:

   pacient_id             visited  age  bill
0         150 2012-12-24 21:34:17   20   188
1         155 2012-10-26 00:34:45   17   672
2         116 2011-11-28 13:15:18   33   360
3         126 2011-06-03 17:36:10   58   167
4         165 2013-07-15 15:39:31   68   815

The first aggregation looks like this:

   pacient_id        last_visited
0         100 2014-12-29 00:01:11
1         101 2014-12-22 06:00:48
2         102 2014-12-26 11:51:41
3         103 2014-12-29 15:01:32
4         104 2014-12-18 15:29:28
5         105 2014-12-30 11:08:29

The second, more complex aggregation looks like this:

                       visited                       age          bill
                   first visit          last visit  mean  average bill
pacient_id
100        2011-01-06 06:11:33 2014-12-29 00:01:11  45.2         507.9
101        2011-01-01 20:44:55 2014-12-22 06:00:48  44.0         503.8
102        2011-01-02 17:42:59 2014-12-26 11:51:41  43.2         498.0
103        2011-01-01 03:07:41 2014-12-29 15:01:32  43.5         495.1
104        2011-01-07 18:58:11 2014-12-18 15:29:28  45.9         501.7
105        2011-01-01 03:43:12 2014-12-30 11:08:29  44.3         513.0

I hope this example helps. There is also a good SO question about pandas groupby aggregation that can teach you a lot about this topic:

Pandas groupby on many columns with agg()
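As a follow-up to the per-pacient statistics discussed in this answer, here is a sketch of the same kind of summary written with named aggregation, which is available in pandas 0.25 and later and gives flat, readable column names. The dummy data, the seed, the reduced number of pacients, and the output column names (visits, first_visit, and so on) are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary fixed seed for repeatable dummy data
rows = 1000
four_years_s = 4 * 365 * 24 * 3600  # rough span of 2011-2014 in seconds

# Hypothetical visit log with the same columns as the dummy data in the answer.
df = pd.DataFrame({
    "pacient_id": rng.integers(100, 105, rows),
    "visited": pd.to_datetime("2011-01-01")
               + pd.to_timedelta(rng.integers(0, four_years_s, rows), unit="s"),
    "age": rng.integers(10, 80, rows),
    "bill": rng.integers(1, 1000, rows),
})

# Named aggregation: output_column=(input_column, aggregate_function)
stats = df.groupby("pacient_id").agg(
    visits=("visited", "count"),
    first_visit=("visited", "min"),
    last_visit=("visited", "max"),
    mean_age=("age", "mean"),
    average_bill=("bill", "mean"),
).round(1)

print(stats)
```

The visits column directly answers how many times each pacient appears in the data, which is often the first thing to check when the same ID repeats across millions of rows.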
