DataFrame groupby with multi-level selection

Time: 2019-02-10 18:41:08

Tags: python python-3.x pandas

I am trying to use pandas to solve a problem I have already solved in plain Python, but I don't understand the best practices for DataFrame groupby.

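For each post code, I want to select the most-prescribed drug, together with its share of all drugs prescribed in that post code. If two drugs have the same prescribed quantity, I want to pick the one that comes first alphabetically: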

import pandas as pd

drugs_prescriptions = pd.DataFrame({'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
                    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
                    'Quantity': [3, 6, 5, 7, 7, 8]})

#   Drug    PostCode    Quantity
# 0 D1      P1          3
# 1 D2      P1          6
# 2 D1      P1          5
# 3 D2      P2          7
# 4 D1      P2          7
# 5 D2      P3          8

#This should be the RESULT:
# postCode, drug with highest quantity, percentage of all drugs per post code
# (post code P2 has two drugs with the same quantity, so the alphabetically first one is selected)
# [('P1', 'D1', 0.57),
# ('P2', 'D1', 0.50),
# ('P3', 'D2', 1)]

I have already grouped by PostCode and Drug, but I am having trouble selecting the rows (with apply and a lambda).

durg_qualtity_per_post_code = drugs_prescriptions.groupby(['PostCode', 'Drug']).agg('sum')


This is the total quantity of all drugs per post code, which I plan to combine with the previous dataset using apply or transform:

all_by_post_code = drugs_prescriptions.groupby(['PostCode'])['Quantity'].sum()
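
One way to combine the two grouped results, offered here as a sketch and not as the asker's method, is `Series.div` with its `level` argument, which aligns the per-post-code totals against the matching level of the two-level index. The names `per_drug` and `share` are illustrative and do not appear in the question.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({
    'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
    'Quantity': [3, 6, 5, 7, 7, 8],
})

# Quantity summed per (PostCode, Drug): a Series with a two-level index
# (per_drug is an illustrative name, not from the question)
per_drug = drugs_prescriptions.groupby(['PostCode', 'Drug'])['Quantity'].sum()

# Quantity summed per PostCode: a Series indexed by PostCode only
all_by_post_code = drugs_prescriptions.groupby(['PostCode'])['Quantity'].sum()

# Divide each (PostCode, Drug) sum by the total of its PostCode;
# level='PostCode' tells pandas which index level to align the totals on
share = per_drug.div(all_by_post_code, level='PostCode')
print(share)
```

Keeping the result as a Series indexed by (PostCode, Drug) means the share of any pair can be looked up directly, for example `share.loc[('P1', 'D1')]`.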


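I am not sure how to select, for each post code, the row with the largest drug quantity. If two drugs have the same quantity, the alphabetically first drug should be selected (D1 for post code P2).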

I would like to do something like this:

durg_qualtity_per_post_code [durg_qualtity_per_post_code .apply(lambda x: int(x['Quantity']) == max_items_by_post_code[x['post_code']], axis=1, reduce=True)]
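
The row selection described above can probably be expressed without a row-wise lambda. The sketch below assumes the per-drug sums are kept as a Series and uses `idxmax` on each post-code group. `idxmax` returns the label of the first maximum, and since `groupby` sorts its keys by default, a tie resolves to the alphabetically first drug. The names `per_drug`, `top_labels`, and `top` are illustrative.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({
    'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
    'Quantity': [3, 6, 5, 7, 7, 8],
})

# Sum per (PostCode, Drug); groupby sorts the keys, so drugs are in
# alphabetical order within each post code
per_drug = drugs_prescriptions.groupby(['PostCode', 'Drug'])['Quantity'].sum()

# For each post code, the (PostCode, Drug) label of the first maximum;
# on a tie the earlier (alphabetically first) drug wins
top_labels = per_drug.groupby(level='PostCode').idxmax()

# Look up the winning rows by their index labels
top = per_drug.loc[top_labels.tolist()]
print(top)
```

This relies on the default `sort=True` of `groupby`; with `sort=False` the tie-break would follow the order of appearance instead.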

Update:

# sum quantities per (PostCode, Drug), keeping both as regular columns
df = drugs_prescriptions.groupby(['PostCode', 'Drug'], as_index=False)['Quantity'].sum()

# within each PostCode, put the largest Quantity first; ties are broken alphabetically by Drug
df = df.sort_values(['PostCode', 'Quantity', 'Drug'], ascending=[True, False, True])

# select the first row per PostCode (Drug stays in the output as a column)
df = df.groupby(['PostCode']).first()

# calculate the share of the total by PostCode; the division aligns on the PostCode index
allQuantities = drugs_prescriptions.groupby(['PostCode'])['Quantity'].sum()
df['Quantity'] = df['Quantity'] / allQuantities

1 answer:

Answer 0 (score: 2)

Here is one possible solution. It feels clunky and not very Pythonic, but it works. The comments are in the code.

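# start from the question's frame (assumed here to be what the answer calls df)
df = drugs_prescriptions.copy()

# convert Quantity from string to integer (only needed if it was read in as strings)
df.Quantity = df.Quantity.astype('int')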

# create a MultiIndex
df.set_index(['PostCode', 'Drug'], inplace=True)

# use transform to divide the sum of the 'Drug' level by the 'PostCode' level
df = df.groupby(level=[0,1]).transform('sum') / df.groupby(level=0).transform('sum')

# move 'Drug' out of the multi index to allow for sorting
df.reset_index(level=[1], inplace=True)

# sort 'Quantity' in descending order and 'Drug' in ascending order,
# then select the first row per 'PostCode' for our result
df.sort_values(['Quantity','Drug'], ascending=[False, True], inplace=True)

df.groupby('PostCode').first()

           Drug Quantity
PostCode        
P1          D1  0.571429
P2          D1  0.500000
P3          D2  1.000000
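
For comparison, here is a more compact sketch that ends in the list of tuples the question asks for. It is an alternative under stated assumptions, not the answerer's code: it ranks rows with `sort_values`, keeps the first row per post code with `drop_duplicates`, and rounds the share to two decimals as in the expected result. The column name `Share` and the variables `per_drug`, `top`, and `result` are illustrative.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({
    'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
    'Quantity': [3, 6, 5, 7, 7, 8],
})

# Sum per (PostCode, Drug), keeping both keys as ordinary columns
per_drug = drugs_prescriptions.groupby(['PostCode', 'Drug'], as_index=False)['Quantity'].sum()

# Share of each drug within its post code ('Share' is an illustrative column name)
per_drug['Share'] = per_drug['Quantity'] / per_drug.groupby('PostCode')['Quantity'].transform('sum')

# Largest quantity first within each post code, ties broken alphabetically,
# then keep only the first row per post code
top = (per_drug
       .sort_values(['PostCode', 'Quantity', 'Drug'], ascending=[True, False, True])
       .drop_duplicates('PostCode'))

# Build plain Python tuples, rounding the share to two decimals
result = [(pc, drug, round(float(share), 2))
          for pc, drug, share in zip(top['PostCode'], top['Drug'], top['Share'])]
print(result)  # [('P1', 'D1', 0.57), ('P2', 'D1', 0.5), ('P3', 'D2', 1.0)]
```

Ranking on the integer `Quantity` column instead of the floating-point share keeps the tie-break exact.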