I have a pandas DataFrame like:
%pylab inline
import seaborn as sns
sns.set(color_codes=True)
import pandas as pd
import numpy as np
df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5,
6, 7, 8, 9],
"is_sick": [0, 0, 0, 0, 0,
0, 1, 1, 1],
"sex": ["male", "female", "male", "female", "female",
"male", "male", "female", "female"],
"age_group": ["young", "old", "old", "young",
"small", "old", "young", "young",
"old"],
"metric_1": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
df['date'] = '2019-01-01'
df['qcut_metric_1'] = pd.qcut(df.metric_1, [0, .25, .5, .66, .75, .97, 1])
# make some more data
df_2 = df.copy()
df_2['date'] = '2019-02-01'
df = pd.concat([df, df_2])
Now, for each bin of that metric, I want to compute the percentage of sick patients per group/cohort [(sex), (age_group), (sex, age_group)].
Note that I already know what a single aggregation looks like; for sex it could be something like:
df['sick_percentage__sex'] = df.groupby(['sex']).is_sick.transform(pd.Series.mean)
A naive pivot table could look like:
pd.pivot_table(df, values='sick_percentage__sex', index=['qcut_metric_1', 'sex'], columns=[], aggfunc=np.mean)
which gives:
                      sick_percentage__sex
qcut_metric_1 sex
(0.999, 2.0]  female                  0.40
              male                    0.25
(2.0, 3.0]    female                  0.40
(3.0, 4.28]   male                    0.25
(4.28, 5.0]   male                    0.25
(5.0, 6.76]   female                  0.40
(6.76, 7.0]   female                  0.40
But this does not work for displaying the sickness percentage across the binned metric (qcut_metric_1) and all cohorts ([(sex), (age_group), (sex, age_group)]) at the same time. How can I adapt it? Perhaps with some kind of multi-dimensional aggregation?
Desired output format:
qcut_metric_1, cohort, percentage_of_sickness
np.mean as the pivot aggregation function may give skewed results, because the grouped means are not interchangeable with the overall mean if the number of users per group is not constant. I therefore need a weighted average instead, and have updated the example dataset accordingly.
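For instance, in the sample data above the two sex groups contain 4 and 5 unique users, so the plain mean of the two group rates is not the overall sickness rate; a quick check (a minimal sketch using the rates and group sizes taken from the sample data):

group_rates = np.array([0.25, 0.40])   # male / female sickness rates in the sample data
group_sizes = np.array([4, 5])         # unique users per sex
print(group_rates.mean())                             # 0.325, unweighted mean of group means
print(np.average(group_rates, weights=group_sizes))   # ~0.333, matches the overall rate 3/9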
agg = df.groupby(['sex']).agg({'user_id':pd.Series.nunique, 'is_sick':pd.Series.mean})
agg.columns = ['unique_users', 'sick_percentage__sex']
df = df.merge(agg, on='sex')
This now gives the dataframe to feed into the pivot table.
But now I am also struggling with the syntax of the weighted mean:
def wavg(x):
    print(x)
    return np.average(x['sick_percentage__sex'], weights=x['unique_users'])
as pd.pivot_table(df, values=['sick_percentage__sex', 'unique_users'], index=['qcut_metric_1', 'sex'], columns=[], aggfunc=wavg) passes only a single Series (not both, values + weights) to the function.
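One way around this limitation (a sketch, not the pivot_table API itself) is to drop pivot_table and use groupby().apply(), which hands each whole sub-DataFrame to the function, so both the value column and the weight column are visible:

weighted = (df.groupby(['qcut_metric_1', 'sex'])
              .apply(lambda g: np.average(g['sick_percentage__sex'], weights=g['unique_users']))
              .reset_index()
              .rename({0: 'sick_percentage__sex_weighted'}, axis=1))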
Answer 0 (score: 0)
Maybe a pivot table is not the right way to solve this problem.
A minimal solution could look like the code further below, iterating over all cohorts.
Is it possible to come up with a more efficient solution? My input file is 120 GB as uncompressed CSV (still 3 GB gzip-compressed), which means pandas needs roughly 35 GB of RAM.
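As a side note on the memory concern (a sketch only, with a hypothetical file name data.csv.gz and only the columns used in this example): for the real data, the toy DataFrame constructed below would be replaced by a read that fixes the dtypes up front, since category columns for the low-cardinality strings and an int8 flag for is_sick need far less RAM than the default object/int64 columns:

# Hypothetical file name and column list; adjust to the real input.
dtypes = {"user_id": "int64",
          "is_sick": "int8",       # 0/1 flag fits into one byte
          "sex": "category",       # low-cardinality strings as categoricals
          "age_group": "category",
          "metric_1": "float32"}
df = pd.read_csv("data.csv.gz", dtype=dtypes, parse_dates=["date"])  # compression inferred from .gz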
%pylab inline
import seaborn as sns
sns.set(color_codes=True)
import pandas as pd
import numpy as np
df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5,
6, 7, 8, 9],
"is_sick": [0, 0, 0, 0, 0,
0, 1, 1, 1],
"sex": ["male", "female", "male", "female", "female",
"male", "male", "female", "female"],
"age_group": ["young", "old", "old", "young",
"small", "old", "young", "young",
"old"],
"metric_1": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
df['date'] = '2019-01-01'
df['qcut_metric_1'] = pd.qcut(df.metric_1, [0, .25, .5, .66, .75, .97, 1])
# make some more data
df_2 = df.copy()
df_2['date'] = '2019-02-01'
df = pd.concat([df, df_2])
cohorts = [['sex', 'age_group'], ['sex'], ['age_group']]
for cohort in cohorts:
    cohort_name = '_'.join(cohort)
    # print(cohort_name)
    agg = df.groupby(cohort).agg({'user_id': pd.Series.nunique, 'is_sick': pd.Series.mean})
    sick_percentage_column = f'sick_percentage__{cohort_name}'
    agg.columns = ['unique_users', sick_percentage_column]
    merged = df.merge(agg, on=cohort)  # INNER (default) JOIN ok, as agg derived from total => no values lost
    groupings = ['qcut_metric_1']
    groupings.extend(cohort)
    result = (merged.groupby(groupings)
                    .apply(lambda x: np.average(x[sick_percentage_column], weights=x['unique_users']))
                    .reset_index()
                    .rename({0: sick_percentage_column}, axis=1))
    display(result)
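To end up with the single qcut_metric_1, cohort, percentage_of_sickness table asked for above, one option (a sketch meant to run inside the loop right after result is built; the cohort label format is just one possible convention) is to reshape each per-cohort result and collect the pieces:

# inside the loop, after `result` (with frames = [] initialised before the loop):
long = result.copy()
# build one readable cohort label per row, e.g. "sex=male,age_group=old"
long['cohort'] = long[cohort].astype(str).apply(
    lambda row: ','.join(f'{c}={v}' for c, v in zip(cohort, row)), axis=1)
long = long[['qcut_metric_1', 'cohort', sick_percentage_column]].rename(
    columns={sick_percentage_column: 'percentage_of_sickness'})
frames.append(long)

pd.concat(frames, ignore_index=True) after the loop then yields one long table across all cohorts.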