Question

Here is my data:

{'SystemID': {0: '95EE8B57',
1: '5F891F03',
2: '5F891F03',
3: '5F891F03',
4: '95EE8B57'},
'Activity': {0: '2', 1: '4', 2: '6',   3: '8', 4: '1'}}

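I need to summarize the 'Activity' column of this dataset. For each 'SystemID', I need to compute the percentage of values that fall into each of the following 4 categories: less than 2; 2 or more but less than 4; 4 or more but less than 6; 6 or more.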

The desired result for the data above is:

{'SystemID': {0: '95EE8B57',
1: '5F891F03'},
'Activity-perc-less2': {0: '50', 1: '0'},
'Activity-perc-less4': {0: '50', 1: '0'},
'Activity-perc-less6': {0: '0', 1: '33.3'},
'Activity-perc-6-and-above': {0: '0', 1: '66.7'}}

How can I do this?

Answer 1

I'm not sure this is the most elegant approach, but the following seems to do what I need:

import numpy as np
import pandas as pd

dict2 = {'SystemID': {0: '95EE8B57',
1: '5F891F03',
2: '5F891F03',
3: '5F891F03',
4: '95EE8B57'},
'Activity': {0: '2', 1: '4', 2: '6',   3: '8', 4: '1'}}

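df2 = pd.DataFrame.from_dict(dict2)
bins = np.array([2, 4, 6])
# Activity is stored as strings, so convert it before binning
df2.Activity = df2.Activity.astype(int)

# Solution:

# Bin index per row: 0 = less than 2, 1 = [2, 4), 2 = [4, 6), 3 = 6 or more
df2['ActBins'] = np.digitize(df2.Activity, bins)
# Count rows per SystemID and bin; margins=True appends an 'All' total row and column
table = pd.pivot_table(df2, index=["SystemID"], columns=["ActBins"],
                       aggfunc=len, margins=True, dropna=True, fill_value=0)
# Divide each row by its 'All' total to get percentages
table2 = 100 * table.div(table.iloc[:, -1], axis=0)
# Keep the two SystemID rows and the four bin columns (drops the 'All' margins)
table3 = table2.iloc[[0, 1], [0, 1, 2, 3]]
table3.columns = ['Activity-perc-less2', 'Activity-perc-less4',
                  'Activity-perc-less6', 'Activity-perc-6-and-above']
print(table3)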
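
To make the binning step easier to follow, here is a small standalone sketch of what np.digitize does with the sample activity values. With the default right=False, a value x gets index i when bins[i-1] <= x < bins[i], so index 0 means "less than 2" and index 3 means "6 or more". The variable names activity and bin_index are only illustrative.

```python
import numpy as np

bins = np.array([2, 4, 6])
activity = np.array([2, 4, 6, 8, 1])

# Index i means bins[i-1] <= value < bins[i];
# 0 is below the first edge, len(bins) is at or above the last edge.
bin_index = np.digitize(activity, bins)
print(bin_index.tolist())  # [1, 2, 3, 3, 0]
```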

If anyone finds a more elegant solution, please post it.
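
One possibly tidier route, offered as a sketch rather than a definitive answer, is to label the bins with pd.cut and let pd.crosstab compute the row-wise shares through normalize="index". The frame name df and the infinite outer edges are my own assumptions; the column labels are taken from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SystemID": ["95EE8B57", "5F891F03", "5F891F03", "5F891F03", "95EE8B57"],
    "Activity": ["2", "4", "6", "8", "1"],
})
df["Activity"] = df["Activity"].astype(int)

labels = ["Activity-perc-less2", "Activity-perc-less4",
          "Activity-perc-less6", "Activity-perc-6-and-above"]

# right=False closes each interval on the left:
# [-inf, 2), [2, 4), [4, 6), [6, inf)
binned = pd.cut(df["Activity"], bins=[-np.inf, 2, 4, 6, np.inf],
                right=False, labels=labels)

# normalize="index" divides every row by its own total
result = pd.crosstab(df["SystemID"], binned, normalize="index") * 100
print(result.round(1))
```

On the sample data this should give 50/50/0/0 for 95EE8B57 and 0/0/33.3/66.7 for 5F891F03, matching the question. If some category never occurs in the data, its column may be missing, so a reindex over labels could be needed in that case.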

Edit:

Here is the solution above extracted into a function:

def perc_pivot(df, ind, col, bins):
    # Note: this adds a col+'Bins' column to the DataFrame that is passed in
    df[col + 'Bins'] = np.digitize(df[col], bins)
    table = pd.pivot_table(df, index=[ind], columns=[col + 'Bins'],
                           aggfunc=len, margins=True, dropna=True, fill_value=0)
    table = 100 * table.div(table.iloc[:, -1], axis=0)
    # Drop the 'All' margin row, then the 'All' margin column
    table.drop(table.tail(1).index, inplace=True)
    return table.drop(table.columns[len(table.columns) - 1], axis=1)

A simple call

df3 = perc_pivot(df2, 'SystemID', 'Activity', bins)

produces the desired output (apart from the column names). The columns can be renamed manually afterwards.

The code in this function still looks a bit clumsy to me, so I welcome suggestions for making it cleaner.
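
As one possible cleanup, here is a sketch that avoids both the margins bookkeeping and the extra column written into the caller's DataFrame. It relies on groupby(...).value_counts(normalize=True). The name perc_by_bin and the parameter edges are hypothetical, and it assumes the column has already been converted to a numeric type.

```python
import numpy as np
import pandas as pd

def perc_by_bin(df, ind, col, edges):
    """Percentage of `col` values per bin, for each `ind` group (hypothetical helper)."""
    # Keep the bin indices in a separate Series so `df` is left untouched.
    bin_index = pd.Series(np.digitize(df[col], edges),
                          index=df.index, name=col + "Bins")
    # Share of each bin within each group, spread into one column per bin.
    shares = (bin_index.groupby(df[ind])
                       .value_counts(normalize=True)
                       .unstack(fill_value=0))
    # Make sure every bin 0..len(edges) has a column, even if it is empty.
    shares = shares.reindex(columns=range(len(edges) + 1), fill_value=0)
    return 100 * shares

df = pd.DataFrame({
    "SystemID": ["95EE8B57", "5F891F03", "5F891F03", "5F891F03", "95EE8B57"],
    "Activity": [2, 4, 6, 8, 1],
})
out = perc_by_bin(df, "SystemID", "Activity", [2, 4, 6])
print(out.round(1))
```

The columns come back as the bin indices 0 to 3, so they would still need renaming afterwards, as with the function above.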

Summarizing the distribution of values in a pandas DataFrame
