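Say I have three item categories A, B and C, which can have three different statuses: Started, Finished and Canceled. I'm trying to find the percentage of items across all categories that have, for example, Status = Finished. I think a first step should be to display the counts in a matrix like the one below: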
Input:
Category Status
A Started
A Started
A Finished
A Finished
B Started
B Canceled
B Canceled
C Started
C Finished
Desired output:
   Started  Finished  Canceled
A        2         2         0
B        1         0         2
C        1         1         0
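For reference, a count matrix of this shape can be sketched with `pd.crosstab`, which fills combinations that never occur with 0. This is only a minimal sketch: the DataFrame below is built by hand to mirror the input table, and the explicit column order is an assumption made to match the desired output (crosstab sorts the columns alphabetically by default).

```python
import pandas as pd

# Hand-built copy of the input table (assumption: same rows as above)
df = pd.DataFrame({
    "Category": list("AAAABBBCC"),
    "Status": ["Started", "Started", "Finished", "Finished",
               "Started", "Canceled", "Canceled",
               "Started", "Finished"],
})

# Count every Category/Status pair; missing pairs become 0
counts = pd.crosstab(df["Category"], df["Status"])

# Reorder the columns to match the desired output
counts = counts[["Started", "Finished", "Canceled"]]
print(counts)
```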
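But I'm struggling to identify the statuses that a category could have but doesn't. In this example, that is Canceled = 0. I've been trying to do this by splitting a pandas dataframe by the unique values of Category and then merging the pieces so that non-existent combinations are filled with nan. But I think this is very slow for larger data sets. Also, I'm not quite there yet. If anyone would like to build on it, please see the code below. But I suspect there is a more efficient solution out there...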
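One way the percentage step might look, sketched under two assumptions: the DataFrame is a hand-built copy of the input table, and the full list of possible statuses (`all_statuses`, a name introduced here for illustration) is known up front. `normalize="index"` turns each row of the crosstab into shares of that category, and `reindex` adds a zero column for any status that never appears in the data at all.

```python
import pandas as pd

# Hand-built copy of the input table (assumption: same rows as above)
df = pd.DataFrame({
    "Category": list("AAAABBBCC"),
    "Status": ["Started", "Started", "Finished", "Finished",
               "Started", "Canceled", "Canceled",
               "Started", "Finished"],
})

# Assumption: the complete set of statuses is known in advance
all_statuses = ["Started", "Finished", "Canceled"]

# Share of each status within each category (rows sum to 1)
share = pd.crosstab(df["Category"], df["Status"], normalize="index")

# Make sure every possible status has a column, even if it never occurs
share = share.reindex(columns=all_statuses, fill_value=0.0)

# Convert to percentages
pct = (share * 100).round(1)
print(pct["Finished"])

# Percentage of all items, regardless of category, that are Finished
overall_finished = df["Status"].eq("Finished").mean() * 100
print(overall_finished)
```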
My attempt:
import pandas as pd
import numpy as np

# Copy the input table above to the clipboard before running this line
df = pd.read_clipboard(sep='\\s+')
# dft = df.T

frames = {}
n = 0
status = df['Status'].unique()

# Subset and create one dataframe per category
for category in df['Category'].unique():
    n = n + 1
    newname = 'df_' + str(n)
    print(newname)
    dfs = df[df['Category']==category]
    frames[newname] = dfs

# Join dataframes
df_main = frames['df_1']
frames.pop('df_1')
for key in frames:
    df_main = pd.merge(df_main, frames[key], on = 'Category', how = 'outer')

df_main = df_main.set_index(['Category'])
# Note: this relabeling only works because the number of categories
# happens to equal the number of statuses
df_main.columns = status
Output of df_main:
           Started  Finished  Canceled
Category
A          Started       NaN       NaN
A          Started       NaN       NaN
A         Finished       NaN       NaN
A         Finished       NaN       NaN
B              NaN   Started       NaN
B              NaN  Canceled       NaN
B              NaN  Canceled       NaN
C              NaN       NaN   Started
C              NaN       NaN  Finished
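In this output each column actually holds the Status values of one category's subset, merely relabeled with status names, so it is not yet a count matrix. A possible loop-free alternative, sketched here with a hand-built copy of the input table (an assumption), is to count the Category/Status pairs with `groupby(...).size()` and pivot the statuses into columns with `unstack(fill_value=0)`:

```python
import pandas as pd

# Hand-built copy of the input table (assumption: same rows as above)
df = pd.DataFrame({
    "Category": list("AAAABBBCC"),
    "Status": ["Started", "Started", "Finished", "Finished",
               "Started", "Canceled", "Canceled",
               "Started", "Finished"],
})

counts = (
    df.groupby(["Category", "Status"])
      .size()                   # number of rows per (Category, Status) pair
      .unstack(fill_value=0)    # statuses become columns; missing pairs -> 0
)
print(counts)
```

The columns come out in alphabetical order (Canceled, Finished, Started); they can be reordered by indexing with a list of column names if a particular order matters.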