Finding the frequency of value combinations in pandas columns

Time: 2018-09-04 13:43:31

Tags: python pandas

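Say I have three item categories, A, B and C, each of which can have one of three statuses: Started, Finished and Canceled. I am trying to find the percentage of categories, across all categories, that have, for example, Status = Finished. I think the first step should be to display the data in a matrix like the one shown below: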

Input:

Category Status
A Started
A Started
A Finished
A Finished
B Started
B Canceled
B Canceled
C Started
C Finished

Desired output:

    Started Finished Canceled
A   2       2        0
B   1       0        2
C   1       1        0
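One possible way to build a matrix like this is `pd.crosstab`, which counts every (Category, Status) pair and puts 0 where a pair never occurs. The snippet below is a minimal sketch, not a definitive answer: the DataFrame is typed in by hand from the input table, and the `all_statuses` list is an assumption about which statuses should appear as columns and in what order.

```python
import pandas as pd

# Sample data copied by hand from the input table in the question.
df = pd.DataFrame({
    "Category": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "Status": ["Started", "Started", "Finished", "Finished",
               "Started", "Canceled", "Canceled", "Started", "Finished"],
})

# Assumed full list of possible statuses, in the desired column order.
all_statuses = ["Started", "Finished", "Canceled"]

# Count every (Category, Status) pair; pairs that never occur become 0.
counts = pd.crosstab(df["Category"], df["Status"])

# crosstab sorts columns alphabetically and only knows statuses present in
# the data, so reindex to fix the order and add any missing status as 0.
counts = counts.reindex(columns=all_statuses, fill_value=0)

print(counts)
```

The `reindex` step only matters when a status never appears anywhere in the data, or when the column order is important; otherwise the `crosstab` result can be used directly.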

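However, I am struggling to identify the statuses that a category could have but does not, for example Canceled = 0 for A in this example. I have been trying to do this by subsetting the pandas dataframe by the unique values of Category and then merging the subsets so that non-existent combinations are filled with NaN. But I expect this to be very slow for larger datasets. Besides, I am not there yet. If anyone would like to build on my code, it is shown below. But I suspect there is a more efficient solution...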
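As a rough sketch of a merge-free alternative, the pairs can be counted with `groupby(...).size()` and pivoted with `unstack(fill_value=0)`, so that missing combinations become 0 instead of NaN. The percentage step depends on what is meant by "percentage"; the two readings below (share of each status within a category, and share of categories with at least one Finished item) are assumptions, and the DataFrame is again typed in by hand from the input table.

```python
import pandas as pd

# Sample data copied by hand from the input table in the question.
df = pd.DataFrame({
    "Category": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "Status": ["Started", "Started", "Finished", "Finished",
               "Started", "Canceled", "Canceled", "Started", "Finished"],
})

# Count each (Category, Status) pair, then pivot Status into columns.
# fill_value=0 covers pairs that never occur in the data.
counts = df.groupby(["Category", "Status"]).size().unstack(fill_value=0)

# Reading 1: share of each status within a category (each row sums to 1).
row_share = counts.div(counts.sum(axis=1), axis=0)

# Reading 2: fraction of categories with at least one "Finished" item.
finished_share = (counts["Finished"] > 0).mean()

print(row_share)
print(f"{finished_share:.1%} of categories have a Finished item")
```

Both steps are vectorized, so no per-category subsetting or merging is needed.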

My attempt:

import pandas as pd
import numpy as np

# Copy the input table above to the clipboard, then read it in
df = pd.read_clipboard(sep='\\s+')
frames = {}
n = 0

status = df['Status'].unique()

# Subset and create dataframes
for category in df['Category'].unique():
    n = n + 1
    newname = 'df_' + str(n)
    print(newname)
    dfs = df[df['Category']==category]
    frames[newname] = dfs

# Join dataframes
df_main = frames['df_1']
frames.pop('df_1')


for key in frames:
    df_main = pd.merge(df_main, frames[key], on = 'Category', how = 'outer')

df_main = df_main.set_index(['Category'])
df_main.columns = status

Output of df_main:

          Started  Finished  Canceled
Category                              
A          Started       NaN       NaN
A          Started       NaN       NaN
A         Finished       NaN       NaN
A         Finished       NaN       NaN
B              NaN   Started       NaN
B              NaN  Canceled       NaN
B              NaN  Canceled       NaN
C              NaN       NaN   Started
C              NaN       NaN  Finished

0 answers:

No answers