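Pandas - Select the top N most frequent categories across multiple columns and concatenate the resulting vectors

Time: 2018-02-05 01:53:35

Tags: python pandas

In Pandas, I have separated my data by type, and I need to summarize the frequencies of the categorical data. For each column I need to take all levels, up to a maximum of 50 levels.

Right now I have something like this (sample data follows):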

# Libraries
import numpy as np
import pandas as pd

# Categorical variables
df = pd.DataFrame(np.random.randint(low=0, high=1000000, size=(1000, 2)),
                  columns=['CASE_NUMBER', 'CLIENT_ID'])
df['CASE_NUMBER'] = df['CASE_NUMBER'].astype(str)
df['CLIENT_ID']   = df['CLIENT_ID'].astype(str)


# Draw random 0/1 codes for each categorical column (size=1000 gives a 1-D array)
for col in ['PRODUCTCATEGORY', 'PRODUCTTYPE', 'PRODUCT_CATEGORY_DESC', 'PRODUCT_DESC']:
    df[col] = np.random.randint(low=0, high=2, size=1000)

# Recode the integer codes as string labels; map() avoids writing strings
# into an integer column in place, which newer pandas versions warn about
df['PRODUCTCATEGORY']       = df['PRODUCTCATEGORY'].map({0: "AC2", 1: "AC1"})
df['PRODUCTTYPE']           = df['PRODUCTTYPE'].map({0: "AT2", 1: "AT1"})
df['PRODUCT_CATEGORY_DESC'] = df['PRODUCT_CATEGORY_DESC'].map({0: "Revocable", 1: "Irrevocable"})
df['PRODUCT_DESC']          = df['PRODUCT_DESC'].map({0: "Immediate", 1: ""})

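I have made some very ugly attempts that start out something like the code below, but besides being verbose, it is slow, and it also adds unnecessary rows if the maximum number of levels across all columns is < 50: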

e = df.describe()

table2 = pd.DataFrame({
        'Variable Name': e.columns,
    })

for n in e.columns:
    for i in range(50):
        grouped = df.groupby([n]).size().reset_index()
        grouped = grouped.sort_values(0, ascending=False)
        table2 = pd.concat([table2, grouped], ignore_index=True, axis=1)

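Here is an example of what I ultimately want to produce (note: the counts are made-up numbers that do not exactly correspond to the data above). You don't have to handle the Variable Name or Percent columns (bonus points if you do!):

[Image: example of the desired output table]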

1 Answer:

Answer 0 (score: 0)

The key to the solution was in the comment by @JonClements:

table2 = df.melt().groupby(['variable', 'value']).size() 
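
To see what this one-liner does, here is a minimal sketch on a hypothetical three-row frame (the name `toy`, the columns `A` and `B`, and their values are made up for illustration and are not part of the question's data). `melt` stacks every column into long (variable, value) pairs, and `groupby(...).size()` then counts how often each pair occurs:

```python
import pandas as pd

# Hypothetical toy data, not the question's sample data
toy = pd.DataFrame({'A': ['x', 'x', 'y'],
                    'B': ['p', 'q', 'q']})

# melt() stacks all columns into two columns: 'variable' (the original
# column name) and 'value' (the cell content)
long = toy.melt()

# Counting each (variable, value) pair gives the frequency of every
# level within every original column
counts = long.groupby(['variable', 'value']).size()
print(counts)
```

The result is a Series indexed by (variable, value), which is why the answer converts it to a frame and resets the index before adding further columns.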

From there I just added some logic to truncate and transform the result:

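# Turn the Series of group sizes into a tidy DataFrame
table2 = table2.to_frame(name='Count').reset_index()
table2['Percent'] = table2['Count'] / len(df.index)

# Keep at most the 50 most frequent levels per variable and collapse
# the remaining levels into a single 'Other' row
pieces = []
for v, tmp in table2.groupby('variable', sort=False):
    # Sort first so that the truncation keeps the most frequent levels
    tmp = tmp.sort_values('Count', ascending=False)
    if tmp.shape[0] > 50:
        rest = tmp.iloc[50:]
        other = pd.DataFrame([{'variable': v,
                               'value': 'Other',
                               'Count': rest['Count'].sum(),
                               'Percent': rest['Percent'].sum()}])
        tmp = pd.concat([tmp.iloc[:50], other])
    pieces.append(tmp)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
table2 = pd.concat(pieces, ignore_index=True)

print(table2)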
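
As a possible alternative, the same kind of summary can be sketched with `Series.value_counts`, which already returns levels sorted from most to least frequent. This is only a sketch under stated assumptions, not the method used in the answer: the function name `top_levels`, its parameter `n`, and the `demo` frame are all hypothetical. Missing values are dropped by `value_counts` by default, so they are not counted here.

```python
import pandas as pd

def top_levels(df, n=50):
    """Hypothetical helper: n most frequent levels per column, plus 'Other'."""
    frames = []
    for col in df.columns:
        # value_counts() sorts levels by frequency, most frequent first
        counts = df[col].value_counts()
        top = counts.iloc[:n]
        out = pd.DataFrame({'variable': col,
                            'value': list(top.index),
                            'Count': top.to_numpy()})
        # Everything beyond the first n levels is pooled into one row
        rest = counts.iloc[n:].sum()
        if rest > 0:
            other = pd.DataFrame({'variable': [col],
                                  'value': ['Other'],
                                  'Count': [rest]})
            out = pd.concat([out, other], ignore_index=True)
        out['Percent'] = out['Count'] / len(df)
        frames.append(out)
    return pd.concat(frames, ignore_index=True)

# Hypothetical demo data: column A has three levels, column B has two
demo = pd.DataFrame({'A': list('aaabbc'), 'B': list('xxxxyy')})
summary = top_levels(demo, n=2)
print(summary)
```

With `n=2`, column `A` keeps its two most frequent levels and pools the third into `Other`, while column `B` has only two levels and therefore gets no `Other` row.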