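In Pandas, I have split my data by type, and I need to summarize the frequencies of the categorical data. I need to show every level of each variable, up to a maximum of 50 levels.
Right now I have something like this (sample data below):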
# Libraries
import numpy as np
import pandas as pd
# Categorical variables
df = pd.DataFrame(np.random.randint(low=0,
                                    high=1000000,
                                    size=(1000, 2)),
                  columns=['CASE_NUMBER', 'CLIENT_ID'])
df['CASE_NUMBER'] = df['CASE_NUMBER'].apply(str)
df['CLIENT_ID'] = df['CLIENT_ID'].apply(str)
df['PRODUCTCATEGORY'] = np.random.randint(low=0, high=2, size=1000)
df['PRODUCTTYPE'] = np.random.randint(low=0, high=2, size=1000)
df['PRODUCT_CATEGORY_DESC'] = np.random.randint(low=0, high=2, size=1000)
df['PRODUCT_DESC'] = np.random.randint(low=0, high=2, size=1000)
df.loc[df['PRODUCTCATEGORY'] == 0 , 'PRODUCTCATEGORY'] = "AC2"
df.loc[df['PRODUCTCATEGORY'] == 1 , 'PRODUCTCATEGORY'] = "AC1"
df.loc[df['PRODUCTTYPE'] == 0 , 'PRODUCTTYPE'] = "AT2"
df.loc[df['PRODUCTTYPE'] == 1 , 'PRODUCTTYPE'] = "AT1"
df.loc[df['PRODUCT_CATEGORY_DESC'] == 0 , 'PRODUCT_CATEGORY_DESC'] = "Revocable"
df.loc[df['PRODUCT_CATEGORY_DESC'] == 1 , 'PRODUCT_CATEGORY_DESC'] = "Irrevocable"
df.loc[df['PRODUCT_DESC'] == 0 , 'PRODUCT_DESC'] = "Immediate"
df.loc[df['PRODUCT_DESC'] == 1 , 'PRODUCT_DESC'] = ""
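I made some very ugly attempts that started out something like the code below, but besides being verbose it is slow, and it also adds unnecessary rows if the maximum number of levels across all columns is < 50: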
e = df.describe()
table2 = pd.DataFrame({
    'Variable Name': e.columns,
})
for n in e.columns:
    for i in range(50):
        grouped = df.groupby([n]).size().reset_index()
        grouped = grouped.sort_values(0, ascending=False)
        table2 = pd.concat([table2, grouped], ignore_index=True, axis=1)
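Here is an example of what I am ultimately after (note: the counts are made-up numbers that do not correspond exactly to the data above). You don't have to handle Variable Name and Percent (bonus points if you do!):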
Answer 0 (score: 0)
The key to the solution was @JonClements's comment:
table2 = df.melt().groupby(['variable', 'value']).size()
From there I just added some logic to truncate and reshape the result:
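table2 = table2.to_frame(name='Count').reset_index()
table2['Percent'] = table2['Count'] / len(df.index)
# Process one variable at a time, keeping its 50 most frequent levels
parts = []
for v in table2['variable'].unique():
    # Exact match: str.contains(v) would also pick up variables whose names contain v
    tmp = table2[table2['variable'] == v].sort_values('Count', ascending=False)
    if tmp.shape[0] > 50:
        tmp0 = tmp.iloc[:50]
        rest = tmp.iloc[50:]
        # Lump everything past the top 50 into a single 'Other' row
        tmp1 = pd.DataFrame([{'variable': v,
                              'value': 'Other',
                              'Count': rest['Count'].sum(),
                              'Percent': rest['Percent'].sum()
                              }])
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        tmp = pd.concat([tmp0, tmp1], ignore_index=True)
    parts.append(tmp)
table2 = pd.concat(parts, ignore_index=True)
print(table2)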
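For what it's worth, the same truncate-and-lump idea can also be written per column with value_counts instead of melt plus a filtering loop. This is only a rough sketch, not the approach from the comment above: the function name summarize_levels, its max_levels and other_label parameters, and the demo frame are all made up for illustration.

```python
import pandas as pd


def summarize_levels(df, max_levels=50, other_label="Other"):
    """Hypothetical helper: count each column's values and lump the
    levels past max_levels into a single other_label row."""
    parts = []
    for col in df.columns:
        # value_counts sorts by frequency, most common first
        counts = df[col].value_counts()
        top = counts.iloc[:max_levels]
        rest = counts.iloc[max_levels:]
        part = pd.DataFrame({
            "variable": col,
            "value": top.index.to_numpy(),
            "Count": top.to_numpy(),
        })
        if len(rest) > 0:
            other = pd.DataFrame({
                "variable": [col],
                "value": [other_label],
                "Count": [rest.sum()],
            })
            part = pd.concat([part, other], ignore_index=True)
        parts.append(part)
    out = pd.concat(parts, ignore_index=True)
    # Share of all rows in the original frame
    out["Percent"] = out["Count"] / len(df)
    return out


# Made-up demo data with a cap of 2 levels per column
demo = pd.DataFrame({"kind": ["a", "a", "a", "b", "b", "c", "d"],
                     "flag": ["x", "y", "x", "x", "y", "x", "x"]})
summary = summarize_levels(demo, max_levels=2)
print(summary)
```

On the sample data from the question it would be called as summarize_levels(df). Note that value_counts drops missing values by default, so the percentages of a column containing NaN will not add up to 1 unless dropna=False is passed.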