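I ran into a problem while creating a dimension table named payment_types_Owned, which lists the number of products a customer owns, the balance, and the limit for each payment type. Currently, I have a table that looks like this: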
    cust_id  Payment Type X owned  Payment Type Y owned  Payment Type Z owned  Credit Used_X  Limit_X  Credit Used_Y  Limit_Y  Credit Used_Z  Limit_Z
0  Person_A                     1                     3                     4            300      700            700      800            400      900
1  Person_B                     2                     1                     3            400      600            100      150            400      500
2  Person_C                     2                     4                     4            500      600            700      800            100      500
Desired output:
        cust_id        variable  value  Credit Used  Limit
0  Person_A_key  Payment Type X      1          300    700
1  Person_A_key  Payment Type Y      3          700    800
2  Person_A_key  Payment Type Z      4          400    900
3  Person_B_key  Payment Type X      2          400    600
4  Person_B_key  Payment Type Y      1          100    150
5  Person_B_key  Payment Type Z      3          400    500
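Assume I already have two other dimension tables that capture the following information:
Customer Dimension Table - contains the cust_id primary key
Product Dimension Table - contains the unique product primary key
Using pd.melt(), I got the following, but it only partially solves my problem: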
(pd.melt(df, id_vars=['cust_id'], value_vars=['Payment Type X owned','Payment Type Y owned', 'Payment Type Z owned'])).sort_values(by=['cust_id'])
    cust_id        variable  value
0  Person_A  Payment Type X      1
3  Person_A  Payment Type Y      3
6  Person_A  Payment Type Z      4
1  Person_B  Payment Type X      2
4  Person_B  Payment Type Y      1
7  Person_B  Payment Type Z      3
2  Person_C  Payment Type X      2
5  Person_C  Payment Type Y      4
8  Person_C  Payment Type Z      4
Any suggestions?
Answer 0 (score: 1)
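Use wide_to_long, but first the first group of Payment Type columns has to be renamed with str.replace (applied here to df.columns), so that every column follows the same stub_suffix pattern: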
df.columns = df.columns.str.replace(' owned', '').str.replace('Payment Type ', 'Payment Type_')
print (df)
    cust_id  Payment Type_X  Payment Type_Y  Payment Type_Z  Credit Used_X  \
0  Person_A               1               3               4            300
1  Person_B               2               1               3            400
2  Person_C               2               4               4            500

   Limit_X  Credit Used_Y  Limit_Y  Credit Used_Z  Limit_Z
0      700            700      800            400      900
1      600            100      150            400      500
2      600            700      800            100      500
df1 = pd.wide_to_long(df, stubnames=['Payment Type', 'Credit Used', 'Limit'],
                      i='cust_id',
                      j='variable',
                      sep='_',
                      suffix=r'\w+').sort_index(level=0).reset_index()
Finally, prepend the string 'Payment Type ' to the variable column and rename the Payment Type column with a dict:
df1 = (df1.assign(variable='Payment Type ' + df1['variable'])
.rename(columns={'Payment Type':'value'}))
print(df1)
    cust_id        variable  value  Credit Used  Limit
0  Person_A  Payment Type X      1          300    700
1  Person_A  Payment Type Y      3          700    800
2  Person_A  Payment Type Z      4          400    900
3  Person_B  Payment Type X      2          400    600
4  Person_B  Payment Type Y      1          100    150
5  Person_B  Payment Type Z      3          400    500
6  Person_C  Payment Type X      2          500    600
7  Person_C  Payment Type Y      4          700    800
8  Person_C  Payment Type Z      4          100    500
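For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It assumes the sample data from the question, built directly as a DataFrame; the name long_df is only illustrative and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cust_id": ["Person_A", "Person_B", "Person_C"],
    "Payment Type X owned": [1, 2, 2],
    "Payment Type Y owned": [3, 1, 4],
    "Payment Type Z owned": [4, 3, 4],
    "Credit Used_X": [300, 400, 500],
    "Limit_X": [700, 600, 600],
    "Credit Used_Y": [700, 100, 700],
    "Limit_Y": [800, 150, 800],
    "Credit Used_Z": [400, 400, 100],
    "Limit_Z": [900, 500, 500],
})

# Bring the "owned" columns into the same <stub>_<suffix> shape as the rest,
# e.g. "Payment Type X owned" -> "Payment Type_X".
df.columns = (df.columns
              .str.replace(" owned", "", regex=False)
              .str.replace("Payment Type ", "Payment Type_", regex=False))

# Reshape: one row per (cust_id, suffix), one column per stub name.
long_df = (pd.wide_to_long(df,
                           stubnames=["Payment Type", "Credit Used", "Limit"],
                           i="cust_id", j="variable", sep="_", suffix=r"\w+")
             .sort_index()
             .reset_index())

# Restore the readable label and rename the count column.
long_df = (long_df.assign(variable="Payment Type " + long_df["variable"])
                  .rename(columns={"Payment Type": "value"}))
print(long_df)
```

The suffix pattern is passed as a raw string so that the backslash in \w+ is not interpreted as a Python string escape.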
Answer 1 (score: 0)
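If you can organize the columns as a MultiIndex with 'Payment Type X', ... on the first level, there is a relatively simple solution (at the end of this post you will find the code that brings the data frame into that format).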
With a MultiIndex on the columns as described above, the following code produces the output:
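result = None
# One pass per payment type found on the first column level
for col_group in set(df.columns.get_level_values(0)):
    # Select that group's columns and record the group name as a second index level
    df_group = df[col_group].assign(variable=col_group).set_index('variable', append=True)
    if result is None:
        result = df_group
    else:
        result = pd.concat([result, df_group], axis='index')
result.sort_index(inplace=True)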
After execution, the variable result holds a data frame that looks like this:
                         owned  Credit Used  Limit
cust_id  variable
Person_A Payment Type X      1          300    700
         Payment Type Y      3          700    800
         Payment Type Z      4          400    900
Person_B Payment Type X      2          400    600
         Payment Type Y      1          100    150
         Payment Type Z      3          400    500
Person_C Payment Type X      2          500    600
         Payment Type Y      4          700    800
         Payment Type Z      4          100    500
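As a side note that is not part of the original answer: once the columns are a two-level MultiIndex, the same reshaping can probably be done without a loop by stacking the first column level. The sketch below swaps in DataFrame.stack for the loop and assumes the sample data from the question; the frame is built by hand here, and the name stacked is illustrative. Depending on the pandas version, stack may emit a FutureWarning about its changing implementation, and the column order of the result may differ, so the columns are selected by name.

```python
import pandas as pd

# Columns in the layout described above: payment type on level 0,
# measure on level 1. Values are the sample data from the question.
cols = pd.MultiIndex.from_product(
    [["Payment Type X", "Payment Type Y", "Payment Type Z"],
     ["owned", "Credit Used", "Limit"]])
df = pd.DataFrame(
    [[1, 300, 700, 3, 700, 800, 4, 400, 900],
     [2, 400, 600, 1, 100, 150, 3, 400, 500],
     [2, 500, 600, 4, 700, 800, 4, 100, 500]],
    index=pd.Index(["Person_A", "Person_B", "Person_C"], name="cust_id"),
    columns=cols)

# Move column level 0 (the payment type) into the row index,
# name the new index level, and sort for a stable row order.
stacked = (df.stack(level=0)
             .rename_axis(index=["cust_id", "variable"])
             .sort_index())

# Select by name so the result does not depend on stack's column ordering.
print(stacked[["owned", "Credit Used", "Limit"]])
```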
The following code creates the test data and reorganizes the columns into the layout used above:
import io

import pandas as pd

# Fields are separated by two or more spaces, because several column
# names contain single spaces themselves.
raw = """\
    cust_id  Payment Type X owned  Payment Type Y owned  Payment Type Z owned  Credit Used_X  Limit_X  Credit Used_Y  Limit_Y  Credit Used_Z  Limit_Z
0  Person_A                     1                     3                     4            300      700            700      800            400      900
1  Person_B                     2                     1                     3            400      600            100      150            400      500
2  Person_C                     2                     4                     4            500      600            700      800            100      500"""
# The data rows have one more field than the header, so pandas uses the
# first field (0, 1, 2) as the row index.
df_orig = pd.read_csv(io.StringIO(raw), sep=r'\s{2,}', engine='python')
df = df_orig.set_index('cust_id')

new_cols = []
for col in df.columns:
    # Level 0: the payment type the column belongs to
    if 'X' in col:
        lv1 = 'Payment Type X'
    elif 'Y' in col:
        lv1 = 'Payment Type Y'
    elif 'Z' in col:
        lv1 = 'Payment Type Z'
    else:
        lv1 = col
    # Level 1: the measure, i.e. the column name without its type marker
    if col[-2:-1] == '_':
        lv2 = col[:-2]
    elif col.endswith(' owned'):
        lv2 = 'owned'
    else:
        lv2 = col
    new_cols.append((lv1, lv2))
df.columns = pd.MultiIndex.from_tuples(new_cols)
A more radical approach needs only a single step, like this:
# df_orig is the original wide frame, with cust_id as a regular column
flat = df_orig.melt(id_vars=['cust_id'], var_name='column')
flat['variable'] = ''
flat.loc[flat['column'].str.match(r'.*[_ ]X.*'), 'variable'] = 'Payment Type X'
flat.loc[flat['column'].str.match(r'.*[_ ]Y.*'), 'variable'] = 'Payment Type Y'
flat.loc[flat['column'].str.match(r'.*[_ ]Z.*'), 'variable'] = 'Payment Type Z'
# regex=True is needed because recent pandas versions treat the pattern as a literal string by default
flat['column'] = (flat['column'].str.replace(r'[_ ][XYZ]', '', regex=True)
                  .str.replace('Payment Type owned', 'Owned', regex=False))
flat.set_index(['cust_id', 'variable', 'column'], inplace=True)
result = flat.unstack().droplevel(0, axis='columns')
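It is more radical because it takes the original data frame apart completely in order to rebuild it. It is probably less efficient than the first approach.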