我有这样的csv,
Market_id Metric New_Mar_19 New_Feb_19 tot_Mar_19 tot_Feb_19
7528 1 9 7 6.5 8
7528 2 6 3.2 2.2 3
我想将上面的csv转换成下面这样,
Month Nrx_count Nrx_quantity Trx_count Trx_quantity
Mar_19 9 6 6.5 2.2
Feb_19 7 3.2 8 3
我尝试过使用数据透视和pd.melt,但我不能喜欢上面的内容。
答案 0 :(得分:3)
想法是将没有_
的列转换为MultiIndex
,然后按_
拆分列,按DataFrame.stack
进行整形,然后按DataFrame.unstack
进行重塑:
df1 = df.set_index(['Market_id','Metric'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.stack()
print (df1)
New tot
Market_id Metric
7528 1 Feb_19 7.0 8.0
Mar_19 9.0 6.5
2 Feb_19 3.2 3.0
Mar_19 6.0 2.2
#second level, here Metric is converted to columns
df1 = df1.unstack(1)
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
Market_id y New_1 New_2 tot_1 tot_2
0 7528 Feb_19 7.0 3.2 8.0 3.0
1 7528 Mar_19 9.0 6.0 6.5 2.2
#first level, here Market_id is converted to columns
df1 = df1.unstack(0)
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
Metric y New_7528 tot_7528
0 1 Feb_19 7.0 8.0
1 1 Mar_19 9.0 6.5
2 2 Feb_19 3.2 3.0
3 2 Mar_19 6.0 2.2
编辑:
如果出现错误:
ValueError:索引包含重复的条目,无法重塑
表示重复。解决方案是相同的,只需要添加聚合-例如由mean
,sum
...
#change data for duplicates in column for Multiindex - Market_id, Metric
print (df)
Market_id Metric New_Mar_19 New_Feb_19 tot_Mar_19 tot_Feb_19
0 7528 1 2 5.0 5.0 8
1 7528 1 10 20.0 2.5 16
2 7528 2 6 3.2 2.2 3
df1 = df.set_index(['Market_id','Metric'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.stack()
print (df1)
New tot
Market_id Metric
7528 1 Feb_19 5.0 8.0
Mar_19 2.0 5.0
Feb_19 20.0 16.0
Mar_19 10.0 2.5
2 Feb_19 3.2 3.0
Mar_19 6.0 2.2
#number of levels passes to groupby - here 3 levels -> 0,1,2
print (df1.index.nlevels)
3
#grouping by all levels and aggregate mean, sum,max...
df1 = df1.groupby(level=[0,1,2]).mean().unstack(1)
print (df1)
New tot
Metric 1 2 1 2
Market_id
7528 Feb_19 12.5 3.2 12.00 3.0
Mar_19 6.0 6.0 3.75 2.2
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
Market_id y New_1 New_2 tot_1 tot_2
0 7528 Feb_19 12.5 3.2 12.00 3.0
1 7528 Mar_19 6.0 6.0 3.75 2.2