Given the following pandas DataFrame, defined as:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3], 're_foo':[1,2,3], 're_bar':[4,5,6], 're_foo_baz':[0.4, 0.8, .9], 're_bar_baz':[.4,.5,.6], 'iteration':[1,2,3]})
display(df)
I would like to reshape it into the following format:
id, metric_kind, foo , bar , iteration
1, regular , 1 , 4 , 1
1, baz , 0.4 , 0.4 , 1
...
From "pandas reshape multiple columns fails with KeyError" I learned that:
df.set_index(['id','iteration']).stack()#.reset_index().rename(columns={'level_2':'metric', 0: 'value'})
produces a separate tuple for every value, but I want to keep the values of both metrics together in the same row.
My own attempt (column names as in the df defined above):
dx = df[['id', 're_foo', 're_bar', 'iteration']].copy()
dx['kind'] = 'regular'
dx = pd.concat([dx, df[['id', 're_foo_baz', 're_bar_baz', 'iteration']]], axis=0)
dx['kind'] = dx['kind'].fillna('baz')
dx.loc[dx.re_foo.isnull(), 're_foo'] = dx.re_foo_baz
# now fill other NULL values
fails with:
ValueError: cannot reindex from a duplicate axis
I then came across a neater fillna:
dx.re_foo = dx.re_foo.fillna(dx.re_foo_baz)
dx.re_bar = dx.re_bar.fillna(dx.re_bar_baz)
dx = dx.drop(['re_foo_baz', 're_bar_baz'], axis=1)
This gets the job done, but it feels clumsy. Is there a better way?
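On the ValueError above: pd.concat keeps each piece's original row labels by default, so the combined frame carries the labels 0, 1, 2 twice, and the .loc assignment cannot align the right-hand Series against duplicate labels. The following is a minimal sketch, assuming the df defined at the top; the names regular, baz, dup and mask are introduced here only for illustration. Passing ignore_index=True gives every row a unique label, after which the assignment goes through.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 're_foo': [1, 2, 3], 're_bar': [4, 5, 6],
                   're_foo_baz': [0.4, 0.8, 0.9], 're_bar_baz': [0.4, 0.5, 0.6],
                   'iteration': [1, 2, 3]})

# Two pieces, each labelled with its kind (illustrative names).
regular = df[['id', 're_foo', 're_bar', 'iteration']].assign(kind='regular')
baz = df[['id', 're_foo_baz', 're_bar_baz', 'iteration']].assign(kind='baz')

# Default concat keeps the original labels, so they are duplicated.
dup = pd.concat([regular, baz])
print(dup.index.is_unique)  # False: labels 0, 1, 2 each appear twice

# ignore_index=True relabels the rows 0..5, so alignment is unambiguous.
dx = pd.concat([regular, baz], ignore_index=True)
mask = dx['re_foo'].isna()
dx.loc[mask, 're_foo'] = dx.loc[mask, 're_foo_baz']
dx.loc[mask, 're_bar'] = dx.loc[mask, 're_bar_baz']
dx = dx.drop(columns=['re_foo_baz', 're_bar_baz'])
print(dx)
```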
Answer 0 (score: 1)
Here is one approach that uses pd.wide_to_long, followed by stack and unstack to swap the axes.
# Prefix each column with its underscore count so the two groups can be
# told apart: re_foo -> 1_foo, re_foo_baz -> 2_foo
dct = {col: f'{cnt}_{col.split("_")[1]}'
       for col, cnt in zip(df.columns, df.columns.str.count('_'))
       if cnt > 0}
# Keep the original df untouched so it can be reused later
df_renamed = df.rename(columns=dct)
dfn = pd.wide_to_long(df_renamed,
                      stubnames=['1_', '2_'],
                      i=['id', 'iteration'],
                      j='metric_kind',
                      suffix='[A-Za-z]+')
dfn = (
    dfn.stack()
       .unstack('metric_kind')
       .reset_index()
       .rename(columns={'level_2': 'metric_kind'})
       .rename_axis(None, axis=1)
)
dfn['metric_kind'] = dfn['metric_kind'].map({'1_': 'regular', '2_': 'baz'})
Output:
   id  iteration metric_kind  bar  foo
0   1          1     regular  4.0  1.0
1   1          1         baz  0.4  0.4
2   2          2     regular  5.0  2.0
3   2          2         baz  0.5  0.8
4   3          3     regular  6.0  3.0
5   3          3         baz  0.6  0.9
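As a variation on the wide_to_long idea, the columns can instead be renamed so that the metric is the stub and the kind is the suffix, which avoids the stack/unstack step. This is only a sketch under the assumption that every metric column starts with re_ and that _baz is the only kind suffix; the helper to_stub_suffix is a hypothetical name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 're_foo': [1, 2, 3], 're_bar': [4, 5, 6],
                   're_foo_baz': [0.4, 0.8, 0.9], 're_bar_baz': [0.4, 0.5, 0.6],
                   'iteration': [1, 2, 3]})


def to_stub_suffix(col):
    """Hypothetical helper: re_foo -> foo_regular, re_foo_baz -> foo_baz."""
    if not col.startswith('re_'):
        return col  # id and iteration stay as they are
    name = col[len('re_'):]
    return name if name.endswith('_baz') else f'{name}_regular'


wide = df.rename(columns=to_stub_suffix)

# Stub = metric name, suffix = kind, separated by an underscore.
out = (
    pd.wide_to_long(wide,
                    stubnames=['foo', 'bar'],
                    i=['id', 'iteration'],
                    j='metric_kind',
                    sep='_',
                    suffix=r'\w+')
      .reset_index()
)
print(out)
```

The result has one row per (id, iteration, metric_kind) with foo and bar as ordinary columns, which is the layout asked for in the question.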
Alternatively, using DataFrame.filter and pd.concat:
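import numpy as np

d1 = df.filter(regex='id|_foo$|_bar$|iteration')
d2 = df.filter(regex='id|_baz$|iteration').rename(columns=lambda x: x.replace('_baz', ''))
# A stable sort keeps each id's rows in concat order (regular, then baz),
# which the alternating labels below rely on.
dfn = pd.concat([d1, d2]).sort_values('id', kind='stable').reset_index(drop=True)
dfn['metric_kind'] = np.resize(['regular', 'baz'], len(dfn))
print(dfn)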
   id  re_foo  re_bar  iteration metric_kind
0   1     1.0     4.0          1     regular
1   1     0.4     0.4          1         baz
2   2     2.0     5.0          2     regular
3   2     0.8     0.5          2         baz
4   3     3.0     6.0          3     regular
5   3     0.9     0.6          3         baz
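The filter and concat approach labels rows by position, so it depends on the row order after sorting. A possible variation is to attach the label to each piece before concatenating, so the label travels with its rows. This sketch assumes the original df from the question; the names regular and baz and the tightened regular expressions are illustrative choices, not the answer's own.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 're_foo': [1, 2, 3], 're_bar': [4, 5, 6],
                   're_foo_baz': [0.4, 0.8, 0.9], 're_bar_baz': [0.4, 0.5, 0.6],
                   'iteration': [1, 2, 3]})

# Columns ending in _foo or _bar are the regular metrics.
regular = (
    df.filter(regex=r'^id$|^iteration$|_(foo|bar)$')
      .assign(metric_kind='regular')
)

# Columns ending in _baz are renamed to match, then labelled.
baz = (
    df.filter(regex=r'^id$|^iteration$|_baz$')
      .rename(columns=lambda c: c.replace('_baz', ''))
      .assign(metric_kind='baz')
)

# The labels are already attached, so the sort only affects presentation.
out = (
    pd.concat([regular, baz], ignore_index=True)
      .sort_values('id', kind='stable')
      .reset_index(drop=True)
)
print(out)
```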
Answer 1 (score: 1)
My approach is to extract the relevant parts of the column names and then stack:
s = df.set_index(['id', 'iteration'])
# Split each column name into (base, kind); an empty kind becomes 'regular'
s.columns = pd.MultiIndex.from_frame(
    s.columns
     .str.extract('([^_]*_[^_]*)_?([^_]*)')
     .replace('', 'regular')
)
s.stack(1).reset_index()
Output:
0  id  iteration        1  re_bar  re_foo
0   1          1      baz     0.4     0.4
1   1          1  regular     4.0     1.0
2   2          2      baz     0.5     0.8
3   2          2  regular     5.0     3.0
4   3          3      baz     0.6     0.9
5   3          3  regular     6.0     3.0