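Split/expand a DataFrame based on column values

Time: 2021-03-17 15:52:35

Tags: python pandas

I have a DataFrame like the one below, with an identifier stored as a column on top of an existing date index.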

import pandas as pd

df = pd.DataFrame(
    index=pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03', '2021-01-03']),
    columns=['id', 'A', 'B'],
    data=[['foo', 1, 5], ['bar', 8, 12], ['foo', 7, 1], ['bar', 5, 1], ['foo', 4, 3], ['bar', 7, 1]],
)
df

Out[6]: 
             id  A   B
2021-01-01  foo  1   5
2021-01-01  bar  8  12
2021-01-02  foo  7   1
2021-01-02  bar  5   1
2021-01-03  foo  4   3
2021-01-03  bar  7   1

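My goal is to create a new sub-DataFrame for each column other than id (A and B), with the date index as the only index and the id values (foo, bar) as the column names. The expected output looks like this: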

A
Out[9]: 
            foo  bar
2021-01-01    1    8
2021-01-02    7    5
2021-01-03    4    7

B
Out[11]: 
            foo  bar
2021-01-01    5   12
2021-01-02    1    1
2021-01-03    3    1

5 answers:

Answer 0 (score: 20)

A, B = map(df.set_index('id', append=True).unstack().get, ['A', 'B'])

print(A)

id          bar  foo
2021-01-01    8    1
2021-01-02    5    7
2021-01-03    7    4

print(B)

id          bar  foo
2021-01-01   12    5
2021-01-02    1    1
2021-01-03    1    3
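The output above lists the columns as bar, foo because unstack sorts the new column level, while the expected output in the question has foo first. Here is a minimal sketch of one way to restore first-appearance order with reindex. It assumes the question's data, and the names wide and id_order are illustrative only.

```python
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame(
    index=pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02',
                          '2021-01-02', '2021-01-03', '2021-01-03']),
    columns=['id', 'A', 'B'],
    data=[['foo', 1, 5], ['bar', 8, 12], ['foo', 7, 1],
          ['bar', 5, 1], ['foo', 4, 3], ['bar', 7, 1]],
)

# Move id into the index, then pivot it out into the columns.
wide = df.set_index('id', append=True).unstack()

# unstack sorts the id level, so reorder the columns by first appearance.
id_order = df['id'].unique()
A, B = (wide[col].reindex(columns=id_order) for col in ['A', 'B'])

print(A)
print(B)
```

If the alphabetical order is acceptable, the reindex step can be dropped.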

Answer 1 (score: 10)

This is straightforward:

out = df.set_index('id',append=True).unstack('id')
# if you have columns other than `A`,`B`:
# out = df.set_index('id',append=True)[['A','B']].unstack('id')

Then you can do

out['A']

which gives:

id          bar  foo
2021-01-01    8    1
2021-01-02    5    7
2021-01-03    7    4

out['B'] works the same way. I find this easier and less error-prone than hard-coding the variables as A, B.
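To take the "no hard-coded variables" idea one step further, the sub-frames can be collected into a dict keyed by column name. This is only a sketch under the question's data, and the name frames is an assumption, not part of the answer.

```python
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame(
    index=pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02',
                          '2021-01-02', '2021-01-03', '2021-01-03']),
    columns=['id', 'A', 'B'],
    data=[['foo', 1, 5], ['bar', 8, 12], ['foo', 7, 1],
          ['bar', 5, 1], ['foo', 4, 3], ['bar', 7, 1]],
)

out = df.set_index('id', append=True).unstack('id')

# The top level of out.columns holds the original column names (A, B),
# so the sub-frames can be discovered from the data instead of being listed.
frames = {col: out[col] for col in out.columns.get_level_values(0).unique()}

for name, frame in frames.items():
    print(name)
    print(frame)
```

With this layout, adding a column C to df yields frames['C'] without touching the code.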

Answer 2 (score: 6)

Edit: combining @piRSquared's nice idea of using map with pivot:

In [58]: A, B = map(lambda column: df[['id', column]].pivot(columns='id', values=column), ['A', 'B'])

In [59]: A
Out[59]: 
id          bar  foo
date                
2021-01-01    8    1
2021-01-02    5    7
2021-01-03    7    4

In [60]: B
Out[60]: 
id          bar  foo
date                
2021-01-01   12    5
2021-01-02    1    1
2021-01-03    1    3

Answer 3 (score: 6)

Try

out = df.set_index('id',append=True).stack().unstack('id').swaplevel(0,1)
A = out.loc['A',:]
A
Out[325]: 
id          bar  foo
2021-01-01    8    1
2021-01-02    5    7
2021-01-03    7    4

d = {x : df[[x,'id']].pivot(columns='id',values=x)  for x in ['A','B']}
d['A']
Out[336]: 
id          bar  foo
2021-01-01    8    1
2021-01-02    5    7
2021-01-03    7    4

df.pivot(columns='id').loc[:,'A']
Out[340]: 
id          bar  foo
2021-01-01    8    1
2021-01-02    5    7
2021-01-03    7    4

Answer 4 (score: 4)

You can use pandas.DataFrame.xs to select the foo and bar values after calling set_index on id and then swaplevel:

>>> A, B = map(df.set_index('id', append=True).swaplevel(0,1).xs, ['foo', 'bar'])
>>> A
            A  B
2021-01-01  1  5
2021-01-02  7  1
2021-01-03  4  3

>>> B
            A   B
2021-01-01  8  12
2021-01-02  5   1
2021-01-03  7   1

Alternatively, passing the level argument to xs saves the swaplevel call:

>>> A, B = (df.set_index('id', append=True).xs(ID, level=1) for ID in ['foo', 'bar'])

# This can be made more readable by creating a `partial` function:

>>> from functools import partial
>>> def get_by_ID(df, level, col='id'):
...     func = partial(df.set_index(col, append=True).xs, level=level)
...     return func
>>> A, B = map(get_by_ID(df=df, level=1), ['foo', 'bar'])

Or, simply:

>>> A, B = (df.loc[df.id == ID, ['A', 'B']] for ID in df.id.unique())

Performance

>>> %timeit A, B = map(df.set_index('id', append=True).unstack().get, ['A', 'B'])
2.22 ms ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit A, B =  map(df.set_index('id', append=True).swaplevel(0,1).xs, ['foo', 'bar'])
1.88 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit A, B = map(get_by_ID(df=df, level=1), ['foo', 'bar'])
1.73 ms ± 54.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit A, B = (df.loc[df.id == ID, ['A', 'B']] for ID in df.id.unique())
1.69 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
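The %timeit magic only works inside IPython. As a rough sketch, similar measurements can be taken in a plain script with the standard timeit module. The helper names via_unstack and via_pivot are hypothetical. The two helpers produce the same per-column frames, whereas the xs and loc variants timed above return one frame per id, so those figures are not a like-for-like comparison. Timings on a six-row frame are dominated by overhead and will differ between machines and pandas versions.

```python
import timeit

import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame(
    index=pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02',
                          '2021-01-02', '2021-01-03', '2021-01-03']),
    columns=['id', 'A', 'B'],
    data=[['foo', 1, 5], ['bar', 8, 12], ['foo', 7, 1],
          ['bar', 5, 1], ['foo', 4, 3], ['bar', 7, 1]],
)


def via_unstack():
    # set_index + unstack, as in the first answers.
    wide = df.set_index('id', append=True).unstack()
    return wide['A'], wide['B']


def via_pivot():
    # pivot on the existing index, as in the pivot-based answers.
    wide = df.pivot(columns='id')
    return wide['A'], wide['B']


number = 50  # calls per measurement
timings = {
    func.__name__: min(timeit.repeat(func, number=number, repeat=3)) / number
    for func in (via_unstack, via_pivot)
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum of several repeats is a common way to reduce noise from other processes. It does not make the numbers comparable across machines.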