pandas.DataFrame: How to align/group and sort data by index?

Time: 2018-02-07 13:23:38

Tags: python pandas sorting dataframe

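I am new to pandas and still don't have a good overview of its power and how to use it. So the question is simple :)

I have a DataFrame with a date index and several columns (stocks with their Open and Close prices). Here is some example data for two stocks, A and B: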

import pandas as pd
_ = pd.to_datetime
A_dt = [_('2018-01-04'), _('2018-01-01'), _('2018-01-05')]
B_dt = [_('2018-01-01'), _('2018-01-05'), _('2018-01-03'), _('2018-01-02')]
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]

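As you can see, the data is incomplete, and each series is missing different dates. I want to put this data into a single DataFrame with a sorted row index dt and 4 columns (2 stocks x 2 time series per stock).

When I do it this way, everything works fine (except that I would like to change the order of the column levels and don't know how to do it):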

# MultiIndex on axis 0, then unstacking
i0_a = pd.MultiIndex.from_tuples([("A", x) for x in A_dt], names=['symbol', 'dt'])
i0_b = pd.MultiIndex.from_tuples([("B", x) for x in B_dt], names=['symbol', 'dt'])

df0_a = pd.DataFrame(A_data, index=i0_a, columns=["Open", "Close"])
df0_b = pd.DataFrame(B_data, index=i0_b, columns=["Open", "Close"])

df = pd.concat([df0_a, df0_b])

df = df.unstack('symbol')  # this automatically sorts by dt.
print(df)

#            Open      Close
#symbol         A    B     A    B
#dt
#2018-01-01  10.0  2.0   9.0  2.0
#2018-01-02   NaN  5.0   NaN  3.0
#2018-01-03   NaN  4.0   NaN  4.0
#2018-01-04  12.0  NaN  11.0  NaN
#2018-01-05   8.0  3.0   9.0  4.0
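
As for changing the column levels, a minimal sketch of one possible way is shown here. It rebuilds the frame from the sample data above and assumes that "changing the column levels" means putting symbol on top of the Open/Close level. The names wide, by_symbol, and the level name 'series' are chosen for illustration only.

```python
import pandas as pd

# Sample data from the question.
A_dt = pd.to_datetime(['2018-01-04', '2018-01-01', '2018-01-05'])
B_dt = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-03', '2018-01-02'])
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]

# MultiIndex (symbol, dt) on the rows, as in the first approach.
i0_a = pd.MultiIndex.from_product([["A"], A_dt], names=['symbol', 'dt'])
i0_b = pd.MultiIndex.from_product([["B"], B_dt], names=['symbol', 'dt'])
df0_a = pd.DataFrame(A_data, index=i0_a, columns=["Open", "Close"])
df0_b = pd.DataFrame(B_data, index=i0_b, columns=["Open", "Close"])

wide = pd.concat([df0_a, df0_b]).unstack('symbol')

# The Open/Close level has no name after unstacking; naming it is optional.
wide.columns = wide.columns.set_names(['series', 'symbol'])

# Swap the two column levels so that symbol comes first, then sort the
# columns so that each symbol's series sit next to each other.
by_symbol = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(by_symbol)
```

After the swap, a single stock can be selected with by_symbol['A'], which returns its Open and Close columns.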

However, when I put the MultiIndex on the columns, things are different:

# MultiIndex on axis 1
i1_a = pd.MultiIndex.from_tuples([("A", "Open"), ("A", "Close")], names=['symbol', 'series'])
i1_b = pd.MultiIndex.from_tuples([("B", "Open"), ("B", "Close")], names=['symbol', 'series'])

df1_a = pd.DataFrame(A_data, index=A_dt, columns=i1_a)
df1_b = pd.DataFrame(B_data, index=B_dt, columns=i1_b)

df = pd.concat([df1_a, df1_b])

print(df)

#symbol         A           B
#series     Close  Open Close Open
#2018-01-04  11.0  12.0   NaN  NaN
#2018-01-01   9.0  10.0   NaN  NaN
#2018-01-05   9.0   8.0   NaN  NaN
#2018-01-01   NaN   NaN   2.0  2.0
#2018-01-05   NaN   NaN   4.0  3.0
#2018-01-03   NaN   NaN   4.0  4.0
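#2018-01-02   NaN   NaN   3.0  5.0

  1. Why is the data not aligned automatically in this case, when it is in the other one?
  2. How can I align and sort it in the second example?
  3. Which method is likely to be faster on a large dataset (around 5000 stocks, 1000 time steps, and not just 2 series per stock (Open, Close) but around 20)? This will eventually be used as input for a keras machine learning model.

Edit: Following jezrael's answer, I timed 3 different methods of concatenating/combining the DataFrames. My first approach is the fastest. Using combine_first is more than an order of magnitude slower than the other methods. The data in this example is still very small: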

    import timeit
    setup = """
    import pandas as pd
    import numpy as np
    
    stocks = 20
    steps = 20
    features = 10
    
    data = []
    index_method1 = []
    index_method2 = []
    cols_method1 = []
    cols_method2 = []
    
    df = None
    for s in range(stocks):
        name = "stock{0}".format(s)
        index = np.arange(steps)
        data.append(np.random.rand(steps, features))
        index_method1.append(pd.MultiIndex.from_tuples([(name, x) for x in index], names=['symbol', 'dt']))
        index_method2.append(index)
        cols_method1.append([chr(65 + x) for x in range(features)])
        cols_method2.append(pd.MultiIndex.from_arrays([[name] * features, [chr(65 + x) for x in range(features)]], names=['symbol', 'series']))
    """
    
    method1 = """
    for s in range(stocks):
        df_new = pd.DataFrame(data[s], index=index_method1[s], columns=cols_method1[s])
        if s == 0:
            df = df_new
        else:
            df = pd.concat([df, df_new])
    df = df.unstack('symbol')
    """
    
    method2 = """
    for s in range(stocks):
        df_new = pd.DataFrame(data[s], index=index_method2[s], columns=cols_method2[s])
        if s == 0:
            df = df_new
        else:
            df = df.combine_first(df_new)
    """
    
    method3 = """
    for s in range(stocks):
        df_new = pd.DataFrame(data[s], index=index_method2[s], columns=cols_method2[s])
        if s == 0:
            df = df_new.stack()
        else:
            df = pd.concat([df, df_new.stack()], axis=1)
    
    df = df.unstack().swaplevel(0,1, axis=1).sort_index(axis=1)
    """
    
    print ("Multi-Index axis 0, then concat: {} s".format((timeit.timeit(method1, setup, number=1))))
    print ("Multi-Index axis 1, combine_first: {} s".format((timeit.timeit(method2, setup, number=1))))
    print ("Stack and then concat: {} s".format((timeit.timeit(method3, setup, number=1))))
    
    Multi-Index axis 0, then concat: 0.134283173989 s
    Multi-Index axis 1, combine_first: 5.02396191049 s
    Stack and then concat: 0.272278263371 s
    
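All three timed methods grow the result inside a loop. A possible further variant, not timed here and offered only as a sketch, collects the per-stock frames first and calls pd.concat once with axis=1, so that the dict keys become the outer column level. The names frames and wide are made up for illustration, and the random data mirrors the setup above.

```python
import numpy as np
import pandas as pd

stocks, steps, features = 20, 20, 10
cols = [chr(65 + x) for x in range(features)]
rng = np.random.default_rng(0)

# One plain DataFrame per stock, keyed by the stock name.
frames = {
    "stock{0}".format(s): pd.DataFrame(
        rng.random((steps, features)), index=np.arange(steps), columns=cols
    )
    for s in range(stocks)
}

# A single concat along the columns aligns all frames on the row index.
# The dict keys form the outer column level; `names` labels both levels.
# sort_index() makes the row order explicit instead of relying on concat.
wide = pd.concat(frames, axis=1, names=['symbol', 'series']).sort_index()
print(wide.shape)  # (20, 200)
```

Whether this is actually faster than the first method on 5000 stocks would have to be measured. The idea is only that a single concat avoids copying the growing result on every iteration.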

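1 Answer:

Answer 0 (score: 1)

The problem is that the two DataFrames have different MultiIndex values in their columns, so there is nothing to align: pd.concat with the default axis=0 appends rows and matches only on column labels, and here no column label is shared.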
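
To make the difference concrete, here is a minimal sketch that is not part of the original answer. It rebuilds df1_a and df1_b from the question and compares the default row-wise concat with a concat along axis=1, which aligns on the row index instead. The names stacked_rows and aligned are illustrative.

```python
import pandas as pd

# Sample data and column MultiIndex from the question.
A_dt = pd.to_datetime(['2018-01-04', '2018-01-01', '2018-01-05'])
B_dt = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-03', '2018-01-02'])
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]

i1_a = pd.MultiIndex.from_product([["A"], ["Open", "Close"]], names=['symbol', 'series'])
i1_b = pd.MultiIndex.from_product([["B"], ["Open", "Close"]], names=['symbol', 'series'])
df1_a = pd.DataFrame(A_data, index=A_dt, columns=i1_a)
df1_b = pd.DataFrame(B_data, index=B_dt, columns=i1_b)

# Default axis=0: rows are appended, so dates repeat (3 + 4 = 7 rows).
stacked_rows = pd.concat([df1_a, df1_b])

# axis=1: frames are placed side by side and aligned on the dates.
# sort_index() is called explicitly so the row order does not depend
# on how concat orders the union of the two indexes.
aligned = pd.concat([df1_a, df1_b], axis=1).sort_index()

print(stacked_rows.shape)  # (7, 4)
print(aligned.shape)       # (5, 4)
```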

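One solution is to stack each DataFrame so that the series level moves into the row index, concat the results along axis=1 into a 2-column DataFrame, then unstack, and add swaplevel and sort_index to get the MultiIndex in the correct order: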

df = (pd.concat([df1_a.stack(), df1_b.stack()], axis=1)
        .unstack()
        .swaplevel(0,1, axis=1)
        .sort_index(axis=1))
print (df)
series     Close       Open     
symbol         A    B     A    B
2018-01-01   9.0  2.0  10.0  2.0
2018-01-02   NaN  3.0   NaN  5.0
2018-01-03   NaN  4.0   NaN  4.0
2018-01-04  11.0  NaN  12.0  NaN
2018-01-05   9.0  4.0   8.0  3.0

But it is better to use combine_first:

df = df1_a.combine_first(df1_b)
print (df)
symbol         A           B     
series     Close  Open Close Open
2018-01-01   9.0  10.0   2.0  2.0
2018-01-02   NaN   NaN   3.0  5.0
2018-01-03   NaN   NaN   4.0  4.0
2018-01-04  11.0  12.0   NaN  NaN
2018-01-05   9.0   8.0   4.0  3.0