Question

My data is in a database, normalized as:

date1,key1,val1
date1,key2,val2
date1,keyN,valN
...

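I want to load the data into a 2-D numpy array with a given shape (D x K) that is known in advance. D is the number of dates in a known list of dates, and K is the number of columns in a known list of keys. Note that K may be larger or smaller than the number of keys actually returned from the database, depending on the query, and the columns may need a different order from the default column order returned by DataFrame.unstack(). Is there a simple way to achieve this?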

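Here is a simple example of what I want to achieve: df is 3x3 (DxK) and is given. The queried data is in df2 (it could be any rows in the database; in the example below it is only 3 rows). If I just unstack df2, I get a 2x2 DataFrame (matrix) (2 dates and 2 keys), which is not what I want. I want to pivot/unstack df2 so that it has the same index and columns as df, i.e. 3x3, with missing entries filled with NaN.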

from datetime import date
import numpy as np
import pandas as pd

np.random.seed(456)
df = pd.DataFrame(np.random.randn(3,3), columns=['b', 'a', 'c'], 
    index=[date(2018,1,1), date(2018,1,2), date(2018,1,3)]) 
# df2 = pd.read_sql_query(sql, cn)
# for illustration, it is assumed to have 3 rows as below
df2 = pd.DataFrame({'val':[1,2,3]}, 
    index=pd.MultiIndex.from_arrays(
    [[date(2018,1,1), date(2018,1,2), date(2018,1,2)], 
    ['a','a','c']], names=['date','key']))

print (df)
               b         a         c
2018-01-01 -0.668129 -0.498210  0.618576
2018-01-02  0.568692  1.350509  1.629589
2018-01-03  0.301966  0.449483 -0.345811

print (df2)
                val
date       key
2018-01-01 a      1
2018-01-02 a      2
           c      3

# this only generates a 2x2 2-d array;
# I want a 3x3 one, with the same index/columns (and order) as in df.
df2['val'].unstack('key')  
key           a    c
date
2018-01-01  1.0  NaN
2018-01-02  2.0  3.0

# pivot() does not give the expected result either.
pd.pivot(index=df.index, columns=df.columns, values=df2['val'])
              a    b    c
2018-01-01  NaN  1.0  NaN
2018-01-02  2.0  NaN  NaN
2018-01-03  NaN  NaN  3.0

## expected result
              b    a    c
2018-01-01  NaN  1.0  NaN
2018-01-02  NaN  2.0  3.0
2018-01-03  NaN  NaN  NaN
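
One possible way to get the expected frame, offered as a minimal sketch rather than a definitive answer: unstack whatever the query returned, then conform the result to the known axes with reindex. The names dates, keys, wide and arr are introduced here for illustration only; dates and keys stand in for df.index and df.columns. This assumes each (date, key) pair appears at most once, since unstack raises an error on duplicate entries.

```python
from datetime import date

import numpy as np
import pandas as pd

# Known target axes (D dates, K keys), in the required order.
# These play the role of df.index and df.columns in the question.
dates = [date(2018, 1, 1), date(2018, 1, 2), date(2018, 1, 3)]
keys = ['b', 'a', 'c']

# Stand-in for the rows returned by the query (same as df2 above).
df2 = pd.DataFrame(
    {'val': [1, 2, 3]},
    index=pd.MultiIndex.from_arrays(
        [[date(2018, 1, 1), date(2018, 1, 2), date(2018, 1, 2)],
         ['a', 'a', 'c']],
        names=['date', 'key']))

# Unstack what was returned, then align to the known axes.
# Labels missing from the query become NaN rows/columns;
# labels not listed in dates/keys are dropped.
wide = df2['val'].unstack('key').reindex(index=dates, columns=keys)

# Plain 2-d float array of shape (D, K).
arr = wide.to_numpy()
print(wide)
```

Here reindex both reorders the columns to b, a, c and adds the all-NaN row for 2018-01-03, so arr has shape (3, 3).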

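Coming from matlab, one can easily create a NaN matrix (DxK), use sub2ind to index the dates and keys, and assign the values into the matrix. (If there are duplicates, it uses the last value of the duplicates.) In addition, one can easily handle the aggregation part using accumarray with sum / mean / nansum.... I am converting matlab code to numpy and trying to find a more efficient, idiomatic way to handle this situation. In real life, D > 5000 and K > 5000, so efficiency is another important factor.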
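
A rough numpy analogue of the sub2ind / accumarray workflow described above might look like the following. It is a sketch under stated assumptions, not a benchmarked solution: the frame named long and its rows (including the duplicate pair and the unknown key 'z') are made up for illustration, and the query result is assumed to arrive as flat date, key, val columns. Index.get_indexer maps labels to positions, np.ravel_multi_index plays the role of sub2ind, and np.bincount with weights stands in for accumarray with sum.

```python
from datetime import date

import numpy as np
import pandas as pd

# Known axes, in the required order (assumed unique).
dates = pd.Index([date(2018, 1, 1), date(2018, 1, 2), date(2018, 1, 3)])
keys = pd.Index(['b', 'a', 'c'])
shape = (len(dates), len(keys))

# Hypothetical long-format query result: (2018-01-02, 'c') appears twice
# and 'z' is a key that is not in the known list.
long = pd.DataFrame({
    'date': [date(2018, 1, 1), date(2018, 1, 2), date(2018, 1, 2),
             date(2018, 1, 2), date(2018, 1, 3)],
    'key': ['a', 'a', 'c', 'c', 'z'],
    'val': [1.0, 2.0, 3.0, 4.0, 9.0],
})

# Map labels to integer positions; get_indexer returns -1 for unknown labels.
rows = dates.get_indexer(long['date'])
cols = keys.get_indexer(long['key'])
ok = (rows >= 0) & (cols >= 0)

# Flat (row-major) positions, similar in spirit to matlab's sub2ind.
flat = np.ravel_multi_index((rows[ok], cols[ok]), shape)
vals = long['val'].to_numpy(dtype=float)[ok]

# "Last value wins": pick the last occurrence of each position explicitly,
# because numpy does not guarantee the order of repeated fancy assignments.
uniq, first_in_reversed = np.unique(flat[::-1], return_index=True)
last_wins = np.full(shape[0] * shape[1], np.nan)
last_wins[uniq] = vals[::-1][first_in_reversed]
last_wins = last_wins.reshape(shape)

# Aggregation, similar in spirit to accumarray with sum / mean.
size = shape[0] * shape[1]
counts = np.bincount(flat, minlength=size).reshape(shape)
sums = np.bincount(flat, weights=vals, minlength=size).reshape(shape)
total = np.where(counts > 0, sums, np.nan)  # NaN where no data arrived
with np.errstate(invalid='ignore'):
    mean = sums / counts                    # 0/0 gives NaN for empty cells

print(last_wins)
print(total)
```

Everything after the label lookup works on integer arrays, so the cost scales with the number of returned rows plus the D x K output; whether that is fast enough for D > 5000 and K > 5000 would need to be measured on real data.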

pandas DataFrame unstack with given columns

0 answers: