Question

Please see my previously posted question: Efficient use of Numpy to process in blocks of rows

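I got good guidance on doing this with pandas (thanks @jdehesa), but I really need to work with numpy. My main concern is the way the slices are merged back into one array, i.e.: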

dfconcat = np.concatenate((dfconcat, dfslice),axis=0)

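This seems like a super inefficient way to combine the slices, and I feel it should be possible to do it in a single step outside the loop (perhaps by adding another dimension to the dfslice array that references each AccountID)? Is my approach broadly right, or is there a better way? What I have working so far: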

import numpy as np
import pandas as pd
df = pd.DataFrame({'AccountID': [1,1,1,2,1,2,1,2,2],
                   'RefDay':    [1,2,3,1,4,2,5,3,4],
                   'BCol':      [1,2,np.nan,1,3,2,1,np.nan,2],
                   'CCol':      [3,2,3,1,3,4,5,2,1]})
df = df[['AccountID','RefDay','BCol','CCol']] #sorting out order

df['TargetCol']=np.nan
dfnum = df.to_records(index=False)
dfnum = np.sort(dfnum, order=['AccountID','RefDay']) #make sure the order is correct

uniquelist = np.unique(dfnum['AccountID'])
for u in range(0,len(uniquelist)):
    dfslice = dfnum[dfnum['AccountID'] == uniquelist[u]]
    for i in range(0,len(dfslice)):
        if (len(dfslice) - i) >= 3:
            dfslice['TargetCol'][i] = np.nansum(dfslice['BCol'][i:i+3]) / dfslice['CCol'][i]
        else:
            dfslice['TargetCol'][i] = np.nan  # np.NaN alias was removed in NumPy 2.0
    if u==0:
        dfconcat = dfslice
    else:
        dfconcat = np.concatenate((dfconcat, dfslice),axis=0)

pd.DataFrame(dfconcat)

Output:

AccountID   RefDay  BCol    CCol    TargetCol
1           1       1.0     3       1.000000
1           2       2.0     2       2.500000
1           3       NaN     3       1.333333
1           4       3.0     3       NaN
1           5       1.0     5       NaN
2           1       1.0     1       3.000000
2           2       2.0     4       1.000000
2           3       NaN     2       NaN
2           4       2.0     1       NaN

Answer 1

Disclaimer: I have no experience with pandas.

First, I think adding an extra axis definitely puts you on the right track.

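You could try creating the array up front so that it never has to be resized. Initialize the final array as an array of DataFrames of length len(uniquelist) (call it myarray, since the name dfconcat would be misleading by then); if I have this right, it is basically just an np.ndarray. This avoids the repeated resizing, each instance of which may copy the data to a location with enough contiguous memory. I think this is the biggest win you can get.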

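Once you do this, you no longer concatenate but simply assign: myarray[u] = dfslice, since you know the number of elements is right. Or write directly into the final array and skip constructing dfslice altogether.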

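Edit: I removed my code because it was incorrect: the array would have come out shorter. However, I can't work out where your code compensates for that. My apologies if this answer is a bit unclear. The important parts:
1) Go with the additional axis.
2) Create the array at its full size before filling it.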
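
A minimal sketch of point 2, under some assumptions. Because the per-account slices have different lengths (5 and 4 rows in the sample data), this version preallocates one flat structured array and fills it through a running offset, rather than assigning myarray[u] = dfslice along an extra axis. The names result and pos are hypothetical; the data is the sample from the question.

```python
import numpy as np

# Sample data from the question, with TargetCol initialised to NaN.
dtype = [('AccountID', int), ('RefDay', int), ('BCol', float),
         ('CCol', int), ('TargetCol', float)]
rows = [(1, 1, 1.0, 3, np.nan), (1, 2, 2.0, 2, np.nan), (1, 3, np.nan, 3, np.nan),
        (2, 1, 1.0, 1, np.nan), (1, 4, 3.0, 3, np.nan), (2, 2, 2.0, 4, np.nan),
        (1, 5, 1.0, 5, np.nan), (2, 3, np.nan, 2, np.nan), (2, 4, 2.0, 1, np.nan)]
dfnum = np.sort(np.array(rows, dtype=dtype), order=['AccountID', 'RefDay'])

# Allocate the full-size output once (hypothetical name), then fill it in place.
result = np.empty_like(dfnum)
pos = 0  # running write offset into result

for account in np.unique(dfnum['AccountID']):
    dfslice = dfnum[dfnum['AccountID'] == account]  # boolean mask returns a copy
    n = len(dfslice)
    for i in range(n):
        if n - i >= 3:
            dfslice['TargetCol'][i] = (np.nansum(dfslice['BCol'][i:i + 3])
                                       / dfslice['CCol'][i])
        else:
            dfslice['TargetCol'][i] = np.nan
    result[pos:pos + n] = dfslice  # plain assignment, no concatenate
    pos += n
```

The total number of rows is known in advance, so no intermediate arrays are created beyond the per-account slices themselves.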

Answer 2

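To address the concatenate issue first: it is more efficient to collect the values in a list, e.g. alist.append(...), and build the array once at the end. Repeated concatenation inside a loop is slow.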

Without using pandas, I think your dfnum array could be constructed as

In [162]: adict={'AccountID': [1,1,1,2,1,2,1,2,2],
     ...:                    'RefDay':    [1,2,3,1,4,2,5,3,4],
     ...:                    'BCol':      [1,2,np.nan,1,3,2,1,np.nan,2],
     ...:                    'CCol':      [3,2,3,1,3,4,5,2,1]}
     ...:                    
In [163]: 
In [163]: len(adict['CCol'])
Out[163]: 9

In [166]: arr = np.zeros(9, [('AccountID',int),('RefDay',int),('BCol',float),('CCol',int)])
In [167]: for n in arr.dtype.names:
     ...:     arr[n] = adict[n]
     ...:     
In [168]: arr
Out[168]: 
array([(1, 1,  1., 3), (1, 2,  2., 2), (1, 3, nan, 3), (2, 1,  1., 1),
       (1, 4,  3., 3), (2, 2,  2., 4), (1, 5,  1., 5), (2, 3, nan, 2),
       (2, 4,  2., 1)],
      dtype=[('AccountID', '<i8'), ('RefDay', '<i8'), ('BCol', '<f8'), ('CCol', '<i8')])

In [171]: dfnum = np.sort(arr, order=['AccountID','RefDay'])
In [174]: np.unique(dfnum['AccountID'], return_index=True, return_inverse=True)
Out[174]: (array([1, 2]), array([0, 5]), array([0, 0, 0, 0, 0, 1, 1, 1, 1]))
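
As a sketch of one thing the return_index output can be used for: on sorted data, the second array holds the first row of each account, so np.split can cut the array into per-account blocks without building a boolean mask per account. The variable names are illustrative, and only the sorted AccountID column is reproduced here.

```python
import numpy as np

# Sorted AccountID column, as in dfnum['AccountID'] above.
account = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2])

values, first_index, inverse = np.unique(
    account, return_index=True, return_inverse=True)

# first_index[1:] are the row positions where a new account starts.
groups = np.split(account, first_index[1:])

# inverse maps every row back to its position in `values`.
rebuilt = values[inverse]
```

The same split points could be applied to the full structured array, since np.split only needs the boundaries.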

Here is the loop using list appends.

In [182]: alist =[]; targets=[]
     ...: for u in range(0,len(uniquelist)):
     ...:     dfslice = dfnum[dfnum['AccountID'] == uniquelist[u]]
     ...:     target = np.zeros(len(dfslice))
     ...:     for i in range(0,len(dfslice)):
     ...:         if (len(dfslice) - i) >= 3:
     ...:             target[i] = np.nansum(dfslice['BCol'][i:i+3]) / dfslice['CCol'][i]
     ...:         else:
     ...:             target[i] = np.nan
     ...:     alist.append(dfslice)
     ...:     targets.append(target)
     ...:     
     ...:     
In [183]: alist
Out[183]: 
[array([(1, 1,  1., 3), (1, 2,  2., 2), (1, 3, nan, 3), (1, 4,  3., 3),
        (1, 5,  1., 5)],
       dtype=[('AccountID', '<i8'), ('RefDay', '<i8'), ('BCol', '<f8'), ('CCol', '<i8')]),
 array([(2, 1,  1., 1), (2, 2,  2., 4), (2, 3, nan, 2), (2, 4,  2., 1)],
       dtype=[('AccountID', '<i8'), ('RefDay', '<i8'), ('BCol', '<f8'), ('CCol', '<i8')])]
In [184]: targets
Out[184]: 
[array([1.        , 2.5       , 1.33333333,        nan,        nan]),
 array([ 3.,  1., nan, nan])]

I missed the step that adds the TargetCol field, so I had to improvise with a separate target array.

In [185]: np.concatenate(alist)
Out[185]: 
array([(1, 1,  1., 3), (1, 2,  2., 2), (1, 3, nan, 3), (1, 4,  3., 3),
       (1, 5,  1., 5), (2, 1,  1., 1), (2, 2,  2., 4), (2, 3, nan, 2),
       (2, 4,  2., 1)],
      dtype=[('AccountID', '<i8'), ('RefDay', '<i8'), ('BCol', '<f8'), ('CCol', '<i8')])
In [186]: np.concatenate(targets)
Out[186]: 
array([1.        , 2.5       , 1.33333333,        nan,        nan,
       3.        , 1.        ,        nan,        nan])
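
If the target values should end up as a real TargetCol field rather than a separate array, one possible sketch is to allocate a structured array with the extended dtype and copy the fields across. The names full_dtype and combined are hypothetical; the records and targets are the concatenated values shown above.

```python
import numpy as np

base_dtype = [('AccountID', int), ('RefDay', int), ('BCol', float), ('CCol', int)]
records = np.array(
    [(1, 1, 1.0, 3), (1, 2, 2.0, 2), (1, 3, np.nan, 3), (1, 4, 3.0, 3),
     (1, 5, 1.0, 5), (2, 1, 1.0, 1), (2, 2, 2.0, 4), (2, 3, np.nan, 2),
     (2, 4, 2.0, 1)],
    dtype=base_dtype)
targets = np.array([1.0, 2.5, 4.0 / 3.0, np.nan, np.nan, 3.0, 1.0, np.nan, np.nan])

# Extended dtype: the original fields plus a float TargetCol.
full_dtype = np.dtype(base_dtype + [('TargetCol', float)])
combined = np.empty(records.shape, dtype=full_dtype)

# Copy field by field, then attach the computed targets.
for name in records.dtype.names:
    combined[name] = records[name]
combined['TargetCol'] = targets
```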

I had expected that, with the extra outputs shown in Out[174], I could solve this without a loop, but I haven't worked out the details.
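
One possible loop-free version is sketched below. It does not use the Out[174] arrays; instead it swaps in a cumulative-sum trick and relies on the rows already being sorted by AccountID, so a 3-row window lies inside one account exactly when its first and last rows share an AccountID. The function name rolling_group_target and its parameters are hypothetical.

```python
import numpy as np

def rolling_group_target(account, bcol, ccol, window=3):
    """Forward-looking nansum of bcol over `window` rows, divided by ccol.

    Assumes the rows are sorted so that each account is contiguous.
    Rows whose window would cross into another account get NaN.
    """
    n = len(bcol)
    target = np.full(n, np.nan)
    if n < window:
        return target
    m = n - window + 1                               # number of full windows
    filled = np.where(np.isnan(bcol), 0.0, bcol)     # nansum treats NaN as 0
    csum = np.concatenate(([0.0], np.cumsum(filled)))
    win_sum = csum[window:] - csum[:-window]         # sum of rows i .. i+window-1
    same_account = account[:m] == account[window - 1:]
    target[:m] = np.where(same_account, win_sum / ccol[:m], np.nan)
    return target

# Sorted sample data from the question.
account = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2])
bcol = np.array([1, 2, np.nan, 3, 1, 1, 2, np.nan, 2], dtype=float)
ccol = np.array([3, 2, 3, 3, 5, 1, 4, 2, 1])

target = rolling_group_target(account, bcol, ccol)
```

On long arrays, differences of cumulative sums can accumulate floating-point error, so results may differ slightly from a per-window np.nansum.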

Efficient concatenation of NumPy slices in Python
