Get the cumulative count per 2d array

Time: 2018-12-04 14:53:39

Tags: python arrays numpy counter cumulative-sum

I have some general data, for example strings:


I need a cumulative count that resets wherever consecutive values differ, so pandas is used.

First, the sample data:

np.random.seed(343)

arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
print (arr)
[['0' '1' '1' '2' '2' '3' '3' '4' '4' '4']
 ['1' '2' '2' '2' '3' '3' '3' '4' '4' '4']
 ['0' '2' '2' '2' '2' '3' '3' '4' '4' '4']
 ['0' '1' '2' '2' '3' '3' '3' '4' '4' '4']
 ['0' '1' '1' '1' '2' '2' '2' '2' '4' '4']
 ['0' '0' '1' '1' '2' '3' '3' '3' '4' '4']
 ['0' '0' '2' '2' '2' '2' '2' '2' '3' '4']
 ['0' '0' '1' '1' '1' '2' '2' '2' '3' '3']
 ['0' '1' '1' '2' '2' '2' '3' '4' '4' '4']
 ['0' '1' '1' '2' '2' '2' '2' '2' '4' '4']]

Create the DataFrame:

df = pd.DataFrame(arr)
print (df)
   0  1  2  3  4  5  6  7  8  9
0  0  1  1  2  2  3  3  4  4  4
1  1  2  2  2  3  3  3  4  4  4
2  0  2  2  2  2  3  3  4  4  4
3  0  1  2  2  3  3  3  4  4  4
4  0  1  1  1  2  2  2  2  4  4
5  0  0  1  1  2  3  3  3  4  4
6  0  0  2  2  2  2  2  2  3  4
7  0  0  1  1  1  2  2  2  3  3
8  0  1  1  2  2  2  3  4  4  4
9  0  1  1  2  2  2  2  2  4  4

Here is how it works for one column. First compare against the shifted data and take the cumulative sum:

a = (df[0] != df[0].shift()).cumsum()
print (a)
0    1
1    2
2    3
3    3
4    3
5    3
6    3
7    3
8    3
9    3
Name: 0, dtype: int32

Then call GroupBy.cumcount:

b = a.groupby(a).cumcount() + 1
print (b)
0    1
1    1
2    1
3    2
4    3
5    4
6    5
7    6
8    7
9    8
dtype: int64

To apply the solution to all columns, apply can be used:

df1 = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)

But it is slow for big data. Is it possible to create some fast numpy solution?

I found solutions that work only for 1d arrays.

3 Answers:

Answer 0 (score: 8):

General idea

Consider the generic case where we perform this cumulative counting, or, if you think of the results as ranges, grouped ranges.

The idea starts off simple: compare one-off slices along the respective axis to look for inequalities, and pad with True at the start of each row/column (depending on the axis of counting). Those True values mark the start of each group.

Then it gets tricky: set up an ID array such that a final cumsum, taken on the flattened version, yields the desired output in flattened order. The setup starts by initializing an array of ones with the same shape as the input. At each group start in the input, offset the ID array by the previous group length. Follow the code (it should give more insight) to see how this is done for each row -

import numpy as np

def grp_range_2drow(a, start=0):
    # Get grouped ranges along each row, resetting at places where
    # consecutive elements differ

    # Input(s) : a is a 2D input array

    # Store shape info
    m,n = a.shape

    # Compare one-off slices for each row and pad with True's at starts
    # Those True's indicate the start of each group
    p = np.ones((m,1),dtype=bool)
    a1 = np.concatenate((p, a[:,:-1] != a[:,1:]),axis=1)

    # Get indices of group starts in the flattened version
    d = np.flatnonzero(a1)

    # Setup ID array to be cumsumed finally for the desired o/p.
    # Assign into starts with previous group lengths, so that a cumsum on
    # the flattened version gives the flattened desired output.
    # Finally reshape back to 2D
    c = np.ones(m*n,dtype=int)
    c[d[1:]] = d[:-1]-d[1:]+1
    c[0] = start
    return c.cumsum().reshape(m,n)
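To see why the offsets work, here is a worked trace of the ID-array trick on a single flattened row (a tiny hypothetical example, not from the original post):

```python
import numpy as np

# Hypothetical 1D trace of the ID-array trick used above.
row = np.array([7, 7, 7, 3, 3, 9])           # groups: [7,7,7], [3,3], [9]
starts = np.r_[True, row[:-1] != row[1:]]    # True at each group start
d = np.flatnonzero(starts)                   # group-start indices: [0, 3, 5]

c = np.ones(row.size, dtype=int)
c[d[1:]] = d[:-1] - d[1:] + 1                # offset each start by the previous group length
c[0] = 1                                     # begin counting at 1
print(c.cumsum())                            # -> [1 2 3 1 2 1]
```

At each group start the offset cancels exactly the count accumulated by the previous group, so the running sum falls back to 1.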

We extend this to solve the generic case of rows and columns. For the columns case we simply transpose, feed into the earlier row solution, and finally transpose back, like so -

def grp_range_2d(a, start=0, axis=1):
    # Get grouped ranges along specified axis with resetting at places where
    # consecutive elements differ

    # Input(s) : a is 2D input array

    if axis not in [0,1]:
        raise Exception("Invalid axis")

    if axis==1:
        return grp_range_2drow(a, start=start)
    else:
        return grp_range_2drow(a.T, start=start).T

Sample run

Consider a sample run that finds grouped ranges along each column, each group starting from 1 -

In [330]: np.random.seed(0)

In [331]: a = np.random.randint(1,3,(10,10))

In [333]: a
Out[333]: 
array([[1, 2, 2, 1, 2, 2, 2, 2, 2, 2],
       [2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
       [1, 2, 2, 1, 1, 2, 2, 2, 2, 1],
       [2, 1, 2, 1, 2, 2, 1, 2, 2, 1],
       [1, 2, 1, 2, 2, 2, 2, 2, 1, 2],
       [1, 2, 2, 2, 2, 1, 2, 1, 1, 2],
       [2, 1, 2, 1, 2, 1, 1, 1, 1, 1],
       [2, 2, 1, 1, 1, 2, 2, 1, 2, 1],
       [1, 2, 1, 2, 2, 2, 2, 2, 2, 1],
       [2, 2, 1, 1, 2, 1, 1, 2, 2, 1]])

In [334]: grp_range_2d(a, start=1, axis=0)
Out[334]: 
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
       [1, 1, 1, 1, 2, 1, 1, 1, 1, 1],
       [1, 1, 2, 2, 1, 2, 1, 2, 2, 2],
       [1, 1, 1, 1, 2, 3, 1, 3, 1, 1],
       [2, 2, 1, 2, 3, 1, 2, 1, 2, 2],
       [1, 1, 2, 1, 4, 2, 1, 2, 3, 1],
       [2, 1, 1, 2, 1, 1, 1, 3, 1, 2],
       [1, 2, 2, 1, 1, 2, 2, 1, 2, 3],
       [1, 3, 3, 1, 2, 1, 1, 2, 3, 4]])

Hence, to solve our case with DataFrame input and output, it would be -

out = grp_range_2d(df.values, start=1,axis=0)
pd.DataFrame(out,columns=df.columns,index=df.index)
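As a sanity check, the vectorized routine can be compared against a plain Python loop. This sketch repeats grp_range_2drow from above so it runs standalone; the loop baseline is my own reference, not part of the answer:

```python
import numpy as np

def grp_range_2drow(a, start=0):
    # same function as above, repeated so this snippet runs standalone
    m, n = a.shape
    p = np.ones((m, 1), dtype=bool)
    a1 = np.concatenate((p, a[:, :-1] != a[:, 1:]), axis=1)
    d = np.flatnonzero(a1)
    c = np.ones(m * n, dtype=int)
    c[d[1:]] = d[:-1] - d[1:] + 1
    c[0] = start
    return c.cumsum().reshape(m, n)

def loop_baseline(a, start=1):
    # naive reference implementation: run lengths along each row
    out = np.empty(a.shape, dtype=int)
    for i, row in enumerate(a):
        for j, v in enumerate(row):
            if j > 0 and v == row[j - 1]:
                out[i, j] = out[i, j - 1] + 1
            else:
                out[i, j] = start
    return out

np.random.seed(343)
x = np.sort(np.random.randint(5, size=(10, 10)), axis=1)
assert (grp_range_2drow(x, start=1) == loop_baseline(x, start=1)).all()
```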

Answer 1 (score: 6):

There is also a numba solution. For this kind of tricky problem it always wins; here it is 7x faster than numpy, since only one pass over res is made.

import numpy as np
from numba import njit

@njit
def thefunc(arrc):
    m,n = arrc.shape
    res = np.empty((m+1,n), np.uint32)
    res[0] = 1
    for i in range(1,m+1):
        for j in range(n):
            if arrc[i-1,j]:
                res[i,j] = res[i-1,j]+1
            else:
                res[i,j] = 1
    return res

def numbering(arr):
    return thefunc(arr[1:] == arr[:-1])

I need to externalize arr[1:] == arr[:-1] since numba does not support strings.
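To show what thefunc computes, here is an illustrative pure-Python re-implementation of the same recurrence (no numba), applied to a tiny made-up array:

```python
import numpy as np

def thefunc_py(arrc):
    # pure-Python version of the numba kernel: arrc[i-1, j] is True when
    # rows i-1 and i of the original array are equal in column j
    m, n = arrc.shape
    res = np.empty((m + 1, n), np.uint32)
    res[0] = 1
    for i in range(1, m + 1):
        for j in range(n):
            res[i, j] = res[i - 1, j] + 1 if arrc[i - 1, j] else 1
    return res

arr = np.array([['a', 'a'],
                ['a', 'b'],
                ['a', 'b']])
# the row-equality mask is computed outside, mirroring the numba version
out = thefunc_py(arr[1:] == arr[:-1])
print(out.tolist())   # -> [[1, 1], [2, 1], [3, 2]]
```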

In [75]: %timeit numbering(arr)
13.7 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [76]: %timeit grp_range_2dcol(arr)
111 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For bigger arrays (100,000 rows x 100 columns), the gap is not as wide:

In [168]: %timeit a=grp_range_2dcol(arr)
1.54 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [169]: %timeit a=numbering(arr)
625 ms ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If arr can be converted to 'S8', we can save a lot of time:

In [398]: %timeit arr[1:]==arr[:-1]
584 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [399]: %timeit arr.view(np.uint64)[1:]==arr.view(np.uint64)[:-1]
196 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
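The 'S8' trick works because 8-byte fixed-width bytestrings occupy exactly one uint64 each, so the elementwise comparison can be reinterpreted as an integer comparison. A minimal sketch (the array here is made up for illustration):

```python
import numpy as np

np.random.seed(0)
# fixed-width 8-byte bytestrings: one uint64 per element
arr8 = np.sort(np.random.randint(5, size=(4, 6)), axis=1).astype('S8')

eq_str = arr8[1:] == arr8[:-1]                                   # bytestring comparison
eq_int = arr8.view(np.uint64)[1:] == arr8.view(np.uint64)[:-1]   # integer comparison

assert (eq_str == eq_int).all()
```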

Answer 2 (score: 2):

Using Divakar's grp_range column-wise is much faster, even though a fully vectorized approach probably exists.


The code:

#function of Divakar
def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return id_arr.cumsum()

#create the equivalent of (df != df.shift()).cumsum() but faster
arr_sum = np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:],0)+1])

#use grp_range column wise on arr_sum
arr_result = np.array([grp_range(np.unique(arr_sum[:,i],return_counts=1)[1]) 
                       for i in range(arr_sum.shape[1])]).T+1

To check the equality:

# of the cumsum
print (((df != df.shift()).cumsum() == 
         np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:],0)+1]))
         .all().all())
#True

print ((df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1) ==
        np.array([grp_range(np.unique(arr_sum[:,i],return_counts=1)[1]) 
                  for i in range(arr_sum.shape[1])]).T+1)
        .all().all())
#True

and the speed:

%timeit df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
#19.4 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
arr_sum = np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:],0)+1])
arr_res = np.array([grp_range(np.unique(arr_sum[:,i],return_counts=1)[1]) 
                    for i in range(arr_sum.shape[1])]).T+1
#562 µs ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

EDIT: with Numpy, you can also use np.maximum.accumulate together with np.arange.

Some TIMING follows; first, the np.maximum.accumulate solution:

def accumulate(arr):
    n,m = arr.shape
    arr_arange = np.arange(1,n+1)[:,np.newaxis]
    return np.concatenate([ np.ones((1,m)), 
                           arr_arange[1:] - np.maximum.accumulate(arr_arange[:-1]*
                      (arr[:-1,:] != arr[1:,:]))],axis=0)
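A quick check of accumulate on a tiny made-up array (the function is repeated here so the snippet runs standalone):

```python
import numpy as np

def accumulate(arr):
    # same function as above, repeated so this snippet runs standalone
    n, m = arr.shape
    arr_arange = np.arange(1, n + 1)[:, np.newaxis]
    return np.concatenate([np.ones((1, m)),
                           arr_arange[1:] - np.maximum.accumulate(
                               arr_arange[:-1] * (arr[:-1, :] != arr[1:, :]))], axis=0)

# hypothetical tiny input: groups run down each column
small = np.array([['a', 'a'],
                  ['a', 'b'],
                  ['a', 'b']])
out = accumulate(small)
print(out.tolist())   # -> [[1.0, 1.0], [2.0, 1.0], [3.0, 2.0]]
```

np.maximum.accumulate keeps track of the 1-based row index of the last change in each column, so subtracting it from the current row index gives the run length so far.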

Using arr_100 = np.sort(np.random.randint(50, size=(100000, 100)), axis=1).astype(str):

Solution with np.maximum.accumulate:

%timeit accumulate(arr_100)
#520 ms ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution of Divakar:

%timeit grp_range_2drow(arr_100.T, start=1).T
#1.15 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution with numba of B. M.: