Generating a 1D numpy array with chunks of random length

Time: 2016-01-08 12:14:57

Tags: python arrays numpy

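I need to generate a 1D array in which repeated copies of an integer sequence are separated by runs of zeros of random length.

So far I am using the following code: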

from random import normalvariate

import numpy as np

# np.int was removed in NumPy 1.24; the builtin int is the same dtype
regular_sequence = np.array([1,2,3,4,5], dtype=int)
n_iter = 10
lag_mean = 10 # mean length of zeros sequence
lag_sd = 1 # standard deviation of zeros sequence length

# Sequence of lags lengths
lag_seq = [int(round(normalvariate(lag_mean, lag_sd))) for x in range(n_iter)]

# Generate list of concatenated zeros and regular sequences
seq = [np.concatenate((np.zeros(x, dtype=int), regular_sequence)) for x in lag_seq]
seq = np.concatenate(seq)

It works, but it looks slow when I need many long sequences. How can I optimize it?

4 answers:

Answer 0: (score: 4)

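You can pre-compute the indices where the repeated regular_sequence elements are to be placed, and then set those elements in a vectorized manner. To pre-compute the indices, use np.cumsum to get the start of each copy of regular_sequence, then add a contiguous range of integers whose length is the size of regular_sequence to get every index that has to be updated. The implementation would look something like this: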

# Size of regular_sequence
N = regular_sequence.size

# Use cumsum to pre-compute start of every occurrence of regular_sequence
offset_arr = np.cumsum(lag_seq)
idx = np.arange(offset_arr.size)*N + offset_arr

# Setup output array
out = np.zeros(idx.max() + N,dtype=regular_sequence.dtype)

# Broadcast the start indices to include entire length of regular_sequence
# to get all positions where regular_sequence elements are to be set
np.put(out,idx[:,None] + np.arange(N),regular_sequence)
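As a hedged illustration of the indexing idea above, the snippet below wraps it in a function and compares it against the plain concatenation approach from the question. The function names `fill_by_index` and `fill_by_concatenate` and the tiny input are made up for this sketch; it also assumes `lag_seq` is non-empty.

```python
import numpy as np

def fill_by_index(lag_seq, regular_sequence):
    """Place a copy of regular_sequence after each run of lag_seq[i] zeros."""
    lag_seq = np.asarray(lag_seq)
    n = regular_sequence.size
    # Start of the i-th copy: all zeros so far plus the i copies already placed
    starts = np.cumsum(lag_seq) + np.arange(lag_seq.size) * n
    out = np.zeros(starts[-1] + n, dtype=regular_sequence.dtype)
    # Broadcasting (k, 1) + (n,) yields a (k, n) grid of target positions
    out[starts[:, None] + np.arange(n)] = regular_sequence
    return out

def fill_by_concatenate(lag_seq, regular_sequence):
    """Reference version: build each zeros+pattern piece and join them."""
    pieces = [np.concatenate((np.zeros(lag, dtype=regular_sequence.dtype),
                              regular_sequence)) for lag in lag_seq]
    return np.concatenate(pieces)

lags = [2, 0, 3]                # hypothetical lag lengths
pattern = np.array([1, 2, 3])   # hypothetical regular sequence
print(fill_by_index(lags, pattern))
# [0 0 1 2 3 1 2 3 0 0 0 1 2 3]
```

Both functions should return the same array for any non-negative integer lags, which makes the concatenation version a convenient correctness check before timing anything.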

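Answer 1: (score: 2)

I think the best approach is to use convolution. You can work out the lag lengths, combine them with the length of the sequence, and use that to find the starting point of each regular sequence. Set those starting points to one in an array of zeros, then convolve with the regular sequence to fill in the values.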

import numpy as np

# np.int was removed in NumPy 1.24; the builtin int is the same dtype
regular_sequence = np.array([1,2,3,4,5], dtype=int)
n_iter = 10000000
lag_mean = 10 # mean length of zeros sequence
lag_sd = 1 # standard deviation of zeros sequence length

# Sequence of lags lengths
lag_lens = np.round(np.random.normal(lag_mean, lag_sd, n_iter)).astype(int)
# Distance between consecutive starts = lag + length of the regular sequence
lag_lens[1:] += len(regular_sequence)
# No "-1" here, so the first copy is preceded by exactly lag_lens[0] zeros
starts_inds = lag_lens.cumsum()

# Mark each start with a one, then convolve to stamp the regular sequence there
seq = np.zeros(starts_inds[-1] + 1, dtype=int)
seq[starts_inds] = 1
seq = np.convolve(seq, regular_sequence)

Even after changing your version to use the numpy random number generator, this approach takes roughly 1/20 of the time on large sequences.
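To see why convolution fills in the values, here is a minimal sketch on a tiny, made-up input: convolving an array of unit impulses with a pattern stamps one copy of the pattern at each impulse. The names `pattern`, `impulses` and `close` are illustrative only.

```python
import numpy as np

pattern = np.array([1, 2, 3])

# Ones at positions 2 and 6 mark where each copy should start
impulses = np.zeros(7, dtype=int)
impulses[[2, 6]] = 1
stamped = np.convolve(impulses, pattern)
print(stamped)      # [0 0 1 2 3 0 1 2 3]

# Caveat: impulses closer together than len(pattern) make the copies
# overlap, and convolution adds the overlapping values
close = np.zeros(4, dtype=int)
close[[2, 3]] = 1
overlapped = np.convolve(close, pattern)
print(overlapped)   # [0 0 1 3 5 3]
```

The second case is why the starts have to be spaced by lag plus pattern length: with that spacing the copies can never overlap.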

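Answer 2: (score: 1)

Because the data is not aligned, this is not a trivial problem. Performance depends on what counts as a "long" sequence. Take a "square" problem as an example: many long regular sequences and long runs of zeros (n_iter==n_reg==lag_mean):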

import numpy as np
n_iter = 1000
n_reg = 1000
regular_sequence = np.arange(n_reg, dtype=int)  # np.int was removed in NumPy 1.24
lag_mean = n_reg # mean length of zeros sequence
lag_sd = lag_mean/10 # standard deviation of zeros sequence length
lag_seq=np.int64(np.random.normal(lag_mean,lag_sd,n_iter)) # Sequence of lags lengths

First, your solution:

def seq_hybrid():
    seqs = [np.concatenate((np.zeros(x, dtype=int), regular_sequence)) for x in lag_seq]
    seq = np.concatenate(seqs)
    return seq

Then a pure numpy one:

def seq_numpy():
    seq=np.zeros(lag_seq.sum()+n_iter*n_reg,dtype=int)
    cs=np.cumsum(lag_seq+n_reg)-n_reg
    indexes=np.add.outer(cs,np.arange(n_reg))
    seq[indexes]=regular_sequence
    return seq

A for-loop solution:

def seq_python():
    seq=np.empty(lag_seq.sum()+n_iter*n_reg,dtype=int)
    i=0
    for lag in lag_seq:
        for k in range(lag):
            seq[i]=0
            i+=1
        for k in range(n_reg):
            seq[i]=regular_sequence[k]
            i+=1    
    return seq

With numba just-in-time compilation:

from numba import jit
seq_numba=jit(seq_python)
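The jitted function above reads lag_seq, regular_sequence and n_reg as globals, which numba treats as compile-time constants. A hedged variant that passes the arrays as arguments might look like the sketch below. The name `fill_loop` is made up, the lags are assumed to be an int64 array, and the fallback decorator is only there so the snippet still runs (slowly) where numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without numba, run the plain Python loop
    def njit(func):
        return func

@njit
def fill_loop(lag_seq, regular_sequence):
    n_reg = regular_sequence.size
    # Zero-initialised output, so only the pattern values need writing
    seq = np.zeros(lag_seq.sum() + lag_seq.size * n_reg, dtype=np.int64)
    i = 0
    for lag in lag_seq:
        i += lag                      # skip over the run of zeros
        for k in range(n_reg):
            seq[i] = regular_sequence[k]
            i += 1
    return seq

lags = np.array([2, 0, 3], dtype=np.int64)    # hypothetical lag lengths
pattern = np.array([1, 2, 3], dtype=np.int64)
looped = fill_loop(lags, pattern)
print(looped)   # [0 0 1 2 3 1 2 3 0 0 0 1 2 3]
```

The first call includes compilation time when numba is present, so any timing should be taken from a second call onward.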

Now the tests:

In [96]: %timeit seq_hybrid()
10 loops, best of 3: 38.5 ms per loop

In [97]: %timeit seq_numpy()
10 loops, best of 3: 34.4 ms per loop

In [98]: %timeit seq_python()
1 loops, best of 3: 1.56 s per loop

In [99]: %timeit seq_numba()
100 loops, best of 3: 12.9 ms per loop

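In this case your hybrid solution is as fast as the pure numpy one, because performance depends mainly on the inner loop, and yours (zeros and concatenate) is a numpy one. Predictably, the python solution is slower, by the usual factor of about 40x. But numpy is not optimal here, because it uses fancy indexing, which is necessary for unaligned data. This is where numba can help: only the minimal operations are done, at C level, for a gain of about 120x over the python solution this time.

For other values of n_iter and n_reg, the speed-up factors relative to the python solution are: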

n_iter= 1000, n_reg= 1000 : seq_numba 124, seq_hybrid 49, seq_numpy 44. 
n_iter= 10, n_reg= 100000 : seq_numba 123, seq_hybrid 104, seq_numpy 49. 
n_iter= 100000, n_reg= 10 : seq_numba 127, seq_hybrid 1, seq_numpy 42. 
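A comparison like this can be reproduced with the standard timeit module. The harness below is only a sketch: the sizes are scaled down so it runs quickly, the helper `best_time` and the seed are made up, and the absolute numbers will differ by machine and library version. It times the concatenation and fancy-indexing variants only, written as functions of their inputs.

```python
import timeit
import numpy as np

def seq_concat(lag_seq, regular_sequence):
    # Question's approach: build each zeros+pattern piece, then join them
    parts = [np.concatenate((np.zeros(lag, dtype=int), regular_sequence))
             for lag in lag_seq]
    return np.concatenate(parts)

def seq_indexed(lag_seq, regular_sequence):
    # Fancy-indexing approach: compute every target position up front
    n_reg = regular_sequence.size
    seq = np.zeros(lag_seq.sum() + lag_seq.size * n_reg, dtype=int)
    starts = np.cumsum(lag_seq + n_reg) - n_reg
    seq[np.add.outer(starts, np.arange(n_reg))] = regular_sequence
    return seq

def best_time(func, *args, repeat=3, number=5):
    # Best per-call time in seconds over `repeat` runs of `number` calls
    runs = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(runs) / number

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
agree = {}
for n_iter, n_reg in [(200, 200), (10, 4000), (4000, 10)]:
    regular_sequence = np.arange(n_reg)
    lag_seq = rng.normal(n_reg, n_reg / 10, n_iter).astype(np.int64)
    # Check both variants agree before comparing their speed
    agree[(n_iter, n_reg)] = np.array_equal(
        seq_concat(lag_seq, regular_sequence),
        seq_indexed(lag_seq, regular_sequence))
    t_concat = best_time(seq_concat, lag_seq, regular_sequence)
    t_index = best_time(seq_indexed, lag_seq, regular_sequence)
    print(f"n_iter={n_iter}, n_reg={n_reg}: "
          f"concat {t_concat * 1e3:.3f} ms, indexed {t_index * 1e3:.3f} ms")
```

Checking that the outputs match before timing guards against comparing a fast but wrong implementation with a slow but correct one.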

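Answer 3: (score: 0)

I thought an answer posted on this question had a good approach, using a binary mask and np.convolve, but that answer was deleted and I don't know why. Here is a version of it with two issues addressed.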

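import numpy as np

def insert_sequence(lag_seq, regular_sequence):
    # offsets are start-to-start positions, so for every copy after the first
    # the run of zeros before it is lag - len(regular_sequence) long
    # (this assumes each lag is at least len(regular_sequence))
    offsets = np.cumsum(lag_seq)
    # Binary mask with a one at the start of every copy
    start_locs = np.zeros(offsets[-1] + 1, dtype=regular_sequence.dtype)
    start_locs[offsets] = 1
    return np.convolve(start_locs, regular_sequence)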

lag_seq = np.random.normal(15,1,10)
lag_seq = lag_seq.astype(np.uint8)
regular_sequence = np.arange(1, 6)
seq = insert_sequence(lag_seq, regular_sequence)

print(repr(seq))
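In insert_sequence the lags are measured from one start to the next. If each lag should instead be the exact number of zeros between copies, as in the question, one possible adjustment is sketched below. The name `insert_sequence_with_gaps` and the tiny input are made up for illustration, and this is not part of the original answer.

```python
import numpy as np

def insert_sequence_with_gaps(lag_seq, regular_sequence):
    lag_seq = np.asarray(lag_seq, dtype=np.int64)
    n = regular_sequence.size
    # Distance from one start to the next = zeros in between + pattern length
    steps = lag_seq.copy()
    steps[1:] += n
    starts = np.cumsum(steps)
    # Binary mask with a one at the start of every copy
    mask = np.zeros(starts[-1] + 1, dtype=regular_sequence.dtype)
    mask[starts] = 1
    return np.convolve(mask, regular_sequence)

lags = [2, 0, 3]                # hypothetical lag lengths
pattern = np.array([1, 2, 3])   # hypothetical regular sequence
gapped = insert_sequence_with_gaps(lags, pattern)
print(gapped)   # [0 0 1 2 3 1 2 3 0 0 0 1 2 3]
```

Because consecutive starts are always at least one pattern length apart, the copies cannot overlap, even when a lag is zero.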