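What is the fastest way to get the mode of a NumPy array?

Time: 2017-09-22 13:22:51

Tags: numpy scipy

I have to find the mode of a NumPy array that I read from an hdf5 file. The NumPy array is 1d and contains floating-point values.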

my_array=f1[ds_name].value    
mod_value=scipy.stats.mode(my_array)

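My array is 1d and contains about 1M values. My script takes about 15 minutes to return the mode value. Is there any way to make this faster?

Another question: why does scipy.stats.median(my_array) not work when mode does?

AttributeError: module 'scipy.stats' has no attribute 'median'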

4 answers:

Answer 0: (score: 2)

The implementation of scipy.stats.mode has a Python loop for handling the axis argument with multidimensional arrays. For one-dimensional arrays only, the following simple implementation is faster:

import numpy as np

def mode1(x):
    # distinct values (sorted) and the number of times each occurs
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

Here is an example. First, create an array of integers with length 1000000.

In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)

In [41]: x.shape
Out[41]: (1000000,)

Check that scipy.stats.mode and mode1 give the same result.

In [42]: from scipy.stats import mode

In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))

In [44]: mode1(x)
Out[44]: (1009, 1066)

Now check the performance.

In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is 2.91 seconds for mode(x) and only 39.6 milliseconds for mode1(x).
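To see why mode1 works, it may help to look at what np.unique returns on a tiny array. The following is a minimal sketch; the sample values are made up for illustration. np.unique returns the distinct values in sorted order, so when several values share the highest count, argmax picks the first of them, which is the smallest value.

```python
import numpy as np

def mode1(x):
    # Distinct values (sorted ascending) and how often each occurs.
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()  # index of the first maximum count
    return values[m], counts[m]

# Made-up data: 1.5 and 2.5 both occur twice.
sample = np.array([2.5, 1.5, 2.5, 3.0, 1.5])

values, counts = np.unique(sample, return_counts=True)
print(values)  # distinct values, sorted
print(counts)  # count for each distinct value

mode_value, mode_count = mode1(sample)
print(mode_value, mode_count)
```

If a different tie-breaking rule is needed, the values and counts arrays can be inspected directly instead of relying on argmax.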

Answer 1: (score: 1)

Here is a sorting-based approach:

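def mode1d(ar_sorted):
    # sorts the input array in place
    ar_sorted.sort()
    # indices where the value changes, i.e. the last index of each run
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    if idx.size == 0:
        # all elements are equal (assumes a non-empty array)
        return ar_sorted[0], ar_sorted.size
    # length of each run of equal values
    count = np.empty(idx.size+1,dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()

    if argmax_idx==len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount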

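Note that this sorts, and therefore mutates, the input array. So if you want to keep the input array un-mutated, or you mind the input array ending up sorted, pass a copy.

Sample run on 1M elements: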

In [65]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)

In [66]: from scipy.stats import mode

In [67]: mode(x)
Out[67]: ModeResult(mode=array([ 295.]), count=array([1098]))

In [68]: mode1d(x)
Out[68]: (295.0, 1098)

Runtime test

In [75]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)

# Scipy's mode
In [76]: %timeit mode(x)
1 loop, best of 3: 1.64 s per loop

# @Warren Weckesser's soln
In [77]: %timeit mode1(x)
10 loops, best of 3: 52.7 ms per loop

# Proposed in this post
In [78]: %timeit mode1d(x)
100 loops, best of 3: 12.8 ms per loop

When a copy is passed, the timing for mode1d is comparable to that of mode1.
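The point about mutation can be illustrated with a non-mutating variant. This is only a sketch under stated assumptions: mode1d_copy is a hypothetical name, and it computes run lengths with np.diff rather than with the index arithmetic used above. np.sort returns a sorted copy, so the caller's array is left as it was.

```python
import numpy as np

def mode1d_copy(ar):
    # Hypothetical helper: np.sort returns a sorted copy, so `ar` is untouched.
    ar_sorted = np.sort(ar)
    if ar_sorted.size == 0:
        raise ValueError("mode of an empty array is undefined")
    # Start index of each run of equal values.
    is_start = np.concatenate(([True], ar_sorted[1:] != ar_sorted[:-1]))
    starts = np.flatnonzero(is_start)
    # Run lengths are the gaps between consecutive starts (and the array end).
    run_lengths = np.diff(np.concatenate((starts, [ar_sorted.size])))
    best = run_lengths.argmax()
    return ar_sorted[starts[best]], run_lengths[best]

x = np.array([3.0, 1.0, 3.0, 2.0, 3.0, 1.0])  # made-up data
before = x.copy()
modeval, modecount = mode1d_copy(x)
print(modeval, modecount)
print(np.array_equal(x, before))  # the input order is preserved
```

The extra copy costs memory and time proportional to the array size, which is the trade-off the timing remark above refers to.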

Answer 2: (score: 0)

I added the two functions mode1 and mode1d from the answers above to my script and compared them with scipy.stats.mode.

import os
import time

import h5py
import numpy as np
import scipy.stats

dir_name="C:/Users/test_mode"
file_name="myfile2.h5"
ds_name="myds"
f_in=os.path.join(dir_name,file_name)

def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

def mode1d(ar_sorted):
    ar_sorted.sort()
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    count = np.empty(idx.size+1,dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()

    if argmax_idx==len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount


startTime=time.time()
with h5py.File(f_in, "r") as f1:  # read-only access is enough here

    # Dataset.value was removed in h5py 3.0; [()] reads the whole dataset
    myds=f1[ds_name][()]
    time1=time.time()
    file_read_time=time1-startTime
    print(str(file_read_time)+"\t"+"s"+"\t"+str((file_read_time)/60)+"\t"+"min")

    print("mode_scipy=")
    mode_scipy=scipy.stats.mode(myds)
    print(mode_scipy)
    time2=time.time()
    mode_scipy_time=time2-time1
    print(str(mode_scipy_time)+"\t"+"s"+"\t"+str((mode_scipy_time)/60)+"\t"+"min")

    print("mode1=")
    # separate result name so the function mode1 is not shadowed
    mode1_result=mode1(myds)
    print(mode1_result)
    time3=time.time()
    mode1_time=time3-time2
    print(str(mode1_time)+"\t"+"s"+"\t"+str((mode1_time)/60)+"\t"+"min")

    print("mode1d=")
    # mode1d sorts myds in place
    mode1d_result=mode1d(myds)
    print(mode1d_result)
    time4=time.time()
    mode1d_time=time4-time3
    print(str(mode1d_time)+"\t"+"s"+"\t"+str((mode1d_time)/60)+"\t"+"min")

The results of running the script on a NumPy array of about 1M values are:

mode_scipy=
ModeResult(mode=array([1.11903353e-06], dtype=float32), count=array([304909]))
938.8368742465973    s    15.647281237443288    min

mode1=
(1.1190335e-06, 304909)
0.06500649452209473    s    0.0010834415753682455    min

mode1d=
(1.1190335e-06, 304909)
0.06200599670410156    s    0.0010334332784016928    min
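A comparison like the one above can also be written with a small timing helper. The sketch below is an assumption-laden illustration, not the script that produced these numbers: timed is a hypothetical helper, the data is a synthetic stand-in for the HDF5 dataset, and time.perf_counter is used because it is intended for measuring short durations. Each function receives its own copy, so an in-place sort in one call cannot make a later call faster.

```python
import time
import numpy as np

def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

def timed(func, arr):
    # Hypothetical helper: run func on a private copy of arr and
    # return (result, elapsed seconds). The copy is not timed.
    data = arr.copy()
    start = time.perf_counter()
    result = func(data)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Synthetic stand-in for the dataset: 1M float32 values with many repeats.
rng = np.random.default_rng(0)
myds = rng.integers(0, 1000, size=1_000_000).astype(np.float32)

result, elapsed = timed(mode1, myds)
print(f"mode1 = {result}  {elapsed:.4f} s  {elapsed / 60:.6f} min")
```

The same helper can be called with scipy.stats.mode or mode1d to compare all three on identical input.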

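Answer 3: (score: 0)

When several values share the highest count, this returns the smallest of them. Note that inputArray must be a Python list, because mode() relies on list.count().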

class Averages:

    def __init__(self, inputArray):
        self.inputArray = inputArray

    def mode(self):
        """
        if multiple modes, returns min of multiple mode values
        """
        rDic = {}
        res = set(self.inputArray)
        currentMax = 0
        result = []
        for char in res:
            value = self.inputArray.count(char)
            rDic[char] = value
            if currentMax < value:
                currentMax = value
        for key, value in rDic.items():
            if value == currentMax:
                result.append(key)
        result.sort()
        return result[0]

    def mean(self):
        meanValue = sum(self.inputArray)/len(self.inputArray)
        return meanValue

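    def median(self):
        lenArray = len(self.inputArray)
        self.inputArray.sort()  # sorts the input in place
        if lenArray % 2 == 1:
            # odd length: the single middle element
            median = self.inputArray[lenArray // 2]
        else:
            # even length: average of the two middle elements
            median = (self.inputArray[(lenArray // 2) - 1] + self.inputArray[lenArray // 2]) / 2
        return median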
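For NumPy arrays, the same three statistics can be obtained without a Python loop. The following is a rough sketch rather than a replacement for the class above: averages is a hypothetical helper name, and the input list is made up. It uses np.unique for the mode and the NumPy functions np.mean and np.median, the latter being one way to get a median given that scipy.stats has no median attribute, as the error in the question shows.

```python
import numpy as np

def averages(arr):
    # Hypothetical helper mirroring the class above with NumPy calls.
    arr = np.asarray(arr)
    values, counts = np.unique(arr, return_counts=True)
    # values is sorted, so argmax (first maximum) gives the smallest mode on ties.
    mode_value = values[counts.argmax()]
    return mode_value, np.mean(arr), np.median(arr)

# Made-up data: 2 and 4 both occur twice.
mode_value, mean_value, median_value = averages([4, 1, 2, 2, 4, 3])
print(mode_value, mean_value, median_value)
```

Unlike the class, this does not sort or otherwise modify the caller's data.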