I need to find the mode of a NumPy array that I read from an HDF5 file. The array is 1-D and contains floating-point values.
my_array=f1[ds_name].value
mod_value=scipy.stats.mode(my_array)
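The array holds about 1M values, and my script takes about 15 minutes to return the mode. Is there any way to make this faster?
A second question: why does scipy.stats.median(my_array) fail when mode works?
AttributeError: module 'scipy.stats' has no attribute 'median'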
Answer 0: (score: 2)
The implementation of scipy.stats.mode has a Python loop for handling the axis argument with multidimensional arrays. For one-dimensional arrays only, the following simple implementation is faster:
import numpy as np
def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]
Here is an example. First, create an array of integers with length 1000000.
In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)
In [41]: x.shape
Out[41]: (1000000,)
Check that scipy.stats.mode and mode1 give the same result.
In [42]: from scipy.stats import mode
In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))
In [44]: mode1(x)
Out[44]: (1009, 1066)
Now check the performance.
In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
mode(x) takes 2.91 seconds; mode1(x) takes only 39.6 milliseconds.
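For readers who want to run the np.unique approach outside IPython, here is a minimal self-contained sketch. The sample array is made up for illustration, and the cross-check against collections.Counter is an addition for this sketch rather than part of the answer.

```python
from collections import Counter

import numpy as np


def mode1(x):
    # np.unique returns the sorted distinct values and, with
    # return_counts=True, how often each one occurs.
    values, counts = np.unique(x, return_counts=True)
    # argmax returns the first maximum, so ties resolve to the smallest value.
    m = counts.argmax()
    return values[m], counts[m]


# Illustrative data (an assumption, not the asker's HDF5 dataset).
sample = np.array([0.5, 1.5, 1.5, 2.0, 0.5, 1.5], dtype=np.float32)
value, count = mode1(sample)

# Cross-check against a pure-Python count of the same values.
expected_value, expected_count = Counter(sample.tolist()).most_common(1)[0]
print(value, count)
```

Because np.unique sorts the distinct values, an array with several equally frequent values yields the smallest of them.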
Answer 1: (score: 1)
Here is a sorting-based approach:
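import numpy as np
def mode1d(ar_sorted):
    # Sorts the input in place; pass a copy to keep the caller's array unchanged.
    ar_sorted.sort()
    # Indices i where ar_sorted[i] != ar_sorted[i+1], i.e. the last index of each run.
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    # A single distinct value leaves idx empty; handle it before indexing idx.
    if idx.size == 0:
        return ar_sorted[0], ar_sorted.size
    # Run lengths: first run, interior runs, last run.
    count = np.empty(idx.size+1,dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()
    if argmax_idx==len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount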
Note that this sorts the input array in place and therefore changes it. If you need the input left unmodified, or you mind it ending up sorted, pass a copy.
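The side effect can be seen on a small array. This is a sketch with made-up data; mode1d is restated so the block runs on its own, with one added guard for an array holding a single distinct value (an addition for this sketch).

```python
import numpy as np


def mode1d(ar_sorted):
    ar_sorted.sort()  # in-place sort: this is what mutates the caller's array
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    if idx.size == 0:  # only one distinct value, so no run boundaries
        return ar_sorted[0], ar_sorted.size
    count = np.empty(idx.size + 1, dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()
    if argmax_idx == len(idx):
        return ar_sorted[-1], count[argmax_idx]
    return ar_sorted[idx[argmax_idx]], count[argmax_idx]


x = np.array([3.0, 1.0, 2.0, 3.0, 1.0, 3.0])  # illustrative data
original = x.copy()

# Passing a copy leaves x in its original order.
result_copy = mode1d(x.copy())
unchanged = np.array_equal(x, original)

# Passing x itself sorts it as a side effect.
result_inplace = mode1d(x)
now_sorted = np.array_equal(x, np.sort(original))
```

The copy costs one extra allocation and pass over the data, which is the trade-off for keeping the caller's array intact.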
Sample run on 1M elements:
In [65]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)
In [66]: from scipy.stats import mode
In [67]: mode(x)
Out[67]: ModeResult(mode=array([ 295.]), count=array([1098]))
In [68]: mode1d(x)
Out[68]: (295.0, 1098)
Runtime test
In [75]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)
# Scipy's mode
In [76]: %timeit mode(x)
1 loop, best of 3: 1.64 s per loop
# @Warren Weckesser's soln
In [77]: %timeit mode1(x)
10 loops, best of 3: 52.7 ms per loop
# Proposed in this post
In [78]: %timeit mode1d(x)
100 loops, best of 3: 12.8 ms per loop
When a copy is passed, the timing of mode1d is comparable to that of mode1.
Answer 2: (score: 0)
I added the two functions from the answers above, mode1 and mode1d, to my script and compared them with scipy.stats.mode.
import os
import time
import h5py
import numpy as np
import scipy.stats
dir_name="C:/Users/test_mode"
file_name="myfile2.h5"
ds_name="myds"
f_in=os.path.join(dir_name,file_name)
def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]
def mode1d(ar_sorted):
    ar_sorted.sort()
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    count = np.empty(idx.size+1,dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()
    if argmax_idx==len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount
startTime=time.time()
# The script only reads, so open read-only. ds[()] loads the whole dataset
# and replaces the .value attribute, which was removed in h5py 3.0.
with h5py.File(f_in, "r") as f1:
    myds=f1[ds_name][()]
time1=time.time()
file_read_time=time1-startTime
print(str(file_read_time)+"\t"+"s"+"\t"+str((file_read_time)/60)+"\t"+"min")
print("mode_scipy=")
mode_scipy=scipy.stats.mode(myds)
print(mode_scipy)
time2=time.time()
mode_scipy_time=time2-time1
print(str(mode_scipy_time)+"\t"+"s"+"\t"+str((mode_scipy_time)/60)+"\t"+"min")
print("mode1=")
# Results get their own names so the functions are not overwritten.
mode1_result=mode1(myds)
print(mode1_result)
time3=time.time()
mode1_time=time3-time2
print(str(mode1_time)+"\t"+"s"+"\t"+str((mode1_time)/60)+"\t"+"min")
print("mode1d=")
mode1d_result=mode1d(myds)
print(mode1d_result)
time4=time.time()
mode1d_time=time4-time3
print(str(mode1d_time)+"\t"+"s"+"\t"+str((mode1d_time)/60)+"\t"+"min")
The results of running the script on a NumPy array of about 1M values are:
mode_scipy = ModeResult(mode=array([1.11903353e-06], dtype=float32), count=array([304909]))
938.8368742465973 s    15.647281237443288 min
mode1 = (1.1190335e-06, 304909)
0.06500649452209473 s    0.0010834415753682455 min
mode1d = (1.1190335e-06, 304909)
0.06200599670410156 s    0.0010334332784016928 min
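The repeated time.time() bookkeeping in the script could be folded into a small helper. The sketch below is one possible way to do that, not the script that produced the numbers above: the name timed and the synthetic array are assumptions for illustration, and time.perf_counter is swapped in for time.time() because it is intended for measuring intervals.

```python
import time

import numpy as np


def timed(label, func, *args):
    # Hypothetical helper: run func(*args), print the elapsed time in the
    # same "s / min" layout as the script, and return (result, seconds).
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}= {result}")
    print(f"{elapsed}\ts\t{elapsed / 60}\tmin")
    return result, elapsed


def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]


# Synthetic stand-in for the HDF5 dataset (an assumption): 1M float32 values.
rng = np.random.default_rng(0)
data = rng.integers(0, 1000, size=1_000_000).astype(np.float32)

(mode_value, mode_count), seconds = timed("mode1", mode1, data)
```

Each function under test then needs a single call, and the elapsed time is measured around that call only.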
Answer 3: (score: 0)
When several values share the highest count, this returns the smallest of them.
class Averages:
    def __init__(self, inputArray):
        # Expects a Python list: mode() uses list.count and median() sorts it in place.
        self.inputArray = inputArray
    def mode(self):
        """
        if multiple modes, returns min of multiple mode values
        """
        rDic = {}
        res = set(self.inputArray)
        currentMax = 0
        result = []
        for char in res:
            value = self.inputArray.count(char)
            rDic[char] = value
            if currentMax < value:
                currentMax = value
        for key, value in rDic.items():
            if value == currentMax:
                result.append(key)
        result.sort()
        return result[0]
    def mean(self):
        meanValue = sum(self.inputArray)/len(self.inputArray)
        return meanValue
    def median(self):
        lenArray = len(self.inputArray)
        self.inputArray.sort()
        if lenArray % 2 == 1:
            # Odd length: the single middle element.
            median = self.inputArray[lenArray // 2]
        else:
            # Even length: average of the two middle elements.
            median = (self.inputArray[(lenArray // 2) - 1] + self.inputArray[lenArray // 2]) / 2
        return median
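For a NumPy array such as the one in the question, the same three statistics can be computed without explicit Python loops. The sketch below is an alternative under stated assumptions, not part of the class above: the function name averages and the sample data are made up, np.mean and np.median are NumPy's own functions (np.median also covers the median the question asked about), and the mode relies on np.unique returning sorted values so that ties resolve to the smallest one.

```python
import numpy as np


def averages(arr):
    # Hypothetical helper returning (mode, mean, median) for a 1-D array.
    arr = np.asarray(arr)
    values, counts = np.unique(arr, return_counts=True)
    # values is sorted and argmax picks the first maximum,
    # so tied modes resolve to the smallest value.
    mode_value = values[counts.argmax()]
    return mode_value, arr.mean(), np.median(arr)


data = [4.0, 1.0, 2.0, 2.0, 4.0, 3.0]  # illustrative; 2.0 and 4.0 tie
mode_value, mean_value, median_value = averages(data)
```

Unlike list.count inside a loop, which rescans the list once per distinct value, np.unique sorts the data once, so it should scale much better when the array has many distinct values.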