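I have a working declustering algorithm that I would like to speed up using numpy. Given an array `a`, the consecutive differences `diffa` are obtained. Each of these consecutive differences is then checked to see whether it is greater than or less than some threshold `t_c`, which produces a boolean array of `0`'s and `1`'s for `False` and `True`. Since `diffa` has one fewer element than `a`, the counting pattern is modified slightly. First, the size of each cluster of `0`'s and `1`'s is computed as an array `cl_size`. If a sub-array contains `0`'s, the size of the cluster is its original size plus 1; if it contains `1`'s, the size of the cluster is its original size minus 1 (in the code, a run of n `1`'s is recorded as n - 1 clusters of size 1). Below is an example that I would like to adapt for a much larger dataset.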

import numpy as np

thresh = 21
a = np.array([1, 2, 5, 10, 20, 40, 70, 71, 72, 74, 100, 130, 160, 171, 200, 201])
diffa = np.diff(a)
print(diffa)
# >> [ 1  3  5 10 20 30  1  1  2 26 30 30 11 29  1]

def get_cluster_statistics(array, t_c, func_kw='all'):
    """ This function separates clusters of datapoints such that the number
    of clusters and the number of events in each cluster can be known. """
    # GET CONSECUTIVE DIFFERENCES
    ts_dif = np.diff(array)
    # GET BOOLEAN ARRAY MASK OF 0's AND 1's FOR TIMES ABOVE THRESHOLD T_C
    bool_mask = np.array(ts_dif > t_c) * 1
    # COPY BOOLEAN ARRAY MASK (DO NOT MODIFY ORIGINAL ARRAY)
    # NOTE: bool_mask[:] WOULD ONLY BE A VIEW OF THE SAME DATA, SO USE .copy()
    bm_arr = bool_mask.copy()
    # SPLIT CLUSTERS INTO SUB-ARRAYS
    res = np.split(bm_arr, np.where(abs(np.diff(bm_arr)) != 0)[0] + 1)
    print(res)
    # >> [array([0, 0, 0, 0, 0]), array([1]), array([0, 0, 0]), array([1, 1, 1]), array([0]), array([1]), array([0])]
    # GET SIZE OF EACH SUB-ARRAY CLUSTER
    cl_size = np.array([res[idx].size for idx in range(len(res))])
    print(cl_size)
    # >> [5 1 3 3 1 1 1]
    # CHOOSE BETWEEN CHECKING ANY OR ALL VALUES OF SUB-ARRAYS (check timeit)
    func = dict(zip(['all', 'any'], [np.all, np.any]))[func_kw]
    # INITIALIZE EMPTY OUTPUT LIST
    ans = []
    # CHECK EACH SPLIT SUB-ARRAY IN RES
    for idx in range(len(res)):
        # print("res[%d] = %s" %(idx, res[idx]))
        if func(res[idx] == 1):
            # A RUN OF n 1's GIVES n - 1 CLUSTERS OF SIZE 1
            term = [1 for jdx in range(cl_size[idx]-1)]
            ans.append(term)
        elif func(res[idx] == 0):
            # A RUN OF n 0's GIVES ONE CLUSTER OF SIZE n + 1
            term = [int(cl_size[idx]) + 1]
            ans.append(term)
    print(ans)
    # >> [[6], [], [4], [1, 1], [2], [], [2]]
    # FLATTEN THE LIST OF LISTS (np.sum(ans) RELIES ON RAGGED OBJECT ARRAYS,
    # WHICH NEWER NUMPY VERSIONS REJECT)
    out = [size for term in ans for size in term]
    print(out)
    # >> [6, 4, 1, 1, 2, 2]
    return out

get_cluster_statistics(a, thresh, 'any')

After this, I apply `Counter` from the importable module `collections` to count the frequency of clusters of each size.
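
For illustration, the counting step looks roughly like the following, with the cluster sizes from the example above hard-coded. The variable names `cluster_sizes` and `size_counts` are only placeholders for this sketch.

```python
from collections import Counter

# Cluster sizes produced by the small example above (hard-coded here).
cluster_sizes = [6, 4, 1, 1, 2, 2]

# Map each cluster size to the number of clusters having that size.
size_counts = Counter(cluster_sizes)

# Sort by cluster size for a stable, readable listing.
print(sorted(size_counts.items()))
# [(1, 2), (2, 2), (4, 1), (6, 1)]
```
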
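I am not sure how, but I think there is a more efficient numpy solution, especially for the loop in the section of the code under `# CHECK EACH SPLIT SUB-ARRAY IN RES`. Any help would be appreciated.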
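
To make the goal concrete, here is a rough sketch of what a loop-free version might look like. It follows the same rules described above (a run of `0`'s becomes one cluster of its length plus 1, a run of n `1`'s becomes n - 1 clusters of size 1) and replaces the split-and-loop step with run-length arithmetic and `np.repeat`. The function name `cluster_sizes_vectorized` is hypothetical, the sketch assumes at least two data points, and it has only been checked against the small example here, not timed on a large dataset.

```python
import numpy as np

def cluster_sizes_vectorized(array, t_c):
    """Hypothetical loop-free variant; assumes len(array) >= 2."""
    # 1 where the gap to the next point exceeds the threshold, else 0.
    mask = (np.diff(array) > t_c).astype(int)
    # Positions where the mask value changes mark the start of a new run.
    change = np.flatnonzero(np.diff(mask)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [mask.size]))
    run_len = ends - starts   # length of each run of equal values
    run_val = mask[starts]    # the value (0 or 1) of each run
    # Runs of 0's: one cluster of size (length + 1).
    # Runs of 1's: (length - 1) clusters of size 1.
    sizes = np.where(run_val == 0, run_len + 1, 1)
    repeats = np.where(run_val == 0, 1, run_len - 1)
    return np.repeat(sizes, repeats)

a = np.array([1, 2, 5, 10, 20, 40, 70, 71, 72, 74, 100, 130, 160, 171, 200, 201])
print(cluster_sizes_vectorized(a, 21))
# [6 4 1 1 2 2]
```

The result is a numpy array rather than a list, so it can be passed to `Counter` after `.tolist()`, or counted with `np.unique(..., return_counts=True)`.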