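What is the smartest way to detect curves in a 2D dataset? The data points must be clustered by defining a maximum distance to their neighbours. My goal is to apply the polyfit function to each curve and to use the result as a template for the same datasets.
Example data: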
array([[0., 0., 0., ..., 2020., 2020., 2020.], [51., 76., 194., ..., 1862., 1915., 2021.]])
I figured out that this can be done with agglomerative clustering. Here are the code and the results:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Reshape data into an (n_samples, 2) array of (x, y) points
a = array[:, 0].flatten()
b = array[:, 1].flatten()
array_new1 = np.column_stack([a, b])
# Clustering algorithm: single linkage, cut at a neighbour distance of 15
model = AgglomerativeClustering(n_clusters=None,
                                affinity='euclidean',  # named `metric` in scikit-learn 1.2 and later
                                linkage='single',
                                compute_full_tree=True,
                                distance_threshold=15)
model.fit(array_new1)
labels = model.labels_
n_clusters = len(set(labels))
print(n_clusters)
cmap = plt.get_cmap('rainbow')
colors = [cmap(i) for i in np.linspace(0, 1, n_clusters)]
plt.figure(figsize=(15, 15))
# Labels run from 0 to n_clusters - 1, so enumerate from 0
for i, color in enumerate(colors):
    plt.scatter(array_new1[labels == i, 0], array_new1[labels == i, 1], color=color)
plt.gca().invert_yaxis()
plt.show()
![](https://i.stack.imgur.com/utwqP.png)
import pandas as pd

# Plotting result
data = pd.DataFrame({'x': array_new1[:, 0],
                     'y': array_new1[:, 1],
                     'label': labels})
data = data.sort_values(by='label')
counter = 0
plt.figure(figsize=(15, 15))
plt.scatter(5*array[:, 0], array[:, 1])
for i in range(n_clusters):
    cluster = data.loc[data['label'] == i]
    # Fit only clusters with more than 50 and fewer than 1000 points
    if 50 < len(cluster) < 1000:
        counter += 1
        z = np.polyfit(cluster['x'], cluster['y'], 2)
        p = np.poly1d(z)
        # tasku_sk is not defined in this snippet
        xp = np.linspace(0, tasku_sk, 50)
        #plt.scatter(cluster['x'], cluster['y'])
        plt.plot(5*xp, p(xp), c='r', lw=4)
plt.gca().invert_yaxis()
plt.show()
print(counter)
![](https://i.stack.imgur.com/AQHOf.png)
22
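To make the "template" idea concrete, here is a minimal sketch of how the fitted polynomials could be kept per cluster and evaluated again later. The helper name `fit_templates`, its parameters, and the synthetic points are assumptions made for illustration. Only the size limits (50 and 1000) and the degree (2) come from the code above.

```python
import numpy as np

def fit_templates(points, labels, degree=2, min_size=50, max_size=1000):
    """Hypothetical helper: return {label: np.poly1d} for clusters within the size limits."""
    templates = {}
    for label in np.unique(labels):
        cluster = points[labels == label]
        if min_size < len(cluster) < max_size:
            coeffs = np.polyfit(cluster[:, 0], cluster[:, 1], degree)
            templates[int(label)] = np.poly1d(coeffs)
    return templates

# Synthetic stand-in data: one noiseless parabola (100 points)
# plus a 3-point cluster that is too small to be fitted.
x = np.linspace(0.0, 10.0, 100)
points = np.vstack([np.column_stack([x, 0.5 * x**2 + 3.0]),
                    [[500.0, 1.0], [501.0, 2.0], [502.0, 3.0]]])
labels = np.array([0] * 100 + [1] * 3)

templates = fit_templates(points, labels)
```

Each entry of `templates` is a callable polynomial, so `templates[0](x_new)` evaluates the stored curve at new x positions without refitting.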
Answer 0 (score: 0)
Yes.
This is what is considered to be the oldest of all clustering algorithms: single-link.
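For comparison, a minimal sketch of the same single-link clustering with SciPy's hierarchical clustering routines instead of scikit-learn. The two synthetic curves are an assumption made for illustration, and the threshold of 15 is taken from the question. Cutting the single-link tree at a distance `t` puts two points in the same cluster exactly when they are connected by a chain of neighbours that are each at most `t` apart.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the question's data: two parabolic curves,
# sampled every 2 units along x and shifted vertically by 100 units.
x = np.arange(0.0, 200.0, 2.0)
curve_a = np.column_stack([x, 0.002 * x**2 + 50.0])
curve_b = np.column_stack([x, 0.002 * x**2 + 150.0])
points = np.vstack([curve_a, curve_b])

# Build the single-link merge tree, then cut it at a neighbour distance of 15.
tree = linkage(points, method='single', metric='euclidean')
labels = fcluster(tree, t=15, criterion='distance')

n_found = len(np.unique(labels))
```

Note that `fcluster` numbers its clusters from 1, whereas `AgglomerativeClustering.labels_` starts at 0.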