Hi everyone, I have the following code:
from math import sqrt

array = [(1,'a',10), (2,'a',11), (3,'c',200), (60,'a',12), (70,'t',13), (80,'g',300), (100,'a',305), (220,'c',307), (230,'t',306), (250,'g',302)]

def stat(lst):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum([pair[0] for pair in lst])/n
##    mean2 = sum([pair[2] for pair in lst])/n
    stdev = sqrt((sum(x[0]*x[0] for x in lst) / n) - (mean * mean))
##    stdev2 = sqrt((sum(x[2]*x[2] for x in lst) / n) - (mean2 * mean2))
    return mean, stdev

def parse(lst, n):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue
        ###### add also the distance between lengths
        mean,stdev = stat(cluster)
        if (abs(mean - i[0]) > n * stdev):    # check the "distance"
            yield cluster
            cluster[:] = []    # reset cluster to the empty list
        cluster.append(i)
    yield cluster           # yield the last cluster

for cluster in parse(array, 7):
    print(cluster)
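It clusters my list of tuples (array) by looking at the value i[0]. What I would like to achieve is to further cluster each group by the value i[2] of each tuple.
The current output is: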
[(1, 'a', 10), (2, 'a', 11), (3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13), (80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]
I would like something like:
[(1, 'a', 10), (2, 'a', 11)]
[(3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13)]
[(80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]
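So within each cluster the values of i[0] are close together and the values of i[2] are close together as well. Any ideas how to crack it?
Answer 0 (score: 0)
You can run the parse method a second time on the results of the first run. In this case the result is not exactly what you asked for, but it is very similar: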
from math import sqrt

def stat(lst, index):
    """Calculate mean and std deviation of field `index` from the input list."""
    n = float(len(lst))
    mean = sum([pair[index] for pair in lst])/n
    stdev = sqrt((sum(x[index]*x[index] for x in lst) / n) - (mean * mean))
    return mean, stdev

def parse(lst, n, index):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue
        mean, stdev = stat(cluster, index)
        if (abs(mean - i[index]) > n * stdev):    # check the "distance"
            yield cluster
            cluster[:] = []    # reset cluster to the empty list
        cluster.append(i)
    yield cluster           # yield the last cluster

# array is the list defined in the question
for cluster in parse(array, 7, 0):       # first pass: cluster on i[0]
    for nc in parse(cluster, 3, 2):      # second pass: split each cluster on i[2]
        print(nc)

This prints:
[(1, 'a', 10), (2, 'a', 11)]
[(3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13)]
[(80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306)]
[(250, 'g', 302)]
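The two nested loops can also be wrapped in a small helper that applies any number of passes in order. The following is only a sketch: `parse_nested` and its `passes` argument are hypothetical names that do not appear in the question, and `parse` is slightly adapted here so that it starts a fresh list after each yield (which makes it safe to collect the clusters with `list()`) and clamps a rounding-induced negative variance to zero.

```python
from math import sqrt

array = [(1, 'a', 10), (2, 'a', 11), (3, 'c', 200), (60, 'a', 12),
         (70, 't', 13), (80, 'g', 300), (100, 'a', 305), (220, 'c', 307),
         (230, 't', 306), (250, 'g', 302)]

def stat(lst, index):
    """Mean and population standard deviation of field `index`."""
    n = float(len(lst))
    mean = sum(item[index] for item in lst) / n
    # max() guards against a tiny negative value caused by rounding
    variance = max(sum(item[index] ** 2 for item in lst) / n - mean * mean, 0.0)
    return mean, sqrt(variance)

def parse(lst, n, index):
    """Same clustering rule as above, but each cluster is its own list."""
    cluster = []
    for item in lst:
        if len(cluster) > 1:  # the first two values go in unconditionally
            mean, stdev = stat(cluster, index)
            if abs(mean - item[index]) > n * stdev:
                yield cluster
                cluster = []  # new list, so yielded clusters stay intact
        cluster.append(item)
    if cluster:
        yield cluster

def parse_nested(lst, passes):
    """Hypothetical helper: apply (n, index) passes one inside the other."""
    if not passes:
        yield list(lst)
        return
    (n, index), rest = passes[0], passes[1:]
    for cluster in parse(lst, n, index):
        for sub in parse_nested(cluster, rest):
            yield sub

for cluster in parse_nested(array, [(7, 0), (3, 2)]):
    print(cluster)
```

With the passes `[(7, 0), (3, 2)]` this should produce the same six clusters as the nested loops, and adding a third criterion only means appending another `(n, index)` pair.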
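Answer 1 (score: 0)
First of all, the way you compute the variance is numerically unstable. E(X^2)-E(X)^2 holds mathematically, but it kills numerical precision. In the worst case you get a negative value, and then sqrt fails.
You should really look into numpy, which can compute this properly for you.
Conceptually, have you considered treating your data as a 2-dimensional data space? You could then whiten it and run, for example, k-means or any other vector-based clustering algorithm.
Standard deviation and mean are trivial to generalize to multiple attributes (look up "Mahalanobis distance").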
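To illustrate the precision problem without any dependency, here is a minimal sketch that compares the E(X^2)-E(X)^2 form with a two-pass computation that subtracts the mean first. The function names and the sample data are made up for this illustration.

```python
def naive_variance(values):
    """Population variance as E(X^2) - E(X)^2 (numerically unstable)."""
    n = float(len(values))
    mean = sum(values) / n
    return sum(v * v for v in values) / n - mean * mean

def two_pass_variance(values):
    """Population variance from squared deviations around the mean."""
    n = float(len(values))
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

# Large offset, small spread: the true population variance is 2/3.
data = [1e9 + 1, 1e9 + 2, 1e9 + 3]

print(naive_variance(data))     # loses the true value to rounding
print(two_pass_variance(data))  # close to 0.6667
```

The squares in the naive version are around 1e18, far beyond the range where a double can resolve differences of order 1, so the subtraction cancels almost all significant digits. `numpy.std(values)` (population form, `ddof=0` by default) or `statistics.pstdev` from the standard library avoid writing this by hand.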
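As a rough sketch of that idea for the two numeric fields used in the question, the distance of a new tuple from a cluster can be measured with the Mahalanobis distance instead of a per-field "n standard deviations" test. The function name `mahalanobis_2d` and its defaults (`ix=0`, `iy=2`) are assumptions made for this example, and the 2x2 covariance matrix is inverted by hand to keep it dependency-free.

```python
from math import sqrt

def mahalanobis_2d(cluster, point, ix=0, iy=2):
    """Mahalanobis distance of `point` from `cluster` over fields ix and iy."""
    n = float(len(cluster))
    mx = sum(t[ix] for t in cluster) / n
    my = sum(t[iy] for t in cluster) / n
    # Population covariance matrix S = [[sxx, sxy], [sxy, syy]]
    sxx = sum((t[ix] - mx) ** 2 for t in cluster) / n
    syy = sum((t[iy] - my) ** 2 for t in cluster) / n
    sxy = sum((t[ix] - mx) * (t[iy] - my) for t in cluster) / n
    det = sxx * syy - sxy * sxy
    if det <= 0:
        # e.g. fewer than three points, or all points on one line
        raise ValueError("covariance matrix is singular")
    dx, dy = point[ix] - mx, point[iy] - my
    # d^2 = [dx dy] * inv(S) * [dx dy]^T, with inv(S) written out for 2x2
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(max(d2, 0.0))

cluster = [(0, 'a', 0), (2, 'a', 0), (0, 'a', 2), (2, 'a', 2)]
print(mahalanobis_2d(cluster, (3, 'x', 1)))  # 2.0: two "standard deviations" away
```

A cluster needs at least three points that do not lie on one line before the covariance matrix is invertible, so a clustering loop built on this would have to fall back to a simpler test for very small clusters.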