I am trying to compute the ARI (adjusted Rand index) between two sets of clusters using the following code:
import math

#computes ARI for this type of clustering
def ARI(table,n):
    index = 0
    sum_a = 0
    sum_b = 0
    for i in range(len(table)-1):
        for j in range(len(table)-1):
            sum_a += choose(table[i][len(table)-1],2)
            sum_b += choose(table[len(table)-1][j],2)
            index += choose(table[i][j],2)
    expected_index = (sum_a*sum_b)
    expected_index = expected_index/choose(n,2)
    max_index = (sum_a+sum_b)
    max_index = max_index/2
    return (index - expected_index)/(max_index-expected_index)

#choose to compute rand
def choose(n,r):
    f = math.factorial
    if (n-r)>=0:
        return f(n) // f(r) // f(n-r)
    else:
        return 0
Assuming I have built the contingency table correctly, I still get values outside the range (-1, 1).
For example:
Contingency table:
[1, 0, 0, 0, 0, 0, 0, 1]
[1, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 1, 0, 0, 0, 1]
[0, 1, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 1, 1, 2]
[1, 0, 1, 0, 1, 0, 0, 3]
[0, 0, 0, 0, 0, 0, 1, 1]
[3, 1, 1, 1, 1, 1, 2, 0]
Running the code yields an ARI of -1.6470588235294115.
Is there an error in this code?
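For reference, the function above follows the shape of the usual adjusted Rand index formula. The notation here is an assumption and does not appear in the code: n_ij stands for the cell counts, a_i for the row sums, b_j for the column sums and n for the total number of items; the code's index, sum_a and sum_b correspond to the three sums of binomials.

```latex
\mathrm{ARI} =
\frac{\sum_{ij}\binom{n_{ij}}{2}
      - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\Big/\binom{n}{2}}
     {\tfrac{1}{2}\left[\sum_i\binom{a_i}{2}+\sum_j\binom{b_j}{2}\right]
      - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\Big/\binom{n}{2}}
```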
This is also how I compute the contingency matrix:
table = [[0 for _ in range(len(subjects)+1)]for _ in range(len(subjects)+1)]
#comparing all clusters
for i in range(len(clusters)):
    index_count = 0
    for subject, orgininsts in orig_clusters.items():
        madeinsts = clusters[i].instances
        intersect_count = 0
        #comparing all instances between the 2 clusters
        for orginst in orgininsts:
            for madeinst in madeinsts:
                if orginst == madeinst:
                    intersect_count += 1
        table[index_count][i] = intersect_count
        index_count += 1
#fill in the margins: row sums in the last column, column sums in the last row
for i in range(len(table)-1):
    a = 0
    b = 0
    for j in range(len(table)-1):
        a += table[i][j]
        b += table[j][i]
    table[i][len(table)-1] = a
    table[len(table)-1][i] = b
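clusters is a list of cluster objects with an attribute instances, which is the list of instances contained in that cluster. orig_clusters is a dictionary whose keys are cluster labels and whose values are the lists of instances contained in each cluster. Is there an error in this code?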
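The nested loops above count, for every pair of clusters, how many instances they share. A more compact sketch of the same idea follows, assuming that instances are hashable and unique within a cluster, and that each produced cluster can be given as a plain list. The names build_contingency, orig and made are hypothetical. Unlike the table shown earlier, this sketch puts the grand total in the bottom-right corner instead of leaving it at 0.

```python
def build_contingency(orig, made):
    """Return the contingency table with row/column sums appended.

    orig: dict mapping a label to the list of instances in that cluster
    made: list of lists, one list of instances per produced cluster
    (both layouts are assumptions made for this sketch)
    """
    made_sets = [set(m) for m in made]
    # one row per original cluster, one column per produced cluster
    table = [[len(set(insts) & m) for m in made_sets]
             for insts in orig.values()]
    # append the row sums as an extra column
    for row in table:
        row.append(sum(row))
    # append the column sums as an extra row; the corner holds the grand total
    table.append([sum(col) for col in zip(*table)])
    return table


orig = {"a": [1, 2, 3], "b": [4, 5]}
made = [[1, 2], [3, 4, 5]]
print(build_contingency(orig, made))  # [[2, 1, 3], [0, 2, 2], [2, 3, 5]]
```

Using set intersections avoids the quadratic instance-by-instance comparison, at the cost of ignoring duplicate instances.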
Answer 0 (score: 0)
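You make a few mistakes when computing the ARI in your code: a and b are counted too often, because you add to them inside the loop over every cell of the table rather than once per row and once per column.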
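To make the over-counting concrete, here is a small sketch using the row sums of the table from the question (1, 1, 1, 1, 2, 3, 1). The variable names are illustrative only. Adding the row term inside the inner loop inflates the sum by the number of columns:

```python
from math import comb  # available in Python 3.8+

row_sums = [1, 1, 1, 1, 2, 3, 1]  # last column of the question's table
n_cols = 7

# accumulated once per row, as intended
once = sum(comb(a, 2) for a in row_sums)

# accumulated inside the inner loop, i.e. once per (row, column) pair
repeated = sum(comb(a, 2) for a in row_sums for _ in range(n_cols))

print(once, repeated)  # 4 28
```

The same factor of 7 applies to the column term, which is why the expected index ends up far larger than the maximum index and the ratio leaves the expected range.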
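Also, you pass n as a parameter, but apparently it is set to 10 (that is how I reproduced your result). It would be easier to pass only the table and compute n from it. I modified your code: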
def ARI(table):
    # every cell of table is treated as a count; row and column sums are computed here
    index = 0
    sum_a = 0
    sum_b = 0
    n = sum([sum(subrow) for subrow in table]) #all items summed
    for i in range(len(table)):
        b_row = 0#this is to hold the col sums
        for j in range(len(table)):
            index += choose(table[i][j], 2)
            b_row += table[j][i]
        #outside of j-loop b.c. we want to use a=rowsums, b=colsums
        sum_a += choose(sum(table[i]), 2)
        sum_b += choose(b_row, 2)
    expected_index = (sum_a*sum_b)
    expected_index = expected_index/choose(n,2)
    max_index = (sum_a+sum_b)
    max_index = max_index/2
    return (index - expected_index)/(max_index-expected_index)
Or, if the table you pass already carries the row and column sums in its last column and last row:
def ARI(table):
    index = 0
    sum_a = 0
    sum_b = 0
    # n = total number of items: the column sums in the last row, corner cell excluded
    # (adding the row sums as well would count every item twice)
    n = sum(table[len(table)-1][j] for j in range(len(table)-1))
    for i in range(len(table)-1):
        # margins are read once per row/column, outside the j-loop
        sum_a += choose(table[i][len(table)-1],2)
        sum_b += choose(table[len(table)-1][i],2)
        for j in range(len(table)-1):
            index += choose(table[i][j],2)
    expected_index = (sum_a*sum_b)
    expected_index = expected_index/choose(n,2)
    max_index = (sum_a+sum_b)
    max_index = max_index/2
    return (index - expected_index)/(max_index-expected_index)
Then:
import math

def choose(n,r):
    f = math.factorial
    if (n-r)>=0:
        return f(n) // f(r) // f(n-r)
    else:
        return 0

table = [[1, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1, 1, 2],
[1, 0, 1, 0, 1, 0, 0, 3],
[0, 0, 0, 0, 0, 0, 1, 1],
[3, 1, 1, 1, 1, 1, 2, 0]]
ARI(table)
Out[56]: -0.0604008667388949
This value lies inside the expected range.
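As a cross-check, here is an independent sketch that takes only the cell counts and derives the margins itself. The helper ari_from_counts is hypothetical and not part of the code above. Assuming the last row and column of the table are the margins, stripping them gives n = 10 and an ARI of about -0.0976. The -0.0604 figure is what comes out when the full 8x8 table, margins included, is treated as raw counts, so it is worth checking which layout a given function expects.

```python
from math import comb  # available in Python 3.8+


def ari_from_counts(counts):
    """Adjusted Rand index from a contingency table without margins."""
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    n = sum(row_sums)
    index = sum(comb(c, 2) for row in counts for c in row)
    sum_a = sum(comb(a, 2) for a in row_sums)
    sum_b = sum(comb(b, 2) for b in col_sums)
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)


table = [[1, 0, 0, 0, 0, 0, 0, 1],
         [1, 0, 0, 0, 0, 0, 0, 1],
         [0, 0, 0, 1, 0, 0, 0, 1],
         [0, 1, 0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1, 1, 2],
         [1, 0, 1, 0, 1, 0, 0, 3],
         [0, 0, 0, 0, 0, 0, 1, 1],
         [3, 1, 1, 1, 1, 1, 2, 0]]

# strip the last row and column (assumed to be the margins): 7x7, n = 10
inner = [row[:-1] for row in table[:-1]]

print(round(ari_from_counts(inner), 4))  # -0.0976
print(round(ari_from_counts(table), 4))  # -0.0604
```

The sketch does not guard against a zero denominator, which occurs for degenerate tables such as a single cluster on both sides.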