Question

I am trying to compute the ARI (Adjusted Rand Index) between two sets of clusters using the following code:

import math

#computes ARI for this type of clustering
def ARI(table,n):
    index = 0
    sum_a = 0
    sum_b = 0
    for i in range(len(table)-1):
        for j in range(len(table)-1):
            sum_a += choose(table[i][len(table)-1],2)
            sum_b += choose(table[len(table)-1][j],2)
            index += choose(table[i][j],2)

    expected_index = (sum_a*sum_b)
    expected_index = expected_index/choose(n,2)
    max_index = (sum_a+sum_b)
    max_index = max_index/2

    return (index - expected_index)/(max_index-expected_index)


#choose to compute rand
def choose(n,r):
    f = math.factorial
    if (n-r)>=0:
        return f(n) // f(r) // f(n-r)
    else:
        return 0

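Assuming I have built the contingency table correctly, I still get values outside the range (-1, 1).

For example:

Contingency table (the last column holds the row sums and the last row holds the column sums):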

[1, 0, 0, 0, 0, 0, 0, 1]
[1, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 1, 0, 0, 0, 1]
[0, 1, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 1, 1, 2]
[1, 0, 1, 0, 1, 0, 0, 3]
[0, 0, 0, 0, 0, 0, 1, 1]
[3, 1, 1, 1, 1, 1, 2, 0]

Running the code on this table produces an ARI of -1.6470588235294115. Is there an error in this code?

This is also how I compute the contingency table:

table = [[0 for _ in range(len(subjects)+1)]for _ in range(len(subjects)+1)]
#comparing all clusters
for i in range(len(clusters)):
    index_count = 0
    for subject, orgininsts in orig_clusters.items():
        madeinsts = clusters[i].instances
        intersect_count = 0
        #comparing all instances between the 2 clusters
        for orginst in orgininsts:
            for madeinst in madeinsts:
                if orginst == madeinst:
                    intersect_count += 1

        table[index_count][i] = intersect_count
        index_count += 1


for i in range(len(table)-1):
    a = 0
    b = 0
    for j in range(len(table)-1):
        a += table[i][j]
        b += table[j][i]

    table[i][len(table)-1] = a
    table[len(table)-1][i] = b

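clusters is a list of cluster objects, each with an instances attribute that holds the list of instances in that cluster. orig_clusters is a dictionary whose keys are cluster labels and whose values are the lists of instances in each cluster. Is there an error in this code?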

Answer 1

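There are a few mistakes in how your code computes the ARI. The main one is that you accumulate a and b too many times: the terms are added inside the nested loop over the table, so each row sum and column sum is counted once per inner iteration instead of exactly once.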
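To make the over-count concrete, here is a small sketch using the row sums from the question's table (its last column). It uses math.comb (Python 3.8+) in place of the question's choose helper, which is a substitution for brevity, and the names sum_a_once and sum_a_nested are purely illustrative.

```python
from math import comb

# Row sums of the 7x7 table from the question (the last column of the posted table).
row_sums = [1, 1, 1, 1, 2, 3, 1]
k = len(row_sums)

# Intended behaviour: each row sum contributes C(a_i, 2) exactly once.
sum_a_once = sum(comb(a, 2) for a in row_sums)

# What the question's nested loop does: the same term is added
# once per inner-loop iteration, i.e. k times in total.
sum_a_nested = 0
for i in range(k):
    for j in range(k):
        sum_a_nested += comb(row_sums[i], 2)

print(sum_a_once, sum_a_nested)  # prints: 4 28
```

The nested version is inflated by a factor of k = 7, and the same happens to sum_b, which is what pushes the result outside the expected range.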

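Also, you pass n as a parameter, and apparently it was set to 10 (that is how I reproduced your result). It is easier to pass only the table and compute n from it. I modified your code: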

def ARI(table):
    index = 0
    sum_a = 0
    sum_b = 0
    n = sum([sum(subrow) for subrow in table]) #all items summed

    for i in range(len(table)):
        b_row = 0#this is to hold the col sums
        for j in range(len(table)):
            index += choose(table[i][j], 2)
            b_row += table[j][i]
        #outside of j-loop b.c. we want to use a=rowsums, b=colsums
        sum_a += choose(sum(table[i]), 2)
        sum_b += choose(b_row, 2)

    expected_index = (sum_a*sum_b)
    expected_index = expected_index/choose(n,2)
    max_index = (sum_a+sum_b)
    max_index = max_index/2

    return (index - expected_index)/(max_index-expected_index)
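This version expects a square table of raw counts, without a marginal row or column. A minimal sketch of building such a table from the structures described in the question is shown below. It assumes the instances are hashable and unique within a cluster (so that set intersection matches the pairwise comparison), and it uses SimpleNamespace as a hypothetical stand-in for the question's cluster objects; the sample data is made up for illustration.

```python
from types import SimpleNamespace

def contingency_table(orig_clusters, clusters):
    """Rows follow orig_clusters, columns follow clusters; no marginal sums."""
    made_sets = [set(c.instances) for c in clusters]
    return [
        [len(set(orig_insts) & made) for made in made_sets]
        for orig_insts in orig_clusters.values()
    ]

# Hypothetical example data, not taken from the question.
orig_clusters = {"s1": ["a", "b"], "s2": ["c", "d", "e"]}
clusters = [
    SimpleNamespace(instances=["a", "c"]),
    SimpleNamespace(instances=["b", "d", "e"]),
]

table = contingency_table(orig_clusters, clusters)
print(table)  # prints: [[1, 1], [1, 2]]
```

Keeping the marginal sums out of the table avoids having to remember which cells are counts and which are totals.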

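Or, if you pass the table with the row and column sums included (as the last column and the last row):

def ARI(table):
    index = 0
    sum_a = 0
    sum_b = 0
    # the last row holds the column sums, which add up to n;
    # the corner cell is excluded so it is never counted
    n = sum(table[len(table)-1][:len(table)-1])
    for i in range(len(table)-1):
        # row and column sums are each used exactly once
        sum_a += choose(table[i][len(table)-1],2)
        sum_b += choose(table[len(table)-1][i],2)
        for j in range(len(table)-1):
            index += choose(table[i][j],2)

    expected_index = (sum_a*sum_b)
    expected_index = expected_index/choose(n,2)
    max_index = (sum_a+sum_b)
    max_index = max_index/2

    return (index - expected_index)/(max_index-expected_index)

Then, with

import math

def choose(n,r):
    f = math.factorial
    if (n-r)>=0:
        return f(n) // f(r) // f(n-r)
    else:
        return 0

table = [[1, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1, 1, 2],
[1, 0, 1, 0, 1, 0, 0, 3],
[0, 0, 0, 0, 0, 0, 1, 1],
[3, 1, 1, 1, 1, 1, 2, 0]]

ARI(table)
# approximately -0.0976 (exactly -4/41)

This is the correct result: here n = 10, index = 0 and sum_a = sum_b = 4, so the ARI is (0 - 16/45)/(4 - 16/45) = -4/41. Note that the version without marginal sums must be given the 7x7 counts only. If this 8x8 table, margins included, is passed to it, the margins are treated as data and it returns -0.0604008667388949, which is not the ARI.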
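As an independent check of the formula, the sketch below computes the ARI for the 7x7 counts of the question's table (margins stripped) in two ways: from the contingency table, and by brute-force counting of item pairs. The function names are illustrative, math.comb (Python 3.8+) replaces the choose helper, and neither function guards against the degenerate case where max_index equals expected_index.

```python
from itertools import combinations
from math import comb

def ari_from_table(table):
    """ARI from a contingency table WITHOUT marginal row/column."""
    n = sum(map(sum, table))
    index = sum(comb(x, 2) for row in table for x in row)
    sum_a = sum(comb(sum(row), 2) for row in table)        # row sums
    sum_b = sum(comb(sum(col), 2) for col in zip(*table))  # column sums
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

def labels_from_table(table):
    """Expand a contingency table into two parallel label lists."""
    u, v = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            u += [i] * count
            v += [j] * count
    return u, v

def ari_by_pairs(u, v):
    """Same index, obtained by counting pairs grouped together directly."""
    same_u = same_v = same_both = 0
    for p, q in combinations(range(len(u)), 2):
        su, sv = u[p] == u[q], v[p] == v[q]
        same_u += su
        same_v += sv
        same_both += su and sv
    expected = same_u * same_v / comb(len(u), 2)
    max_index = (same_u + same_v) / 2
    return (same_both - expected) / (max_index - expected)

counts = [
    [1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
]

from_table = ari_from_table(counts)
from_pairs = ari_by_pairs(*labels_from_table(counts))
print(from_table, from_pairs)
```

Both approaches agree on -4/41 (about -0.0976) for these counts, which is inside the expected range.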
