Clustering a graph defined by Hamming distances (Stanford Algorithms - 2)

Time: 2016-04-16 21:18:03

Tags: algorithm graph

The problem statement is:

In this question your task is again to run the clustering algorithm from lecture, 
but on a MUCH bigger graph. 
So big, in fact, that the distances (i.e., edge costs) are only defined implicitly,
rather than being provided as an explicit list.
The data set is here. The format is:
[# of nodes] [# of bits for each node's label]
[first bit of node 1] ... [last bit of node 1]
[first bit of node 2] ... [last bit of node 2]

For example, the third line of the file 
"0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" 
denotes the 24 bits associated with node #2.

The distance between two nodes u and v in this problem is defined as the Hamming 
distance--- the number of differing bits --- between the two nodes' labels. For 
example, the Hamming distance between the 24-bit label of node #2 above and the 
label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they 
differ in the 3rd, 7th, and 21st bits).

The question is: what is the largest value of k such that there is a k-clustering 
with spacing at least 3? That is, how many clusters are needed to ensure that no 
pair of nodes with all but 2 bits in common get split into different clusters?

NOTE: The graph implicitly defined by the data file is so big that you probably 
can't write it out explicitly, let alone sort the edges by cost. So you will have 
to be a little creative to complete this part of the question. 
For example, is there some way you can identify the smallest distances without 
explicitly looking at every pair of nodes?

The data set can be downloaded here.
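The distance defined in the problem statement can be sketched in a few lines. This is only an illustration: the helper names `parse_label` and `hamming` are made up here, and packing each label into a Python integer is an assumption about representation, not something the assignment prescribes.

```python
def parse_label(line):
    """Pack a line of space-separated bits (e.g. "0 1 1 0") into an integer."""
    return int("".join(line.split()), 2)


def hamming(u, v):
    """Number of bit positions in which the two labels differ."""
    return bin(u ^ v).count("1")


# The two 24-bit labels quoted in the problem statement.
node2 = parse_label("0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1")
other = parse_label("0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1")
print(hamming(node2, other))  # 3, matching the example in the problem statement
```

XOR leaves a 1 exactly where the two labels disagree, so counting the 1 bits of `u ^ v` gives the distance.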

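The challenge here is to build the graph faster than O(n^2). The graph has 200,000 nodes, so I can't go on computing the Hamming distance for every edge. Also, since 24 bits are used to represent the labels, this would add 2^24 ≈ 16 million edges to my graph, which is infeasible.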
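One common reading of the hint in the problem statement, shown here as a sketch and not as the approach taken in this question, is that only distances 1 and 2 matter, so each node has a fixed number of candidate neighbours: flip one bit or flip two bits, then look the result up in a set. The function names `flip_masks` and `close_pairs` are hypothetical.

```python
from itertools import combinations


def flip_masks(bits):
    """XOR masks that flip exactly one or exactly two of `bits` positions."""
    one = [1 << i for i in range(bits)]
    two = [(1 << i) | (1 << j) for i, j in combinations(range(bits), 2)]
    return one + two


def close_pairs(labels, bits):
    """(distance, u, v) for every pair of distinct labels at distance 1 or 2."""
    present = set(labels)  # duplicate labels (distance 0) collapse here
    masks = flip_masks(bits)
    pairs = []
    for u in present:
        for mask in masks:
            v = u ^ mask
            if u < v and v in present:  # u < v reports each pair once
                pairs.append((bin(mask).count("1"), u, v))
    return pairs


print(len(flip_masks(24)))  # 24 + C(24, 2) = 300 candidates per node
```

With 300 candidates per node, 200,000 nodes need about 60 million set lookups, compared with roughly 2 × 10^10 pairwise comparisons.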

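My idea is to convert the binary labels into integers and sort them (O(n lg n) time). Then, for each vertex represented by an integer, I create an edge between the current number and the next number, on the assumption that the farther apart two numbers are, the larger their Hamming distance will be.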

A simplified example:

000 Let this be node A
001 
010
011  
100 Node B
101
110
111 Node C

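Now, the Hamming distance between A and B is 1, between B and C is 2, and between A and C is 3. I know there are more subtleties here, but Hamming distance(A, C) >= Hamming distance(A, B) or Hamming distance(B, C) would always hold.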
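The assumption that numeric closeness tracks Hamming distance can be checked on a few 3-bit labels. This is a small sketch with an illustrative `hamming` helper; the four labels are chosen for the demonstration and are not taken from the data set.

```python
def hamming(u, v):
    """Number of differing bits between two integer labels."""
    return bin(u ^ v).count("1")


labels = sorted([0b000, 0b011, 0b100, 0b111])

# Distances between neighbours in sorted order.
for a, b in zip(labels, labels[1:]):
    print(f"{a:03b} -> {b:03b}: {hamming(a, b)}")

# Distances between labels that are NOT adjacent in sorted order.
print(hamming(0b000, 0b100))  # 1
print(hamming(0b011, 0b111))  # 1
```

Here the sorted neighbours 011 and 100 differ in all three bits, while both distance-1 pairs (000, 100) and (011, 111) are not adjacent after sorting, so linking only sorted neighbours would miss them.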

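This way, I can set up the graph in linear time, picturing it as a straight line with the nodes laid out along it. Later, I can cluster the nodes with disjoint-set trees (union-find) and find the number of clusters asked for in the question.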
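The union-find step mentioned above could look roughly like the following sketch. The class `DisjointSet`, the function `count_clusters`, and the `(cost, u, v)` edge format are assumptions made for this illustration; they are not taken from the linked code.

```python
class DisjointSet:
    """Union-find with union by size and path halving."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.count = n  # number of disjoint sets remaining

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]
        self.count -= 1
        return True


def count_clusters(n, edges, min_spacing=3):
    """Merge every edge cheaper than min_spacing; return the clusters left."""
    ds = DisjointSet(n)
    for cost, u, v in edges:
        if cost < min_spacing:
            ds.union(u, v)
    return ds.count


print(count_clusters(5, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (5, 3, 4)]))  # 3
```

Merging every pair at distance 0, 1, or 2 and counting the remaining sets gives the largest k with spacing at least 3, provided the edge list really contains all such pairs.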

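A test case in the forum says that for the first 1000 nodes in this file, the number of clusters is 989, but my program tells me it is 999. Also, graphInfo() tells me there are 0 edges with weight 0, 1 edge with weight 1, and 0 edges with weight 2, whereas the actual result is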

Edges with cost zero: 0
Edges with cost one: 2 
Edges with cost two: 9

The code is fairly long, so please use this link to look at it. I can't figure out whether my code or my algorithm is wrong.

1 answer:

Answer 0: (score: 0)

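I haven't looked at your algorithm, but I note that it is relatively easy to run https://en.wikipedia.org/wiki/Prim%27s_algorithm in O(n^2) time but only O(n) space. If you look at a more detailed version of the pseudocode, you can see that for each node you only need to keep the cheapest cost of any edge linking the tree currently being grown to that node, plus the identity of that edge (or, equivalently, the node in the growing tree that it would connect to).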

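At each stage, you find the cheapest link from the growing tree to a node not yet in it, and then check the links from that new node to see whether they provide a cheaper way from the growing tree to the nodes that are still outside it.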

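Is it possible that you can afford O(n^2) time but not O(n^2) space? If there are ties, the tree you get from Prim may differ from the one you get from Kruskal, but it will still be a minimum spanning tree.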
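The O(n^2)-time, O(n)-space variant of Prim described in this answer can be sketched as follows. The function names are hypothetical, labels are assumed to be packed into integers, and the per-node "identity of the edge" is omitted because only the edge costs are needed to count clusters.

```python
def hamming(u, v):
    """Number of differing bits between two integer labels."""
    return bin(u ^ v).count("1")


def prim_mst_costs(labels):
    """Costs of the MST edges of the implicit complete graph on `labels`."""
    n = len(labels)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from the tree to each node
    best[0] = 0                # start growing the tree from node 0
    costs = []
    for _ in range(n):
        # Cheapest link from the growing tree to a node not yet in it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        costs.append(best[u])
        # See whether the new node offers cheaper links to outside nodes.
        for v in range(n):
            if not in_tree[v]:
                d = hamming(labels[u], labels[v])
                if d < best[v]:
                    best[v] = d
    return costs[1:]  # drop the dummy cost 0 of the start node


def clusters_with_spacing(labels, min_spacing=3):
    """Largest k such that a k-clustering has spacing >= min_spacing."""
    if not labels:
        return 0
    return 1 + sum(1 for c in prim_mst_costs(labels) if c >= min_spacing)


print(clusters_with_spacing([0b000000, 0b000001, 0b000011, 0b111000, 0b111100]))  # 2
```

Cutting every MST edge of cost 3 or more leaves exactly the clusters that the lecture's clustering algorithm would stop at, so the count is one plus the number of such edges. At n = 200,000 this performs on the order of 10^10 distance evaluations, which is likely to be very slow in pure Python.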