I'm taking a Coursera course, and this is my assignment:
In this question your task is again to run the clustering algorithm from
lecture, but on a MUCH bigger graph. So big, in fact, that the distances
(i.e., edge costs) are only defined implicitly, rather than being provided
as an explicit list.
The data set is below.
clustering_big.txt
The format is:
[# of nodes] [# of bits for each node's label]
[first bit of node 1] ... [last bit of node 1]
[first bit of node 2] ... [last bit of node 2]
...
For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0
1 0 1 1 0 1" denotes the 24 bits associated with node #2.
The distance between two nodes u and v in this problem is defined as the
Hamming distance--- the number of differing bits --- between the two nodes'
labels. For example, the Hamming distance between the 24-bit label of node
#2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is
3 (since they differ in the 3rd, 7th, and 21st bits).
The question is: what is the largest value of k such that there is a k-
clustering with spacing at least 3? That is, how many clusters are needed to
ensure that no pair of nodes with all but 2 bits in common get split into
different clusters?
NOTE: The graph implicitly defined by the data file is so big that you
probably can't write it out explicitly, let alone sort the edges by cost. So
you will have to be a little creative to complete this part of the question.
For example, is there some way you can identify the smallest distances
without explicitly looking at every pair of nodes?
The link to the data set is here.
Here is how I approached the problem:
For each vertex, generate and store all Hamming distances that are 0, 1 and
2 units apart. There is only 1 code point that is 0 units apart (which is
the same code as the vertex), 24C1 = 24 possible code points that are 1 unit
apart and there are 24C2 = 276 possible code points that are 2 units apart
for each vertex.
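The enumeration described above can be sketched with bitmasks instead of strings (a minimal sketch; the `neighbors` helper name is illustrative, and 24-bit labels are assumed):

```java
import java.util.ArrayList;
import java.util.List;

public class NeighborMasks {
    // All bitmasks at Hamming distance 1 or 2 from `code`:
    // 24 single-bit flips plus C(24,2) = 276 double-bit flips,
    // i.e. 300 neighbors per vertex.
    static List<Integer> neighbors(int code) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < 24; i++) {
            out.add(code ^ (1 << i));                // distance 1
            for (int j = i + 1; j < 24; j++) {
                out.add(code ^ (1 << i) ^ (1 << j)); // distance 2
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(neighbors(0).size()); // 300
    }
}
```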
Now, put all vertexes along with their assigned code into a hash table. Use
the code as the hash table key, with the vertex number as the value - note
that some codes are not unique (i.e. more than one vertex can be associated
with the same code), so each key in the hash table will have to potentially
hold more than one vertex - we will use this hash table later to look up the
vertex number(s) given the corresponding Hamming code in O(1) time.
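Building that lookup table might look like the following sketch, with labels packed into `int`s (the `index` helper and the sample labels are illustrative, not from the assignment data):

```java
import java.util.*;

public class LabelIndex {
    // Map each label to the list of vertex ids carrying it; duplicate
    // labels land in the same bucket, so a lookup returns every match.
    static Map<Integer, List<Integer>> index(int[] labels) {
        Map<Integer, List<Integer>> byLabel = new HashMap<>();
        for (int v = 0; v < labels.length; v++) {
            byLabel.computeIfAbsent(labels[v], k -> new ArrayList<>()).add(v);
        }
        return byLabel;
    }

    public static void main(String[] args) {
        // Illustrative 4-bit labels; the real input has 24-bit labels.
        int[] labels = {0b0110, 0b0110, 0b1010};
        System.out.println(index(labels).get(0b0110)); // [0, 1]
    }
}
```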
Then execute the following steps:

For each vertex (200K iterations):
    For each code that is 0 units apart from this vertex
    (1 iteration - the only such code is the vertex's own code):
        - Use the code to index into the hash table and get the
          corresponding vertexes if they exist.
        - Add these 2 vertexes to a cluster.

For each vertex (200K iterations):
    For each code that is 1 unit apart from this vertex (24 iterations):
        - Use the code to index into the hash table and get the
          corresponding vertexes if they exist.
        - Add these 2 vertexes to a cluster.

For each vertex (200K iterations):
    For each code that is 2 units apart from this vertex (276 iterations):
        - Use the code to index into the hash table and get the
          corresponding vertexes if they exist.
        - Add these 2 vertexes to a cluster.

You are now left with clusters that are at least 3 units apart.
In the first loop we are essentially clustering all vertexes that are a distance of 0 units apart; in the second and third loops we are clustering vertexes that are 1 unit and 2 units apart, respectively (this is similar to sorting by edge weights and then combining the vertexes into clusters). The above code can be made much more compact - I have split up the three main loops for readability.
The time complexity of the above is 200k + (200k * 24) + (200k * 276) = 200k * 301 = O(301n) iterations; in addition, each union has to fix up the leader pointers of the smaller cluster, which gives a final complexity of O(301n log n). The space complexity is about O(301n).
Here is my implementation of this approach:
import java.io.IOException;
import java.io.File;
import java.util.*;

public class bits_k_clustering {

    public static class Vertex {
        private int leader;

        public Vertex(int u_leader) {
            leader = u_leader;
        }

        public void UpdateLeader(int newLeader) {
            leader = newLeader;
        }
    }

    public static class UnionFind {
        private ArrayList<Vertex> vertices;
        private int clustersCnt;
        private ArrayList<Integer> clustersSize;

        public UnionFind(ArrayList<Vertex> u_vertices) {
            vertices = u_vertices;
            clustersCnt = vertices.size();
            clustersSize = new ArrayList<Integer>();
            for (int i = 0; i < 200000; i++) {
                clustersSize.add(1);
            }
        }

        // Merge the clusters containing x and y; members of the smaller
        // cluster are re-pointed at the larger cluster's leader.
        public void Union(Vertex x, Vertex y) {
            int leader1 = x.leader;
            int leader2 = y.leader;
            if (leader1 != leader2) {
                clustersCnt -= 1;
                if (clustersSize.get(leader1) > clustersSize.get(leader2)) {
                    clustersSize.set(leader1, clustersSize.get(leader1) + clustersSize.get(leader2));
                    for (int i = 0; i < 200000; i++) {
                        if (vertices.get(i).leader == leader2) {
                            vertices.get(i).UpdateLeader(leader1);
                        }
                    }
                    clustersSize.set(leader2, 0);
                } else {
                    clustersSize.set(leader2, clustersSize.get(leader1) + clustersSize.get(leader2));
                    for (int i = 0; i < 200000; i++) {
                        if (vertices.get(i).leader == leader1) {
                            vertices.get(i).UpdateLeader(leader2);
                        }
                    }
                    clustersSize.set(leader1, 0);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Read the 200,000 labels; each line's bits are concatenated
        // into a single 24-character string.
        String[] bits = new String[200000];
        ArrayList<Vertex> vertices = new ArrayList<Vertex>();
        try {
            File file = new File("C:/Users/dadi/Desktop/K-Clustering.txt");
            Scanner in = new Scanner(file);
            in.nextLine();
            for (int i = 0; i < 200000; i++) {
                String[] strarray = in.nextLine().split(" ");
                String mayhew = "";
                for (int j = 0; j < strarray.length; j++) {
                    mayhew += strarray[j];
                }
                bits[i] = mayhew;
                vertices.add(new Vertex(i));
            }
            in.close();
            System.out.println("Buildup done, dists zero complete!");
        } catch (IOException ex) {
            ex.printStackTrace();
        }

        // All labels at Hamming distance 1 from each vertex: flip one bit.
        String[][] dist1 = new String[200000][24];
        for (int i = 0; i < 200000; i++) {
            for (int j = 0; j < 24; j++) {
                char[] x = bits[i].toCharArray();
                x[j] = (x[j] == '0') ? '1' : '0';
                dist1[i][j] = String.valueOf(x);
            }
        }
        System.out.println("Dists one complete!");

        // All labels at Hamming distance 2: flip every pair of bits.
        String[][] dist2 = new String[200000][276];
        for (int i = 0; i < 200000; i++) {
            int z = 0;
            for (int j = 0; j < 23; j++) {
                for (int k = j + 1; k < 24; k++) {
                    char[] x = bits[i].toCharArray();
                    x[j] = (x[j] == '0') ? '1' : '0';
                    x[k] = (x[k] == '0') ? '1' : '0';
                    dist2[i][z] = String.valueOf(x);
                    z++;
                }
            }
        }
        System.out.println("Dists two complete!");

        // Hash table mapping each label to the vertexes that carry it.
        Map<String, ArrayList<Integer>> nodes = new HashMap<String, ArrayList<Integer>>();
        for (int i = 0; i < 200000; i++) {
            ArrayList<Integer> xef = nodes.get(bits[i]);
            if (xef == null) {
                xef = new ArrayList<Integer>();
                nodes.put(bits[i], xef);
            }
            xef.add(i);
        }
        System.out.println("Map complete!");

        UnionFind uf = new UnionFind(vertices);

        // Spacing 0: merge vertexes that share an identical label.
        for (int i = 0; i < 200000; i++) {
            ArrayList<Integer> def = nodes.get(bits[i]);
            if (def.size() > 1) {
                for (int j = 0; j < def.size(); j++) {
                    if (i == def.get(j)) {
                        continue;
                    }
                    uf.Union(vertices.get(i), vertices.get(def.get(j)));
                }
            }
        }
        System.out.println("Spacing zero clustered!");

        // Spacing 1: merge vertexes whose labels differ in one bit.
        for (int i = 0; i < 200000; i++) {
            for (int j = 0; j < 24; j++) {
                ArrayList<Integer> nef = nodes.get(dist1[i][j]);
                if (nef != null) {
                    for (int k = 0; k < nef.size(); k++) {
                        uf.Union(vertices.get(i), vertices.get(nef.get(k)));
                    }
                }
            }
        }
        System.out.println("Spacing one clustered!");

        // Spacing 2: merge vertexes whose labels differ in two bits.
        for (int i = 0; i < 200000; i++) {
            for (int j = 0; j < 276; j++) {
                ArrayList<Integer> oef = nodes.get(dist2[i][j]);
                if (oef != null) {
                    for (int k = 0; k < oef.size(); k++) {
                        uf.Union(vertices.get(i), vertices.get(oef.get(k)));
                    }
                }
            }
        }
        System.out.println("Spacing two clustered!");

        System.out.println(uf.clustersCnt);
    }
}
However, when the program runs, it makes it smoothly into the "Dists two" stage, at which point it slows down considerably until it finally throws this error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.toCharArray(Unknown Source)
at bits_k_clustering.main(bits_k_clustering.java:137)
I have already changed the Eclipse.ini setting to -Xmx1024m for the maximum heap size, but it still throws this error. I don't understand why it throws it (the heap should have enough memory for these matrices by default). Why does this error occur, and how can I fix it?
Answer 0 (score 0):
toCharArray copies the String into a new char[] every time, which is completely unnecessary. Don't use String to store bits or numbers. If you have fewer than 32 bits, use int. If you have fewer than 64 bits, use long. If you have more, use long[].
Try some optimizations based on bit operations. For example, you can compute the Hamming distance with a simple XOR and a bit count. You can also get a cheap lower bound from the number of set bits: if one label has 6 bits set and the other has 2, at least 4 bits must differ.
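A sketch of that suggestion (the `hamming` helper name is mine, not from the answer; it relies on the standard `Integer.bitCount`):

```java
public class HammingDemo {
    // XOR leaves a 1 exactly where the two labels differ;
    // Integer.bitCount counts those 1s.
    static int hamming(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        // Labels differing in bits 0 and 3: distance 2.
        System.out.println(hamming(0b1001, 0b0000)); // 2
        // Cheap lower bound from set-bit counts:
        // 6 set bits vs 2 set bits means at least 4 bits differ.
        System.out.println(Math.abs(Integer.bitCount(0b111111) - Integer.bitCount(0b11))); // 4
    }
}
```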
Avoid ArrayList<Integer> and ArrayList<Vertex>. These need roughly 20 bytes per integer instead of 4 - that's 400% overhead. Use an int[] plus a size counter, doubling the array when it is full (ArrayList does the same internally, but with boxed Integers).
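A minimal sketch of such a growable int[] (class and method names are illustrative):

```java
import java.util.Arrays;

public class IntList {
    private int[] data = new int[4];
    private int size = 0;

    // Append, doubling the backing array when full -- the same growth
    // strategy ArrayList uses, but without boxing each int.
    void add(int x) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = x;
    }

    int get(int i) { return data[i]; }

    int size() { return size; }

    public static void main(String[] args) {
        IntList xs = new IntList();
        for (int i = 0; i < 10; i++) xs.add(i * i);
        System.out.println(xs.size() + " " + xs.get(9)); // 10 81
    }
}
```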
Use a profiler like VisualVM to look at your memory usage.
My guess is that String[][] dist2 = new String[200000][276]; is to blame. 200000 * 276 * ~50 bytes may well be enough to eat all your memory. Get rid of the useless Strings!