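I have a class named FindSimilar that uses MinHash to find the similarity between 2 sets (it works well for that purpose). My problem is that I need to compare more than 2 sets; more specifically, I need to compare a given set1 against an unknown number of other sets. Here is the class: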
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;
public class FindSimilar<T>
{
    // One precomputed pseudo-random value per hash function.
    private int[] hash;
    private int numHash;

    public FindSimilar(int numHash)
    {
        this.numHash = numHash;
        hash = new int[numHash];
        // Fixed seed so that results are reproducible across runs.
        Random r = new Random(11);
        for (int i = 0; i < numHash; i++)
        {
            int a = r.nextInt();
            int b = r.nextInt();
            int c = r.nextInt();
            int x = hash(a * b * c, a, b, c);
            hash[i] = x;
        }
    }
    public double similarity(Set<T> set1, Set<T> set2)
    {
        // Only two sets are compared, so two signature rows are enough.
        int numSets = 2;
        Map<T, boolean[]> bitMap = buildBitMap(set1, set2);
        int[][] minHashValues = initializeHashBuckets(numSets, numHash);
        computeFindSimilarForSet(set1, 0, minHashValues, bitMap);
        computeFindSimilarForSet(set2, 1, minHashValues, bitMap);
        return computeSimilarityFromSignatures(minHashValues, numHash);
    }
    // Every slot starts at Integer.MAX_VALUE so that any real hash value replaces it.
    private static int[][] initializeHashBuckets(int numSets,
            int numHashFunctions)
    {
        int[][] minHashValues = new int[numSets][numHashFunctions];
        for (int i = 0; i < numSets; i++)
        {
            for (int j = 0; j < numHashFunctions; j++)
            {
                minHashValues[i][j] = Integer.MAX_VALUE;
            }
        }
        return minHashValues;
    }

    // The similarity estimate is the fraction of slots in which both signatures agree.
    private static double computeSimilarityFromSignatures(
            int[][] minHashValues, int numHashFunctions)
    {
        int identicalMinHashes = 0;
        for (int i = 0; i < numHashFunctions; i++)
        {
            if (minHashValues[0][i] == minHashValues[1][i])
            {
                identicalMinHashes++;
            }
        }
        return (1.0 * identicalMinHashes) / numHashFunctions;
    }
    private static int hash(int x, int a, int b, int c)
    {
        // The mask keeps the low 17 bits, so the result is always in [0, 131071].
        return (a * (x >> 4) + b * x + c) & 131071;
    }
    private void computeFindSimilarForSet(Set<T> set, int setIndex,
            int[][] minHashValues, Map<T, boolean[]> bitArray)
    {
        for (T element : bitArray.keySet())
        {
            // Only elements that belong to this set contribute to its signature.
            if (!set.contains(element))
            {
                continue;
            }
            int x = (element == null) ? 0 : element.hashCode();
            for (int i = 0; i < numHash; i++)
            {
                // Apply the i-th hash function to the element. Deriving its
                // coefficients from hash[i] and i is an illustrative choice.
                int hindex = hash(x, hash[i], 2 * i + 1, hash[i]);
                if (hindex < minHashValues[setIndex][i])
                {
                    // Keep the smallest hash value seen so far in slot i.
                    minHashValues[setIndex][i] = hindex;
                }
            }
        }
    }
    // Maps every element of the union to a pair of flags: { in set1, in set2 }.
    public Map<T, boolean[]> buildBitMap(Set<T> set1, Set<T> set2)
    {
        Map<T, boolean[]> bitArray = new HashMap<T, boolean[]>();
        for (T t : set1)
        {
            bitArray.put(t, new boolean[] { true, false });
        }
        for (T t : set2)
        {
            if (bitArray.containsKey(t))
            {
                // item is present in set1
                bitArray.put(t, new boolean[] { true, true });
            }
            else
            {
                // item is not present in set1
                bitArray.put(t, new boolean[] { false, true });
            }
        }
        return bitArray;
    }
    public static void main(String[] args)
    {
        Set<String> set1 = new HashSet<String>();
        set1.add("FRANCISCO");
        set1.add("abc");
        set1.add("SAN");
        Set<String> set2 = new HashSet<String>();
        set2.add("b");
        set2.add("a");
        set2.add("SAN");
        set2.add("USA");
        FindSimilar<String> minHash = new FindSimilar<String>(set1.size() + set2.size());
        System.out.println("Set1 : " + set1);
        System.out.println("Set2 : " + set2);
        System.out.println("Similarity between two sets: "
                + minHash.similarity(set1, set2));
    }
}
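I need to use the similarity method with more than 2 sets. The problem is that I can't find a way to go through all of them. If I write a for loop, I can't express that I want to compare set1 with seti. I'm not sure I'm making sense; I have to admit I'm a bit confused.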
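The goal of the program is to compare users. Each user has a contact list (of other users), and similar users have similar contacts. Each set represents a user, and the contents of the set are that user's contacts.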
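To make the goal concrete, here is a minimal sketch of one common way to compare one set against many with MinHash: compute a fixed-length signature once per user, then compare the reference user's signature with each of the others. Everything in it is an assumption for illustration and not part of FindSimilar: the class name SignatureSketch, the user names, the seed, and the simple multiply-add hash family.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class SignatureSketch {
    private final int[] a; // per-function multipliers (forced odd)
    private final int[] b; // per-function offsets

    public SignatureSketch(int numHash, long seed) {
        Random r = new Random(seed);
        a = new int[numHash];
        b = new int[numHash];
        for (int i = 0; i < numHash; i++) {
            a[i] = r.nextInt() | 1;
            b[i] = r.nextInt();
        }
    }

    // For each hash function, keep the minimum hash over all elements of the set.
    public int[] signature(Set<String> set) {
        int[] sig = new int[a.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String element : set) {
            int x = element.hashCode();
            for (int i = 0; i < a.length; i++) {
                int h = a[i] * x + b[i]; // illustrative hash family, wraps on overflow
                h ^= (h >>> 16);         // fold high bits into low bits
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of positions in which the two signatures agree.
    public static double estimate(int[] s1, int[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }

    public static void main(String[] args) {
        // Hypothetical data: each user maps to the set of their contacts.
        Map<String, Set<String>> contacts = new LinkedHashMap<>();
        contacts.put("alice", new HashSet<>(Arrays.asList("bob", "carol", "dave")));
        contacts.put("erin", new HashSet<>(Arrays.asList("bob", "carol", "frank")));
        contacts.put("gina", new HashSet<>(Arrays.asList("heidi", "ivan")));

        SignatureSketch sketch = new SignatureSketch(128, 11L);
        Map<String, int[]> signatures = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : contacts.entrySet()) {
            signatures.put(e.getKey(), sketch.signature(e.getValue()));
        }

        // Compare the reference user against every other user.
        int[] reference = signatures.get("alice");
        for (Map.Entry<String, int[]> e : signatures.entrySet()) {
            if (!e.getKey().equals("alice")) {
                System.out.println("alice vs " + e.getKey() + ": "
                        + estimate(reference, e.getValue()));
            }
        }
    }
}
```

Because each signature is computed only once, adding another user costs one more signature and one more comparison, however many sets are involved.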
Answer 0 (score: 0)
I found a cheesy solution (I'm not sure about it) by putting all the sets inside an ArrayList and then converting it to an actual array:
ArrayList<Set<String>> list = new ArrayList<Set<String>>();
for (int i = 0; i < numPeople; i++) {
    Set<String> set1 = new HashSet<String>();
    list.add(set1);
    // another for loop goes here later on
}
Set<String>[] bs = list.toArray(new Set[0]);
.
.
.
public static void main(String[] args)
{
    .
    .
    .
    // Compare the first set against every other set.
    for (int i = 1; i < bs.length; i++) {
        System.out.format("Set %d: ", i + 1);
        System.out.println(bs[i]);
        System.out.println("Similarity between two sets: "
                + minHash.similarity(bs[0], bs[i]));
    }
}
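This produces the warning "The expression of type Set[] needs unchecked conversion to conform to Set<String>[]", but it runs fine. It does exactly what I want (I still need a for loop to put the data into the sets, but that shouldn't be hard). If anyone can tell me whether I should use this solution or whether there is a better option, I'd like to hear it; I'm still learning, so any information would be useful.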
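One possible alternative, sketched here under assumptions, is to skip the array conversion and loop over the List directly; with no generic array involved, the unchecked-conversion warning should not appear. The class name CompareAgainstList and the exact jaccard helper are hypothetical stand-ins so that the example runs on its own; in the real program the call would be minHash.similarity(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompareAgainstList {
    // Stand-in for minHash.similarity(a, b): exact Jaccard similarity.
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Compares the first set in the list against every other set.
    public static List<Double> compareFirstToRest(List<Set<String>> sets) {
        List<Double> scores = new ArrayList<>();
        Set<String> reference = sets.get(0);
        for (int i = 1; i < sets.size(); i++) {
            scores.add(jaccard(reference, sets.get(i)));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Set<String>> sets = new ArrayList<>();
        sets.add(new HashSet<>(Arrays.asList("SAN", "FRANCISCO", "abc")));
        sets.add(new HashSet<>(Arrays.asList("SAN", "USA", "a", "b")));
        sets.add(new HashSet<>(Arrays.asList("SAN", "FRANCISCO")));

        List<Double> scores = compareFirstToRest(sets);
        for (int i = 0; i < scores.size(); i++) {
            System.out.println("Set 1 vs set " + (i + 2) + ": " + scores.get(i));
        }
    }
}
```

The list can grow to any number of sets, and the loop needs no changes.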
Answer 1 (score: 0)
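In implementations of set similarity join algorithms, sets are usually converted to integer arrays. Each integer represents a set element, and the conversion is typically done with a hash map. The arrays are sorted so that the overlap between two sets can be computed in a merge-like fashion. If you are interested in these algorithms and their pruning techniques, the papers at http://ssjoin.dbresearch.uni-salzburg.at/ may be a good start.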
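The conversion and the merge-like overlap described above can be sketched roughly as follows. This only illustrates the general idea and is not code from the linked papers; the class name SortedOverlap and its method names are made up for the example, and no pruning is implemented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SortedOverlap {
    // Assigns each distinct element a stable integer id on first sight.
    private final Map<String, Integer> ids = new HashMap<>();

    // Converts a set into a sorted array of integer ids.
    public int[] encode(Set<String> set) {
        int[] out = new int[set.size()];
        int k = 0;
        for (String s : set) {
            Integer id = ids.get(s);
            if (id == null) {
                id = ids.size();
                ids.put(s, id);
            }
            out[k++] = id;
        }
        Arrays.sort(out);
        return out;
    }

    // Counts common ids with a single merge-style pass over both sorted arrays.
    public static int overlap(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int common = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                common++;
                i++;
                j++;
            }
        }
        return common;
    }

    // Jaccard similarity derived from the overlap and the two array lengths.
    public static double jaccard(int[] a, int[] b) {
        int common = overlap(a, b);
        int union = a.length + b.length - common;
        return union == 0 ? 1.0 : (double) common / union;
    }

    public static void main(String[] args) {
        SortedOverlap encoder = new SortedOverlap();
        int[] first = encoder.encode(new HashSet<>(Arrays.asList("SAN", "FRANCISCO", "abc")));
        int[] second = encoder.encode(new HashSet<>(Arrays.asList("SAN", "USA", "a", "b")));
        System.out.println("Overlap: " + overlap(first, second));
        System.out.println("Jaccard: " + jaccard(first, second));
    }
}
```

The same encoder instance has to be used for all sets so that equal elements receive equal ids.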