Question

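I have been thinking about this for a while, but I have run out of ideas. I have 10 arrays, each of length 18, containing 18 double values. These 18 values are the features of an image. Now I have to apply k-means clustering to them.

To implement k-means clustering, I need a unique computed value for each array. Is there any mathematical, statistical, or other logic that would help me create a computed value for each array that is unique based on the values inside it? Thanks in advance.

Here is an example of one of my arrays. There are 10 more like it.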

[0.07518284315321135    
0.002987851573676068    
0.002963866526639678    
0.002526139418225552    
0.07444872939213325 
0.0037219653347541617   
0.0036979802877177715   
0.0017920256571474585   
0.07499695903867931 
0.003477831820276616    
0.003477831820276616    
0.002036159171625004    
0.07383539747505984 
0.004311312204791184    
0.0043352972518275745   
0.0011786937400740452   
0.07353130134299131 
0.004339580295941216]

Answer 1

Have you looked at Arrays.hashCode in Java 7?

 /**
 * Returns a hash code based on the contents of the specified array.
 * For any two <tt>double</tt> arrays <tt>a</tt> and <tt>b</tt>
 * such that <tt>Arrays.equals(a, b)</tt>, it is also the case that
 * <tt>Arrays.hashCode(a) == Arrays.hashCode(b)</tt>.
 *
 * <p>The value returned by this method is the same value that would be
 * obtained by invoking the {@link List#hashCode() <tt>hashCode</tt>}
 * method on a {@link List} containing a sequence of {@link Double}
 * instances representing the elements of <tt>a</tt> in the same order.
 * If <tt>a</tt> is <tt>null</tt>, this method returns 0.
 *
 * @param a the array whose hash value to compute
 * @return a content-based hash code for <tt>a</tt>
 * @since 1.5
 */
public static int hashCode(double a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (double element : a) {
        long bits = Double.doubleToLongBits(element);
        result = 31 * result + (int)(bits ^ (bits >>> 32));
    }
    return result;
}

I don't understand why @Marco13 said "this does not return a unique value for arrays".

Update

See Marco13's comment for the reason why it cannot be unique.
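To make the non-uniqueness concrete: Arrays.hashCode squeezes every array into a 32-bit int, so different arrays must sometimes share a hash code. The sketch below builds one such collision by hand. The class name HashCollisionDemo and the chosen bit pattern are my own, picked only because the collision is easy to verify.

```java
import java.util.Arrays;

class HashCollisionDemo {
    // A double whose high and low 32-bit halves are identical, so the
    // per-element hash (int)(bits ^ (bits >>> 32)) is 0, the same as for 0.0.
    static final double TWIN = Double.longBitsToDouble(0x0000000100000001L);

    // True when two arrays with different contents share one hash code.
    static boolean collides() {
        double[] a = {0.0};
        double[] b = {TWIN};
        return !Arrays.equals(a, b) && Arrays.hashCode(a) == Arrays.hashCode(b);
    }

    public static void main(String[] args) {
        System.out.println(collides());
    }
}
```

Collisions like this are rare for real feature data, but they cannot be ruled out.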

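Update

If we plot your input points (18 elements), there is one peak followed by three low values, and the pattern repeats. If that holds, you can take the mean of the peaks (indices 0, 4, 8, 12, 16) and the mean of the remaining low values.

That gives you a peak mean and a low mean. You can then find a unique number representing the two, preserving both values, by using the bijective algorithm described here.

This algorithm also provides a formula for the inverse, i.e. for recovering the peak and low values from the unique value.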

To find the unique value for a pair: <x; y> = x + (y + ((x + 1) / 2))^2

Also see Exercise 1 on page 2 of the PDF for how to invert it and recover x and y.

Finding the means and the paired value:

public static double mean(double[] array) {
    double peakSum = 0;
    double lowSum = 0;
    int peakCount = 0;
    int lowCount = 0;
    for (int i = 0; i < array.length; i++) {
        // Peaks sit at every fourth element: indices 0, 4, 8, 12, 16
        if (i % 4 == 0) {
            peakSum += array[i];
            peakCount++;
        } else {
            lowSum += array[i];
            lowCount++;
        }
    }
    double peakMean = peakSum / peakCount;
    double lowMean = lowSum / lowCount;
    return bijective(lowMean, peakMean);
}

public static double bijective(double x, double y) {
    double tmp = y + ((x + 1) / 2);
    return x + (tmp * tmp);
}

To test it:

public static void main(String[] args) {
    double[] arrays = {0.07518284315321135, 0.002987851573676068, 0.002963866526639678, 0.002526139418225552, 0.07444872939213325, 0.0037219653347541617, 0.0036979802877177715, 0.0017920256571474585, 0.07499695903867931, 0.003477831820276616, 0.003477831820276616, 0.002036159171625004, 0.07383539747505984, 0.004311312204791184, 0.0043352972518275745, 0.0011786937400740452, 0.07353130134299131, 0.004339580295941216};
    System.out.println(mean(arrays));
}

You can use the peak and low values to find similar images.

Answer 2

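You can simply sum the values. With doubles, the resulting value will be unique most of the time. On the other hand, if the position of each value matters, you can apply the sum using the index as a multiplier.

The code can be as simple as this: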

public static double sum(double[] values) {
    double val = 0.0;
    for (double d : values) {
        val += d;
    }
    return val;
}

public static double hash_w_order(double[] values) {
    double val = 0.0;
    for (int i = 0; i < values.length; i++) {
        val += values[i] * (i + 1);
    }
    return val;
}

public static void main(String[] args) {
    double[] myvals =
        { 0.07518284315321135, 0.002987851573676068, 0.002963866526639678, 0.002526139418225552, 0.07444872939213325, 0.0037219653347541617, 0.0036979802877177715, 0.0017920256571474585, 0.07499695903867931, 0.003477831820276616,
                0.003477831820276616, 0.002036159171625004, 0.07383539747505984, 0.004311312204791184, 0.0043352972518275745, 0.0011786937400740452, 0.07353130134299131, 0.004339580295941216 };

    System.out.println("Computed value based on sum: " + sum(myvals));
    System.out.println("Computed value based on values and its position: " + hash_w_order(myvals));
}

With your list of values, that code outputs:

Computed value based on sum: 0.41284176550504803
Computed value based on values and its position: 3.7396448842464496
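The "most of the time" caveat matters, because both functions map many different arrays to the same number. A small sketch of that, repeating sum and hash_w_order from above so it runs on its own (the class name SumCollisionDemo and the sample arrays are made up for illustration):

```java
class SumCollisionDemo {
    static double sum(double[] values) {
        double val = 0.0;
        for (double d : values) {
            val += d;
        }
        return val;
    }

    static double hash_w_order(double[] values) {
        double val = 0.0;
        for (int i = 0; i < values.length; i++) {
            val += values[i] * (i + 1);
        }
        return val;
    }

    public static void main(String[] args) {
        // Reordering the elements does not change the plain sum.
        System.out.println(sum(new double[] {1, 2}) == sum(new double[] {2, 1}));
        // The position-weighted sum also collides: 3*1 + 0*2 == 1*1 + 1*2.
        System.out.println(hash_w_order(new double[] {3, 0}) == hash_w_order(new double[] {1, 1}));
    }
}
```

Both comparisons come out equal, so neither value identifies an array on its own.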

Answer 3

Well, here is a method that works for any number of doubles.

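import java.math.BigInteger;

public BigInteger uniqueID(double[] array) {
    // 2^64: the number of distinct 64-bit patterns, used as the base
    final BigInteger twoToTheSixtyFour = BigInteger.ONE.shiftLeft(64);

    BigInteger count = BigInteger.ZERO;
    for (double d : array) {
        long bitRepresentation = Double.doubleToRawLongBits(d);
        count = count.multiply(twoToTheSixtyFour);
        // Read the bits as unsigned so every digit lies in [0, 2^64),
        // even for negative doubles (sign bit set)
        count = count.add(new BigInteger(Long.toUnsignedString(bitRepresentation)));
    }
    return count;
}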

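Explanation

Each double is a 64-bit value, which means there are 2^64 different possible double values. Since a long is easier to work with for this kind of thing and has the same number of bits, we can make a one-to-one mapping from doubles to longs using Double.doubleToRawLongBits(double).

This is great, because now we can treat it as a simple combinatorics problem. You know how you can tell that 1234 is a unique number? No other number has the same value. That is because we can break it down by its digits:

1234 = 1 * 10^3 + 2 * 10^2 + 3 * 10^1 + 4 * 10^0

The powers of 10 would be the "basis" elements of the base-10 number system, if you know your linear algebra. In this way, base-10 numbers are like arrays consisting only of values from 0 to 9.

If we want something similar for double arrays, we can talk about the base-(2^64) number system. Each double value is a digit in the base-(2^64) representation of the value. With 18 digits, there are (2^64)^18 unique values for a double[] of length 18.

This number is huge, so we need to represent it with the BigInteger data structure rather than a primitive number. Just how big is it?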

（2 ^ 64）^ 18 = 61172327492847069472032393719205726809135813743440799050195397570919697796091958321786863938157971792315844506873509046544459008355036150650333616890210625686064472971480622053109783197015954399612052812141827922088117778074833698589048132156300022844899841969874763871624802603515651998113045708569927237462546233168834543264678118409417047146496

There are that many unique configurations of double arrays of length 18, and this code lets you describe each of them uniquely.
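One way to convince yourself that the base-(2^64) encoding loses nothing is to decode it again. The following is a minimal sketch, not part of the original answer: the class DoubleArrayCodec and its decode method are hypothetical names, and it assumes Java 8 or later for Long.toUnsignedString.

```java
import java.math.BigInteger;
import java.util.Arrays;

class DoubleArrayCodec {
    private static final BigInteger MASK_64 =
            BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

    // Encode: each double's raw bits become one base-2^64 digit.
    static BigInteger encode(double[] array) {
        BigInteger id = BigInteger.ZERO;
        for (double d : array) {
            long bits = Double.doubleToRawLongBits(d);
            // Unsigned conversion keeps every digit in [0, 2^64).
            id = id.shiftLeft(64).add(new BigInteger(Long.toUnsignedString(bits)));
        }
        return id;
    }

    // Decode: peel digits off the least significant end. The length must be
    // supplied, because leading zero digits (elements equal to 0.0) vanish.
    static double[] decode(BigInteger id, int length) {
        double[] out = new double[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = Double.longBitsToDouble(id.and(MASK_64).longValue());
            id = id.shiftRight(64);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] original = {0.07518284315321135, -0.5, 0.0};
        double[] roundTrip = decode(encode(original), original.length);
        System.out.println(Arrays.equals(original, roundTrip));
    }
}
```

Note that the ID is only unique among arrays of the same length, which is the situation in the question.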

Answer 4

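I will propose three methods and outline their different advantages and disadvantages.

1. Hash code. This is the obvious "solution", though it has been correctly pointed out that it will not be unique. However, it is very unlikely that any two arrays will have the same value.
2. Weighted sum. Your elements appear to be bounded; perhaps they range from a minimum of 0 to a maximum of 1. If so, you can multiply the first number by N^0, the second by N^1, the third by N^2, and so on, where N is some large number (ideally the inverse of your precision). This is easy to implement, especially if you use a matrix package, and it is very fast. We can make this unique if we choose.
3. Euclidean distance from the mean. Subtract the mean of your arrays from each array, square the results, and sum the squares. If you have an expected mean, you can use that instead. Again, this is not unique and there will be collisions, but you (almost) cannot avoid that.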
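The third method has no code elsewhere in this answer, so here is a minimal sketch of it. It assumes "the mean of your arrays" means the element-wise mean vector over all arrays; the class and method names are hypothetical.

```java
class DistanceFromMean {
    // Element-wise mean of all arrays (assumes at least one array and
    // that every array has the same length).
    static double[] meanVector(double[][] arrays) {
        double[] mean = new double[arrays[0].length];
        for (double[] array : arrays) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += array[i] / arrays.length;
            }
        }
        return mean;
    }

    // Sum of squared differences between an array and a center point.
    static double squaredDistance(double[] array, double[] center) {
        double sum = 0.0;
        for (int i = 0; i < array.length; i++) {
            double diff = array[i] - center[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] arrays = {{0, 0}, {2, 4}};
        double[] mean = meanVector(arrays);
        // Both arrays lie at the same distance from the mean: a collision.
        System.out.println(squaredDistance(arrays[0], mean));
        System.out.println(squaredDistance(arrays[1], mean));
    }
}
```

The two sample arrays get the same value even though they differ, which is the kind of collision the list item warns about.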

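The difficulty of uniqueness

It has already been explained that hashing will not give you a unique solution. With the weighted sum a unique number is possible in theory, but we have to use numbers of a very large size. Say your numbers are 64 bits in memory. That means each can represent 2^64 possible numbers (slightly fewer with floating point). Eighteen such numbers in an array can represent 2^(64*18) different values. That is huge. If you use anything smaller, you cannot guarantee uniqueness, because of the pigeonhole principle.

Let's look at a simple example. If you have four letters a, b, c and d, and you have to give each one a unique number using only the numbers 1 to 3, you can't. That is the pigeonhole principle. You have 2^(18*64) possible arrays. You cannot number them uniquely with fewer than 2^(18*64) numbers, and hashing does not give you that many.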
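The counting argument can be checked directly. This sketch only restates the arithmetic above; the class name ConfigurationCount is made up for illustration.

```java
import java.math.BigInteger;

class ConfigurationCount {
    // Number of distinct bit patterns for `length` 64-bit values: 2^(64 * length).
    static BigInteger configurations(int length) {
        return BigInteger.ONE.shiftLeft(64 * length);
    }

    public static void main(String[] args) {
        BigInteger n = configurations(18);
        System.out.println(n.bitLength() - 1);      // the exponent, 64 * 18 = 1152
        System.out.println(n.toString().length());  // number of decimal digits
        System.out.println(n.compareTo(BigInteger.ONE.shiftLeft(32)) > 0); // far more than an int hash offers
    }
}
```

An int hash code has only 2^32 possible values, so it cannot number 2^1152 configurations without collisions.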

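If you use BigDecimal, you can represent (almost) arbitrarily large numbers. If the largest element you can get is 1 and the smallest is 0, then you can set N = 1/(precision) and apply the weighted sum mentioned above. This guarantees uniqueness. The smallest positive double in Java is Double.MIN_VALUE, so that is the precision to use for unrestricted doubles. Note that the array of weights needs to be stored as BigDecimals!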

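This satisfies this part of your question:

"create a computed value for each array that is unique based on the values inside it"

However, there is a problem:

1 and 2 suck for K-means

I assumed, during the discussion with Marco13, that you are performing the clustering on single values, not on arrays of length 18. As Marco already mentioned, hashing sucks for K-means. The whole idea of a hash is that the smallest change in the data results in a large change in the hash value. That means two similar images, which produce two very similar arrays, produce two very different "unique" numbers. Similarity is not preserved. The result will be pseudo-random!

The weighted sum is better, but still pretty bad. It basically ignores all elements except the last one, unless the last elements are identical. Only then does it look at the second-to-last, and so on. Similarity is not really preserved.

Euclidean distance from the mean (or at least from some point) will at least group things together in a sensible way. Direction is ignored, but at least things far from the mean will not be grouped with things close to it. Similarity of one feature is preserved; the other features are lost.

Summary

1 is very easy, but not unique and does not preserve similarity.

2 is easy and can be unique, but does not preserve similarity.

3 is easy, not unique, and preserves some similarity.

An implementation of the weighted sum. Not really tested.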

import java.math.BigDecimal;
import java.math.RoundingMode;

public class Array2UniqueID {

private final double min;
private final double max;
private final double prec;
private final int length;

/**
 * Used to provide a {@code BigDecimal} that is unique to the given array.
 * <p>
 * This uses weighted sum to guarantee that two IDs match if and only if
 * every element of the array also matches. Similarity is not preserved.
 *
 * @param min smallest value an array element can possibly take
 * @param max largest value an array element can possibly take
 * @param prec smallest difference possible between two array elements
 * @param length length of each array
 */
public Array2UniqueID(double min, double max, double prec, int length) {
    this.min = min;
    this.max = max;
    this.prec = prec;
    this.length = length;
}

/**
 * A convenience constructor which assumes the array consists of doubles of
 * full range.
 * <p>
 * This will result in very large IDs being returned.
 *
 * @see Array2UniqueID#Array2UniqueID(double, double, double, int)
 * @param length
 */
public Array2UniqueID(int length) {
    this(-Double.MAX_VALUE, Double.MAX_VALUE, Double.MIN_VALUE, length);
}

public BigDecimal createUniqueID(double[] array) {
    // Validate the data
    if (array.length != length) {
        throw new IllegalArgumentException("Array length must be "
                + length + " but was " + array.length);
    }
    for (double d : array) {
        if (d < min || d > max) {
            throw new IllegalArgumentException("Each element of the array"
                    + " must be in the range [" + min + ", " + max + "]");
        }
    }

    // Exact BigDecimal arithmetic; a plain double subtraction could overflow
    BigDecimal precision = new BigDecimal(prec);
    BigDecimal lowest = new BigDecimal(min);

    /* maxNums is the maximum number of numbers that could possibly exist
     * between max and min.
     * The ID will be in the range 0 to maxNums^length.
     * maxNums = range / prec + 1
     * Stored as a BigDecimal for convenience, but is an integer
     */
    BigDecimal maxNums = new BigDecimal(max).subtract(lowest)
            .divide(precision, 0, RoundingMode.CEILING)
            .add(BigDecimal.ONE);

    BigDecimal id = BigDecimal.ZERO;

    // id = sum over i of digit_i * maxNums^i, where digit_i is the
    // element's offset from min measured in units of prec
    for (int i = 0; i < array.length; i++) {
        BigDecimal digit = new BigDecimal(array[i]).subtract(lowest)
                .divide(precision, 0, RoundingMode.HALF_UP);

        id = id.add(digit.multiply(maxNums.pow(i)));
    }

    return id;
}

}

Answer 5

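As I understand it, you are going to do k-means clustering based on the double values.

Why not wrap each double value in an object that also holds the array and position identifiers, so that you know which cluster it ends up in?

Something like: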

 public class Element {
     final public double value;
     final public int array;
     final public int position;
     public Element(double value, int array, int position) {
         this.value = value;
         this.array = array;
         this.position = position;
     }
 }

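If you need to cluster whole arrays,

you can convert each original array of length 18 into an array of length 19, where the last or first element is a unique ID that you ignore during clustering but can refer to once clustering is done. That way the memory footprint stays small (8 extra bytes per array) and the link to the original values is easy to follow.
If space is absolutely a concern and all the values in your arrays are less than 1, you can add a unique ID (greater than or equal to 1) to each value and cluster on the remainder of division by 1: 0.07518284315321135 stays 0.07518284315321135 for the first array and becomes 1.07518284315321135 for the second. This does, however, increase the complexity of the computation during clustering.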
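A minimal sketch of the length-19 idea, assuming the ID goes in the last slot and the distance function simply skips it. TaggedFeatures, withId and squaredDistance are hypothetical names, not part of any clustering library.

```java
import java.util.Arrays;

class TaggedFeatures {
    // Append the ID as an extra last slot; the features are left unchanged.
    static double[] withId(double[] features, int id) {
        double[] tagged = Arrays.copyOf(features, features.length + 1);
        tagged[features.length] = id;
        return tagged;
    }

    // Squared Euclidean distance that ignores the trailing ID slot.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length - 1; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] first = withId(new double[] {0, 0}, 1);
        double[] second = withId(new double[] {3, 4}, 2);
        System.out.println(squaredDistance(first, second)); // IDs 1 and 2 play no part
        System.out.println((int) second[second.length - 1]); // read the ID back afterwards
    }
}
```

After clustering, the last element of each array tells you which original array it was.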

Answer 6

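First, let's try to understand mathematically what you need:

Uniquely mapping an array of m real numbers to a single number is in fact a bijection between R^m and R, or at least N.

Since floating-point numbers are actually rational numbers, your problem is to find a bijection between Q^m and N, which can be turned into one from N^m to N, since you know your values are always greater than 0 (just scale your values by the precision).

So you need to map N^m to N. Have a look at the Cantor pairing function for some ideas.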
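As a rough sketch of that idea: the Cantor pairing function maps a pair of natural numbers to one natural number, and applying it repeatedly folds a whole tuple of fixed length into a single value. The code assumes the doubles have already been converted to non-negative integers (for example by scaling by the precision); the class and method names are hypothetical.

```java
import java.math.BigInteger;

class CantorPairing {
    // pi(a, b) = (a + b)(a + b + 1) / 2 + b, for non-negative a and b.
    static BigInteger pair(BigInteger a, BigInteger b) {
        BigInteger s = a.add(b);
        return s.multiply(s.add(BigInteger.ONE)).shiftRight(1).add(b);
    }

    // Fold the pairing over a tuple of fixed length m: N^m -> N.
    // BigInteger is used because the result grows very quickly with m.
    static BigInteger pairAll(long[] values) {
        BigInteger acc = BigInteger.valueOf(values[0]);
        for (int i = 1; i < values.length; i++) {
            acc = pair(acc, BigInteger.valueOf(values[i]));
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(pair(BigInteger.ONE, BigInteger.ONE)); // pi(1, 1) = 4
        System.out.println(pairAll(new long[] {1, 1, 0}));        // pi(pi(1, 1), 0) = pi(4, 0) = 10
    }
}
```

Like the other unique-ID approaches, this gives uniqueness but says nothing about similarity between arrays.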

Answer 7

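A guaranteed way to generate a unique result based on the array is to convert it into one big string and use that as the computed value.

It may be slow, but it will be unique based on the values of the array.

Implementation example: Best way to convert an ArrayList to a string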
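For a plain double[] the standard library already does the conversion. A minimal sketch, with ArrayKey and key as made-up names:

```java
import java.util.Arrays;

class ArrayKey {
    // Arrays.toString formats each element with Double.toString, which
    // prints enough digits to tell any two different doubles apart.
    static String key(double[] values) {
        return Arrays.toString(values);
    }

    public static void main(String[] args) {
        System.out.println(key(new double[] {0.5, 0.25})); // prints [0.5, 0.25]
    }
}
```

The resulting string works as a map key or label, but it is not a number that k-means can cluster on.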

Unique computed value for an array
