Question

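I want to use Apache Commons Math's DBSCANClusterer<T extends Clusterable> to cluster data with the DBSCAN algorithm, but with a custom distance metric, because my data points contain non-numerical values. This seems to have been easy to do in the older version, which is now deprecated (note that the fully qualified name of the class is org.apache.commons.math3.stat.clustering.DBSCANClusterer<T> in the older version, whereas it is org.apache.commons.math3.ml.clustering.DBSCANClusterer<T> in the current release). In the older version, Clusterable took a type parameter T describing the type of the data points being clustered, and the distance between two points was defined by one's implementation of Clusterable.distanceFrom(T), e.g.:

class MyPoint implements Clusterable<MyPoint> {
    private String someStr = ...;
    private double someDouble = ...;

    @Override
    public double distanceFrom(MyPoint p) {
        // Arbitrary distance metric goes here, e.g.:
        double stringsEqual = this.someStr.equals(p.someStr) ? 0.0 : 10000.0;
        return stringsEqual + Math.sqrt(Math.pow(p.someDouble - this.someDouble, 2.0));
    }
}

In the current release, Clusterable is no longer parameterized. This means that one has to come up with a way of representing one's (potentially non-numerical) data points as a double[] and return that representation from getPoint(), e.g.:

class MyPoint implements Clusterable {
    private String someStr = ...;
    private double someDouble = ...;

    @Override
    public double[] getPoint() {
        double[] res = new double[2];
        res[1] = someDouble; // obvious
        res[0] = ...; // some way of representing someStr as a double required
        return res;
    }
}

One then provides an implementation of DistanceMeasure that defines the custom distance function in terms of the double[] representations of the two points being compared, e.g.:

class CustomDistanceMeasure implements DistanceMeasure {
    @Override
    public double compute(double[] a, double[] b) {
        // Let's mimic the distance function from earlier, assuming that
        // a[0] is different from b[0] if the two 'someStr' variables were
        // different when their double representations were created.
        double stringsEqual = a[0] == b[0] ? 0.0 : 10000.0;
        return stringsEqual + Math.sqrt(Math.pow(a[1] - b[1], 2.0));
    }
}

My data points are of the form (integer, integer, string, string):

class MyPoint {
    int i1;
    int i2;
    String str1;
    String str2;
}

I want to use a distance function/metric that essentially says "if str1 and/or str2 differ between MyPoint mpa and MyPoint mpb, the distance is maximal; otherwise, the distance is the Euclidean distance between the integers", as shown in the following snippet:

class Dist {
    static double distance(MyPoint mpa, MyPoint mpb) {
        // Points whose strings differ are treated as maximally far apart.
        if (!mpa.str1.equals(mpb.str1) || !mpa.str2.equals(mpb.str2)) {
            return Double.MAX_VALUE;
        }
        // Otherwise: Euclidean distance over the two integer fields.
        return Math.sqrt(Math.pow(mpa.i1 - mpb.i1, 2.0) + Math.pow(mpa.i2 - mpb.i2, 2.0));
    }
}

Questions:

1. How do I represent a String as a double so that the above distance metric works in the current release (v3.6.1) of Apache Commons Math? String.hashCode() is insufficient, because hash code collisions would cause different strings to be treated as equal. This seems like an unsolvable problem, since I am essentially trying to create a unique mapping from an infinite set of strings to a finite set of numerical values (64-bit double).
2. Since (1) seems impossible, am I misunderstanding how the library is meant to be used? If so, where did I take a wrong turn?
3. Is my only alternative to use the deprecated version for this kind of distance metric? If so, (3a) why would the designers choose to make the library less general? Perhaps in favor of speed? Perhaps to get rid of the self-reference in class MyPoint implements Clusterable<MyPoint>, which some might consider bad design? (I realize this may be too opinion-based, so please disregard it if that is the case.) For the commons-math experts: (3b) what downsides are there to using the deprecated version, other than forward compatibility (the deprecated version will be removed in 4.0)? Is it slower? Perhaps even incorrect?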
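
To make the collision concern from question 1 concrete, here is a minimal, standard-library-only sketch. It first shows two different strings with the same hashCode(), then illustrates one conceivable workaround under the assumption that every point is known before clustering starts: each distinct string seen in the data set gets its own small integer index. StringIndexSketch and StringIndexer are hypothetical names for illustration only and are not part of Commons Math.

```java
import java.util.HashMap;
import java.util.Map;

public class StringIndexSketch {

    /** Hypothetical helper: gives each distinct string its own index, starting at 0. */
    static final class StringIndexer {
        private final Map<String, Integer> ids = new HashMap<>();

        double idOf(String s) {
            Integer id = ids.get(s);
            if (id == null) {
                id = ids.size();   // next unused index
                ids.put(s, id);
            }
            // Every int value is exactly representable as a double.
            return id.doubleValue();
        }
    }

    public static void main(String[] args) {
        // Two different strings that share a hash code (both hash to 2112).
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true

        StringIndexer indexer = new StringIndexer();
        double a = indexer.idOf("Aa");
        double b = indexer.idOf("BB");
        System.out.println(a == b);                  // false: distinct strings, distinct ids
        System.out.println(a == indexer.idOf("Aa")); // true: same string, same id
    }
}
```

Under that assumption, such ids could fill the str1 and str2 slots of the double[] returned by getPoint(), and a DistanceMeasure could compare them with ==. This only works if all points share the same indexer instance, and it sidesteps rather than solves the infinite-to-finite mapping problem, since only the strings that actually occur in the data set are numbered.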

Note: I am aware of ELKI, which is apparently popular among a set of SO users, but it does not fit my needs, as it is marketed as a command-line and GUI tool rather than a Java library to be included in third-party applications:

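You can even embed ELKI into your application (if you accept the AGPL-3 license), but we currently do not (yet) recommend doing so, because the API is still changing substantially. [...]

ELKI is not designed as an embeddable library. It can be used that way, but it is not designed to be. ELKI has tons of options and functionality, and this comes at a price, both in runtime (although it can easily outperform R and Weka, for example!) and in particular in code complexity.

ELKI was designed for research in data mining algorithms, not for making them easy to include in arbitrary applications. Instead, if you have a particular problem, you should use ELKI to find out which approach works well, and then reimplement that approach in a manner optimized for your problem (maybe even in C++, to further reduce memory and runtime).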

Custom distance metric for DBSCAN in Apache Commons Math (v3.1 vs. v3.6)

0 answers: