Question

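I want to cluster tweets by latitude and longitude using the OPTICS algorithm (a Java implementation), since it seems to be the best choice for density-based clustering. The algorithm reads its points from an input file, where each line of the file is one vector. My dataset contains the latitude and longitude of each tweet. Can I use the latitude and longitude directly to extract clusters, or do I need to convert them to some other form before clustering with OPTICS?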

Thanks in advance.

My sample input file:

37.3456227 -121.8847222
37.3904943 -121.8854337
37.2589827 -121.8847222
37.3558627 -121.8505679
37.3189149 -121.9416226
37.3052272 -121.9871217
37.3716914 -121.8619539
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002

OPTICS algorithm code snippet:

/**
     * Run the OPTICS algorithm
     * 
     * @param inputFile
     *            an input file path containing a list of vectors of double
     *            values
     * @param minPts
     *            the minimum number of points (see DBScan article)
     * @param epsilon
     *            the epsilon distance (see DBScan article)
     * @param separator
     *            the string that is used to separate double values on each line
     *            of the input file (default: single space)
     * @return the cluster ordering, i.e. the points in the order in which
     *         OPTICS processed them
     * @throws IOException
     *             exception if an error occurs while reading the input file
     */
    public List<DoubleArrayOPTICS> computerClusterOrdering(String inputFile,
            int minPts, double epsilon, String separator)
            throws NumberFormatException, IOException {

        // record the start time
        timeExtractClusterOrdering = 0;
        long startTimestampClusterOrdering = System.currentTimeMillis();

        // Structure to store the vectors from the file
        List<DoubleArray> points = new ArrayList<DoubleArray>();

        // read the vectors from the input file
        BufferedReader reader = new BufferedReader(new FileReader(inputFile));
        String line;
        // for each line until the end of the file
        while (((line = reader.readLine()) != null)) {
            // trim first, so that whitespace-only lines are skipped
            // instead of failing later in Double.parseDouble
            line = line.trim();
            // if the line is a comment, is empty or is a
            // kind of metadata
            if (line.isEmpty() || line.charAt(0) == '#'
                    || line.charAt(0) == '%' || line.charAt(0) == '@') {
                continue;
            }
            // split the line by the separator
            String[] lineSplited = line.split(separator);
            // create a vector of double
            double[] vector = new double[lineSplited.length];
            // for each value of the current line
            for (int i = 0; i < lineSplited.length; i++) {
                // convert to double
                double value = Double.parseDouble(lineSplited[i]);
                // add the value to the current vector
                vector[i] = value;
            }
            // add the vector to the list of vectors
            points.add(new DoubleArrayOPTICS(vector));
        }
        // close the file
        reader.close();

        // build kd-tree
        kdtree = new KDTree();
        kdtree.buildtree(points);

        // For debugging, you can print the KD-Tree by uncommenting the
        // following line:
        // System.out.println(kdtree.toString());

        // Variable to store the order of points generated by OPTICS
        clusterOrdering = new ArrayList<DoubleArrayOPTICS>();

        // For each point in the dataset
        for (DoubleArray point : points) {
            // if the node is already visited, we skip it
            DoubleArrayOPTICS pointDBS = (DoubleArrayOPTICS) point;
            if (pointDBS.visited == false) {
                // process this point
                expandClusterOrder(pointDBS, clusterOrdering, epsilon, minPts);
            }
        }

        // check memory usage
        MemoryLogger.getInstance().checkMemory();

        // record end time
        timeExtractClusterOrdering = System.currentTimeMillis()
                - startTimestampClusterOrdering;

        kdtree = null;

        // return the cluster ordering
        return clusterOrdering;
    }

Answer 1

The standard implementation of OPTICS in the ELKI framework works well with great-circle distance. So yes, this is possible.
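
For readers unfamiliar with great-circle distance, here is a minimal Java sketch of the haversine formula, one common way to compute it from latitude/longitude pairs. The class name `GeoDistance`, the method `haversineKm`, and the mean Earth radius constant are assumptions chosen for illustration; this is not ELKI's code, and it treats the Earth as a perfect sphere.

```java
// Illustrative helper (hypothetical name), not part of ELKI or of the code in the question.
final class GeoDistance {

    // Mean Earth radius in kilometers (spherical approximation, an assumption).
    static final double EARTH_RADIUS_KM = 6371.0088;

    private GeoDistance() {
    }

    /**
     * Great-circle distance between two points given in decimal degrees,
     * computed with the haversine formula.
     *
     * @return the distance in kilometers
     */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);

        double sinHalfPhi = Math.sin(dPhi / 2.0);
        double sinHalfLambda = Math.sin(dLambda / 2.0);

        // a = sin^2(dPhi/2) + cos(phi1) * cos(phi2) * sin^2(dLambda/2)
        double a = sinHalfPhi * sinHalfPhi
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfLambda * sinHalfLambda;

        // clamp to 1.0 to guard against rounding errors before asin
        double c = 2.0 * Math.asin(Math.min(1.0, Math.sqrt(a)));
        return EARTH_RADIUS_KM * c;
    }
}
```

If a distance like this is used in place of Euclidean distance on the raw coordinates, epsilon is expressed in kilometers rather than in degrees.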

See, for example, this answer for details:

How to index with ELKI - OPTICS clustering

The implementation also supports indexing, and it is very fast.
