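I want to cluster tweets by latitude and longitude, and I am using the OPTICS algorithm (a Java implementation) because it seems to be the best choice for density-based clustering. The algorithm reads its points from an input file, where each line is a vector. My dataset contains the latitude and longitude of each tweet. Can I use the latitude and longitude directly to extract clusters, or do I need to convert them into some other form before clustering with OPTICS?
Thanks in advance.
My sample input file: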
37.3456227 -121.8847222
37.3904943 -121.8854337
37.2589827 -121.8847222
37.3558627 -121.8505679
37.3189149 -121.9416226
37.3052272 -121.9871217
37.3716914 -121.8619539
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
OPTICS algorithm code snippet:
    /**
     * Run the OPTICS algorithm
     *
     * @param inputFile
     *            an input file path containing a list of vectors of double
     *            values
     * @param minPts
     *            the minimum number of points (see DBScan article)
     * @param epsilon
     *            the epsilon distance (see DBScan article)
     * @param separator
     *            the string that is used to separate double values on each line
     *            of the input file (default: single space)
     * @return the cluster ordering, i.e. the points in the order OPTICS visited them
     * @throws IOException
     *             if an error occurs while reading the file
     */
    public List<DoubleArrayOPTICS> computerClusterOrdering(String inputFile,
            int minPts, double epsilon, String separator)
            throws NumberFormatException, IOException {
        // record the start time
        timeExtractClusterOrdering = 0;
        long startTimestampClusterOrdering = System.currentTimeMillis();
        // Structure to store the vectors from the file
        List<DoubleArray> points = new ArrayList<DoubleArray>();
        // read the vectors from the input file
        BufferedReader reader = new BufferedReader(new FileReader(inputFile));
        String line;
        // for each line until the end of the file
        while (((line = reader.readLine()) != null)) {
            // if the line is a comment, is empty or is a
            // kind of metadata
            if (line.isEmpty() == true || line.charAt(0) == '#'
                    || line.charAt(0) == '%' || line.charAt(0) == '@') {
                continue;
            }
            line = line.trim();
            // split the line by the separator
            String[] lineSplited = line.split(separator);
            // create a vector of double
            double[] vector = new double[lineSplited.length];
            // for each value of the current line
            for (int i = 0; i < lineSplited.length; i++) {
                // convert to double
                double value = Double.parseDouble(lineSplited[i]);
                // add the value to the current vector
                vector[i] = value;
            }
            // add the vector to the list of vectors
            points.add(new DoubleArrayOPTICS(vector));
        }
        // close the file
        reader.close();
        // build kd-tree
        kdtree = new KDTree();
        kdtree.buildtree(points);
        // For debugging, you can print the KD-Tree by uncommenting the
        // following line:
        // System.out.println(kdtree.toString());
        // Variable to store the order of points generated by OPTICS
        clusterOrdering = new ArrayList<DoubleArrayOPTICS>();
        // For each point in the dataset
        for (DoubleArray point : points) {
            // if the node is already visited, we skip it
            DoubleArrayOPTICS pointDBS = (DoubleArrayOPTICS) point;
            if (pointDBS.visited == false) {
                // process this point
                expandClusterOrder(pointDBS, clusterOrdering, epsilon, minPts);
            }
        }
        // check memory usage
        MemoryLogger.getInstance().checkMemory();
        // record end time
        timeExtractClusterOrdering = System.currentTimeMillis()
                - startTimestampClusterOrdering;
        kdtree = null;
        // return the cluster ordering
        return clusterOrdering;
    }
Answer 0 (score: 0)
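The standard OPTICS implementation in the ELKI framework works well with great-circle distance, so yes, this is possible.
See, for example, this answer for details:
How to index with ELKI - OPTICS clustering
That implementation also supports indexes and is very fast.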
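To illustrate what "great-circle distance" means for the sample input in the question, here is a minimal sketch in plain Java. It is not ELKI's API and not part of the code quoted in the question: the class name `GeoDistance`, its method names, and the mean Earth radius of 6371.0088 km are assumptions made for illustration. `haversineKm` computes the great-circle distance between two latitude/longitude pairs. `toUnitSphere` shows one possible conversion for an implementation that only supports Euclidean distance on a kd-tree: it maps each point to 3D coordinates on a unit sphere, where the straight-line (chord) distance grows monotonically with the great-circle distance.

```java
/** Hypothetical helper class; the names are illustrative, not from ELKI or the quoted code. */
final class GeoDistance {
    /** Assumed mean Earth radius in kilometres (spherical Earth model). */
    static final double EARTH_RADIUS_KM = 6371.0088;

    private GeoDistance() {
    }

    /** Great-circle distance in km between two points given in degrees (haversine formula). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double sinHalfDLat = Math.sin(Math.toRadians(lat2 - lat1) / 2);
        double sinHalfDLon = Math.sin(Math.toRadians(lon2 - lon1) / 2);
        double a = sinHalfDLat * sinHalfDLat
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLon * sinHalfDLon;
        // clamp to 1.0 to guard against rounding error for near-antipodal points
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Maps a latitude/longitude pair in degrees to (x, y, z) on a unit sphere. */
    static double[] toUnitSphere(double latDeg, double lonDeg) {
        double phi = Math.toRadians(latDeg);
        double lambda = Math.toRadians(lonDeg);
        return new double[] {
            Math.cos(phi) * Math.cos(lambda),
            Math.cos(phi) * Math.sin(lambda),
            Math.sin(phi)
        };
    }

    /** Unit-sphere chord length that corresponds to a great-circle distance in km. */
    static double chordForKm(double km) {
        return 2 * Math.sin(km / (2 * EARTH_RADIUS_KM));
    }
}
```

For example, `GeoDistance.haversineKm(37.3456227, -121.8847222, 37.3904943, -121.8854337)` gives the distance between the first two sample points. With the converted 3D vectors, an epsilon expressed in kilometres would be passed as `GeoDistance.chordForKm(epsilonKm)` instead of a value in degrees. Using raw degrees with Euclidean distance distorts east-west distances by roughly the cosine of the latitude, which is why some conversion or a geodetic distance function is worth considering.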