问题陈述:我有一个N
排序整数数组和一个阈值K
。我想对它们进行分组,对于每个元素,组均值和元素之间的差异是<= K
。什么是最好的算法?
我已经研究过Jenks的自然中断和k-means聚类,但是这两者似乎更适合于你拥有所需数量的聚类的情况,而我每个聚类都有一个理想的最大方差。
// example
const distances = [5, 8, 8, 9, 16, 20, 29, 42, 56, 57, 57, 58, 103, 104, 150, 167]
const threshold = 10
// desired output:
// cluster(distances) =>
// [
// [8, 8, 9, 5, 16, 20]
// [29, 42]
// [56, 57, 57, 58]
// [103, 104]
// [150, 167]
// ]
到目前为止,这是我的进展:https://gist.github.com/qrohlf/785c667735171b7353702cc74c10857d
我可能会尝试某种分而治之的方法来纠正我从目前的实施中得到的'球场'结果,但我真的没有看到一个很好的,干净的方法来做到这一点现在。
答案 0 :(得分:3)
我搜索了一下,我发现了这个:具有算术平均值的未加权对组方法。 这是一篇带有示例的文章:link。我认为它会对你有所帮助,看起来很容易确定你的目的。
UPGMA 算法生成有根树形图并需要恒定速率假设 - 也就是说,它假设一个超参数树,其中从根到每个分支尖端的距离相等。
答案 1 :(得分:1)
对于其他任何人来说,这是我上面描述的UPGMA算法的(未经优化的)实现:
const head = array => array[0]
const tail = array => array.slice(1)
const last = array => array[array.length - 1]
const sum = array => array.reduce((a, b) => a + b)
const avg = array => sum(array) / array.length
const minIndex = array => array.reduce((iMin, x, i) => x < array[iMin] ? i : iMin, 0)
const range = length => Array.apply(null, Array(length)).map((_, i) => i)
const isArray = Array.isArray
const distances = [5, 8, 8, 9, 16, 20, 29, 42, 56, 57, 57, 58, 103, 104, 150, 167, 800]
// cluster an array of numeric values such that the mean difference of each
// point within each cluster is within a threshold value
const cluster = (points, threshold = 10) => {
return _cluster(points, range(points.length).map(i => [i]), threshold).map(c =>
isArray(c) ? c.map(i => points[i]) : [points[c]])
}
// recursive call
const _cluster = (points, clusters, threshold) => {
const matrix = getDistanceMatrix(points, clusters)
// get the minimum col index for each row in the matrix
const rowMinimums = matrix.map(minIndex)
// get the index for the column containing the smallest distance
const bestRow = minIndex(rowMinimums.map((col, row) => matrix[row][col]))
const bestCol = rowMinimums[bestRow]
const isValid = isValidCluster(points, mergeClusters(clusters[bestRow], clusters[bestCol]), threshold)
if (!isValid) {
return clusters
}
return _cluster(points, merge(clusters, bestRow, bestCol), threshold)
}
const isValidCluster = (points, cluster, threshold) => {
// at this point, cluster is guaranteed to be an array, not a single point
const distances = cluster.map(i => points[i])
const mean = avg(distances)
return distances.every(d => Math.abs(mean - d) <= threshold)
}
// immutable merge of indices a and b in clusters
const merge = (clusters, a, b) => {
// merge two clusters by index
const clusterA = clusters[a]
const clusterB = clusters[b]
// optimization opportunity: this filter is causing *another* iteration
// of clusters.
const withoutPoints = clusters.filter(c => c !== clusterA && c !== clusterB)
return [mergeClusters(clusterA, clusterB)].concat(withoutPoints)
}
const mergeClusters = (clusterA, clusterB) => clusterA.concat(clusterB)
// optimization opportunity: this currently does 2x the work needed, since the
// distance from a->b is the same as the distance from b->a
const getDistanceMatrix = (points, clusters) => {
// reduce clusters to distance/average distance
const reduced = clusters.map(c => Array.isArray(c) ? avg(c.map(i => points[i])) : points[c])
return reduced.map((i, row) => reduced.map((j, col) => (row === col) ? Infinity : Math.abs(j - i)))
}
const log2DArray = rows => console.log('[\n' + rows.map(row => ' [' + row.join(', ') + ']').join('\n') + '\n]')
console.log('clustered points:')
log2DArray(cluster(distances))
&#13;