Question

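Problem statement: I have a sorted array of N integers and a threshold K. I want to group them so that, for every element, the difference between the element and its group's mean is <= K. What is the best algorithm for this?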

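I have looked into Jenks natural breaks and k-means clustering, but both seem better suited to cases where you know the desired number of clusters up front, whereas what I have is a desired maximum variance per cluster.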

// example
const distances = [5, 8, 8, 9, 16, 20, 29, 42, 56, 57, 57, 58, 103, 104, 150, 167]
const threshold = 10

// desired output:
// cluster(distances) =>
// [
//   [8, 8, 9, 5, 16, 20]
//   [29, 42]
//   [56, 57, 57, 58]
//   [103, 104]
//   [150, 167]
// ]

Here is my progress so far: https://gist.github.com/qrohlf/785c667735171b7353702cc74c10857d

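I may try some kind of divide-and-conquer approach to correct the 'ballpark' results I'm getting from my current implementation, but I don't see a good, clean way to do that right now.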

Answer 1

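I searched around a bit and found this: Unweighted Pair Group Method with Arithmetic Mean (UPGMA). Here is an article with an example: link. I think it will help you, and it looks easy to adapt to your purpose.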

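The UPGMA algorithm produces a rooted dendrogram and requires a constant-rate assumption; that is, it assumes an ultrametric tree in which the distance from the root to every branch tip is equal.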
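
To make the "arithmetic mean" part concrete, here is a minimal sketch of the distance UPGMA uses between two groups: the average of all pairwise distances between their members. The name `averageLinkage` is made up for this illustration and does not come from the linked article.

```javascript
// Average-linkage (UPGMA) distance between two groups of numbers:
// the mean of |x - y| over every pair (x in groupA, y in groupB).
// `averageLinkage` is a hypothetical helper name used only for illustration.
const averageLinkage = (groupA, groupB) => {
  let total = 0
  for (const x of groupA) {
    for (const y of groupB) {
      total += Math.abs(x - y)
    }
  }
  return total / (groupA.length * groupB.length)
}

// (|1-5| + |1-9| + |3-5| + |3-9|) / 4 = (4 + 8 + 2 + 6) / 4
console.log(averageLinkage([1, 3], [5, 9])) // 5
```

For one-dimensional data where the two groups do not overlap on the number line, this average equals the difference between the two group means (here 7 - 2 = 5), so comparing means is enough in that case.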

Answer 2

For anyone else who lands here, this is an (unoptimized) implementation of the UPGMA algorithm described above:

const head = array => array[0]
const tail = array => array.slice(1)
const last = array => array[array.length - 1]
const sum = array => array.reduce((a, b) => a + b)
const avg = array => sum(array) / array.length
const minIndex = array => array.reduce((iMin, x, i) => x < array[iMin] ? i : iMin, 0)
const range = length => Array.apply(null, Array(length)).map((_, i) => i)
const isArray = Array.isArray

const distances = [5, 8, 8, 9, 16, 20, 29, 42, 56, 57, 57, 58, 103, 104, 150, 167, 800]

// cluster an array of numeric values such that every point in a cluster is
// within a threshold value of that cluster's mean
const cluster = (points, threshold = 10) => {
  return _cluster(points, range(points.length).map(i => [i]), threshold).map(c =>
    isArray(c) ? c.map(i => points[i]) : [points[c]])
}

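// recursive call: merge the two closest clusters until the next merge would
// violate the threshold
const _cluster = (points, clusters, threshold) => {
  // base case: with fewer than two clusters there is nothing left to merge
  // (without this check a lone cluster gets merged with itself and the
  // recursion never terminates)
  if (clusters.length < 2) {
    return clusters
  }

  const matrix = getDistanceMatrix(points, clusters)
  // get the minimum col index for each row in the matrix
  const rowMinimums = matrix.map(minIndex)
  // get the index of the row containing the smallest distance
  const bestRow = minIndex(rowMinimums.map((col, row) => matrix[row][col]))
  // and the column of that smallest distance within the row
  const bestCol = rowMinimums[bestRow]
  const isValid = isValidCluster(points, mergeClusters(clusters[bestRow], clusters[bestCol]), threshold)

  if (!isValid) {
    return clusters
  }

  return _cluster(points, merge(clusters, bestRow, bestCol), threshold)
}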

const isValidCluster = (points, cluster, threshold) => {
  // at this point, cluster is guaranteed to be an array, not a single point
  const distances = cluster.map(i => points[i])
  const mean = avg(distances)
  return distances.every(d => Math.abs(mean - d) <= threshold)
}

// immutable merge of indices a and b in clusters
const merge = (clusters, a, b) => {
  // merge two clusters by index
  const clusterA = clusters[a]
  const clusterB = clusters[b]
  // optimization opportunity: this filter is causing *another* iteration
  // of clusters.
  const withoutPoints = clusters.filter(c => c !== clusterA && c !== clusterB)

  return [mergeClusters(clusterA, clusterB)].concat(withoutPoints)
}

const mergeClusters = (clusterA, clusterB) => clusterA.concat(clusterB)

// optimization opportunity: this currently does 2x the work needed, since the
// distance from a->b is the same as the distance from b->a
const getDistanceMatrix = (points, clusters) => {
  // reduce clusters to distance/average distance
  const reduced = clusters.map(c => Array.isArray(c) ? avg(c.map(i => points[i])) : points[c])
  return reduced.map((i, row) => reduced.map((j, col) => (row === col) ? Infinity : Math.abs(j - i)))
}

const log2DArray = rows => console.log('[\n' + rows.map(row => '  [' + row.join(', ') + ']').join('\n') + '\n]')

console.log('clustered points:')
log2DArray(cluster(distances))
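
One way to act on the optimization notes in the comments above, sketched under the assumption that the input is already sorted in ascending order (as the question states): the closest pair of cluster means is then always a pair of neighbours, so only adjacent clusters need to be compared and no full distance matrix is built. The names `mean`, `withinThreshold` and `clusterSorted` are invented for this sketch. It keeps the same stopping rule as the code above (stop as soon as merging the closest pair would break the threshold), so it is a variant of that approach rather than a guaranteed-optimal grouping.

```javascript
const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length

// true when every value lies within `threshold` of the group's mean
const withinThreshold = (xs, threshold) => {
  const m = mean(xs)
  return xs.every(x => Math.abs(x - m) <= threshold)
}

// ASSUMPTION: sortedValues is in ascending order, so clusters stay contiguous
const clusterSorted = (sortedValues, threshold = 10) => {
  let clusters = sortedValues.map(v => [v])

  while (clusters.length > 1) {
    // find the pair of neighbouring clusters whose means are closest
    let best = 0
    let bestGap = Infinity
    for (let i = 0; i + 1 < clusters.length; i++) {
      const gap = mean(clusters[i + 1]) - mean(clusters[i])
      if (gap < bestGap) {
        bestGap = gap
        best = i
      }
    }

    const merged = clusters[best].concat(clusters[best + 1])
    // same stopping rule as above: stop at the first merge that is not valid
    if (!withinThreshold(merged, threshold)) break

    clusters = [...clusters.slice(0, best), merged, ...clusters.slice(best + 2)]
  }

  return clusters
}

const sample = [5, 8, 8, 9, 16, 20, 29, 42, 56, 57, 57, 58, 103, 104, 150, 167]
console.log(JSON.stringify(clusterSorted(sample, 10)))
// [[5,8,8,9,16,20],[29,42],[56,57,57,58],[103,104],[150,167]]
```

On the example array from the question this produces the five groups requested there. The loop replaces the recursion, so a long input cannot exhaust the call stack, and an input that collapses into a single cluster simply ends the loop.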

Grouping an array by a mean-difference threshold