Selecting overlapping clusters with a specific threshold

Time: 2018-06-18 17:03:22

Tags: optimization cluster-analysis threshold nonlinear-optimization

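Say I have clustered my dataset and have 10 clusters. These clusters do not overlap. Now assume I change some feature in all the data points and cluster again, which gives me 10 more clusters. If I repeat this, say, 3 more times, I end up with 50 clusters. Each cluster has a score associated with it, calculated from its constituent data points.

These 50 clusters now have overlapping data points. From these 50 clusters I want to select a set of clusters whose pairwise overlap stays within a specific allowed threshold, so that the total score of the selected clusters is as high as possible.

One way is a greedy method: I sort the clusters by score from highest to lowest and select the highest-scoring cluster. From there I keep selecting clusters whose overlap with the already selected clusters is within the threshold. Although it is fast, it does not seem to give the optimal solution.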

Example: say I have 3 clusters with the following scores:

C1 = (A,B,C,D,E,F) Score = 10

C2 = (A,B,C,D) Score = 6

C3 = (D,E,F) Score = 6

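The allowed overlap is 1 element, or less than 40% of the size of the smaller cluster.

The greedy approach will return {C1} with a total score of 10, whereas a better option is {C2, C3} with a total score of 6 + 6 = 12. Their overlap is the single element 'D', which is 1/size(C3) = 1/3 = 33.33% < 40%.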
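The greedy procedure described above can be sketched in a few lines of Python to reproduce the {C1} result. This is only an illustration: the function names are made up here, and the overlap rule is assumed to be "shared elements <= 40% of the smaller cluster's size".

```python
# Example clusters and scores from the question
clusters = {
    "C1": set("ABCDEF"),
    "C2": set("ABCD"),
    "C3": set("DEF"),
}
scores = {"C1": 10, "C2": 6, "C3": 6}


def within_threshold(a, b, threshold=0.4):
    """Assumed rule: shared elements <= threshold * size of the smaller cluster."""
    return len(a & b) <= threshold * min(len(a), len(b))


def greedy_select(clusters, scores, threshold=0.4):
    """Visit clusters by descending score; keep each one that is compatible
    with every cluster kept so far."""
    chosen = []
    for name in sorted(clusters, key=lambda n: scores[n], reverse=True):
        if all(within_threshold(clusters[name], clusters[c], threshold) for c in chosen):
            chosen.append(name)
    return chosen, sum(scores[c] for c in chosen)


print(greedy_select(clusters, scores))  # (['C1'], 10)
```

Once C1 is taken, both C2 and C3 are rejected because each shares more than 40% of its elements with C1, so the greedy pass never reaches the {C2, C3} combination.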

I am looking for another method that gives the optimal solution, or at least a better solution than the greedy method described above.

1 answer:

Answer 0: (score: 1)

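The version of this problem without the overlap threshold is answered at the following link: Selecting non-overlapping best quality clusters

You can add a new constraint to the model coded in that link which checks the overlap between the selected clusters and limits it by the allowed threshold.

Here is the Python code for the above problem: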
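Written out, the model with the added overlap constraint looks roughly as follows. The notation is introduced here for illustration only: s_i is the score of cluster C_i, x_i is its binary selection variable, o_ij = |C_i ∩ C_j| is the number of shared data points, and t is the allowed overlap fraction (0.4 in the example).

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{n} s_i x_i \\
\text{s.t.}\quad & o_{ij}\, x_i x_j \le t \cdot \min\bigl(|C_i|, |C_j|\bigr), \qquad 1 \le i < j \le n, \\
& x_i \in \{0, 1\}, \qquad i = 1, \dots, n.
\end{aligned}
```

Because o_ij is a constant and x is binary, each constraint is either always satisfied (the pair's overlap is within the threshold) or equivalent to the linear constraint x_i + x_j <= 1 (the pair cannot be selected together). The problem can therefore also be read as a maximum-weight independent set on the graph of conflicting cluster pairs.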

from gurobipy import GRB, Model
import string
# List of all subtomograms
data_points = string.ascii_uppercase[:6]
data_points = list(data_points)

# Clusters as list of lists, where each list is list of subtomograms
clusters = []
clusters.append(['A', 'B', 'C', 'D', 'E', 'F'])
clusters.append(['A', 'B', 'C', 'D'])
clusters.append(['D', 'E', 'F'])

# Create a matrix: num_subtomograms x num_clusters
matrix = {}
for dp in data_points:
    matrix[dp] = [0]*len(clusters)

# Make matrix[subtomogram_i][cluster_i] = 1, if subtomogram_i is present in cluster_i
for i in range(0, len(clusters)):
    for dp in clusters[i]:
        matrix[dp][i] = 1

# Score of each cluster in the same order as used in matrix
cost = [10, 6, 6]

# Gurobi MIP model
m = Model("Cluster selection optimization")
m.params.outputflag = 1
m.params.method = 2 # 2 = barrier; selects the algorithm Gurobi uses for the continuous (root) relaxation

# Adding a variable x where x[i] will represent whether or not ith cluster is selected or not
x = m.addVars(len(clusters), vtype=GRB.BINARY, name='x')

# generate objective function: score[0]x[0] + score[1]x[1] .....
indices = range(0, len(clusters))
coef_x = dict()
obj = 0.0
for i in indices:
    coef_x[i] = cost[i]
    obj += coef_x[i] * x[i]
m.setObjective(obj, GRB.MAXIMIZE)

# Generate constraints
threshold = 0.4 # 40% threshold set
count = 0
# m_sum[i] = number of data points in cluster i
m_sum = []
for i in range(len(clusters)):
    m_sum.append(sum([matrix[k][i] for k in data_points]))
# For every pair i < j, the shared data points count against the threshold
# only when both clusters are selected, i.e. when x[i]*x[j] == 1
for i in range(len(clusters)):
    for j in range(i+1, len(clusters)):
        overlap = sum([matrix[k][i]*matrix[k][j] for k in data_points])
        tmp = (overlap*x[i]*x[j] <= threshold*min(m_sum[i], m_sum[j]))
        m.addConstr(tmp, "C"+str(count))
        count += 1

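# Optimize
m.optimize()
print("Optimized")

# Report the objective value and the (0-based) indices of the selected clusters;
# these lines are reconstructed to match the last lines of the log below
print("Final Obj:", m.objVal)
for i in indices:
    if x[i].X > 0.5:
        print(i)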

The result and log output of the above run are:

Parameter outputflag unchanged
   Value: 1  Min: 0  Max: 1  Default: 1
Changed value of parameter method to 2
   Prev: -1  Min: -1  Max: 5  Default: -1
Optimize a model with 0 rows, 3 columns and 0 nonzeros
Model has 3 quadratic constraints
Variable types: 0 continuous, 3 integer (3 binary)
Coefficient statistics:
  Matrix range     [0e+00, 0e+00]
  QMatrix range    [1e+00, 4e+00]
  Objective range  [6e+00, 1e+01]
  Bounds range     [1e+00, 1e+00]
  RHS range        [0e+00, 0e+00]
  QRHS range       [1e+00, 2e+00]
Found heuristic solution: objective -0.0000000
Modified 2 Q diagonals
Modified 2 Q diagonals
Presolve time: 0.00s
Presolved: 0 rows, 3 columns, 0 nonzeros
Variable types: 0 continuous, 3 integer (3 binary)
Presolve removed 0 rows and 3 columns
Presolve: All rows and columns removed

Root relaxation: objective 2.200000e+01, 0 iterations, 0.00 seconds

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

*    0     0               0      12.0000000   12.00000  0.00%     -    0s

Explored 0 nodes (2 simplex iterations) in 0.01 seconds
Thread count was 32 (of 32 available processors)

Solution count 2: 12 -0 

Optimal solution found (tolerance 1.00e-04)
Best objective 1.200000000000e+01, best bound 1.200000000000e+01, gap 0.0000%
Optimized
Final Obj: 12.0
1
2
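The indices 1 and 2 printed at the end of the log are 0-based, so they correspond to C2 and C3. For an instance this small, the solver's answer can be cross-checked without Gurobi by enumerating every subset. The sketch below does that; the function names are hypothetical, it assumes the same "overlap <= 40% of the smaller cluster" rule as the model, and it is exponential in the number of clusters, so it is only a sanity check and not a replacement for the MIP.

```python
from itertools import combinations

# Same example data as in the Gurobi model, with 0-based cluster indices
clusters = [set("ABCDEF"), set("ABCD"), set("DEF")]
scores = [10, 6, 6]
THRESHOLD = 0.4  # allowed overlap as a fraction of the smaller cluster


def compatible(a, b, threshold=THRESHOLD):
    """True if the two clusters may be selected together."""
    return len(a & b) <= threshold * min(len(a), len(b))


def best_subset(clusters, scores):
    """Try every non-empty subset and keep the feasible one with the highest total score."""
    best_score, best_pick = 0, ()
    n = len(clusters)
    for r in range(1, n + 1):
        for pick in combinations(range(n), r):
            pairs_ok = all(
                compatible(clusters[i], clusters[j])
                for i, j in combinations(pick, 2)
            )
            if pairs_ok:
                total = sum(scores[i] for i in pick)
                if total > best_score:
                    best_score, best_pick = total, pick
    return best_score, best_pick


print(best_subset(clusters, scores))  # (12, (1, 2))
```

The enumeration agrees with the solver: the best feasible selection is clusters 1 and 2 with a total score of 12.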

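There are other ways to solve this as well, such as AI search methods (hill climbing, simulated annealing, etc.) and evolutionary optimization methods such as genetic algorithms (you could use NSGA2 after modifying it for your problem; code is available at {{3}}).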
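As a rough illustration of the simulated annealing option, the sketch below flips one cluster in or out of the selection at each step, rejects infeasible selections, and occasionally accepts a worse selection while the temperature is high. Everything here is an assumption made for the example: the function names, the move type, and the parameter values (`steps`, `start_temp`, `cooling`) are untuned placeholders. Unlike the MIP, this approach gives no optimality guarantee.

```python
import math
import random

clusters = [set("ABCDEF"), set("ABCD"), set("DEF")]
scores = [10, 6, 6]


def feasible(selected, threshold=0.4):
    """Every selected pair must overlap by at most threshold * smaller cluster size."""
    picked = sorted(selected)
    return all(
        len(clusters[i] & clusters[j]) <= threshold * min(len(clusters[i]), len(clusters[j]))
        for pos, i in enumerate(picked)
        for j in picked[pos + 1:]
    )


def total(selected):
    return sum(scores[i] for i in selected)


def anneal(steps=2000, start_temp=10.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    current, best = set(), set()
    temp = start_temp
    for _ in range(steps):
        # Neighbour move: toggle one randomly chosen cluster
        candidate = current ^ {rng.randrange(len(clusters))}
        if feasible(candidate):
            delta = total(candidate) - total(current)
            # Always accept improvements; accept a worse move with probability exp(delta / temp)
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                current = candidate
                if total(current) > total(best):
                    best = set(current)
        temp = max(temp * cooling, 1e-9)  # geometric cooling, kept strictly positive
    return best, total(best)


best, score = anneal()
print(sorted(best), score)
```

Accepting worse moves is what lets the search leave {C1} (score 10), pass through the empty selection, and reach {C2, C3} (score 12); a plain hill climber that only accepts improvements would stay stuck at whichever single cluster it picked first.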