Question

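Say I have clustered my data set and have 10 clusters. These clusters are non-overlapping. Now suppose I change some feature in all of my data points and cluster again, which gives me 10 more clusters. If I repeat this, say, 3 more times, I end up with 50 clusters. Each cluster has a score associated with it, calculated from its constituent data points.

These 50 clusters now have overlapping data points. Out of these 50, I want to select the set of clusters whose pairwise overlap stays within a given threshold and whose total score is the highest.

One way is a greedy approach: sort the clusters by score from highest to lowest, pick the highest-scoring cluster, and then keep picking clusters whose overlap with the already selected clusters is within the threshold. It is fast, but it does not seem to give the optimal solution.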

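Example: suppose I have 3 clusters with the following scores:

C1 = (A, B, C, D, E, F), score = 10

C2 = (A, B, C, D), score = 6

C3 = (D, E, F), score = 6

The allowed overlap is 1 element, or less than 40% of the size of the smaller cluster.

The greedy approach returns {C1} with a total score of 10, whereas a better choice is {C2, C3} with a total score of 6 + 6 = 12. Their only overlapping element is 'D', which is 1/size(C3) = 1/3 = 33.33% < 40%.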
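For reference, the greedy procedure described above can be sketched in a few lines of Python. The function names `overlap_ok` and `greedy_select` are hypothetical, and the overlap rule is assumed to be "shared points are at most 40% of the smaller cluster", which is one reading of the rule stated above.

```python
def overlap_ok(a, b, threshold=0.4):
    """Assumed rule: shared points are at most `threshold` of the smaller cluster."""
    return len(a & b) <= threshold * min(len(a), len(b))


def greedy_select(clusters, scores, threshold=0.4):
    """Visit clusters by descending score; keep each one compatible with all kept so far."""
    order = sorted(clusters, key=lambda name: scores[name], reverse=True)
    chosen = []
    for name in order:
        if all(overlap_ok(clusters[name], clusters[c], threshold) for c in chosen):
            chosen.append(name)
    return chosen


# The three clusters from the example
clusters = {"C1": set("ABCDEF"), "C2": set("ABCD"), "C3": set("DEF")}
scores = {"C1": 10, "C2": 6, "C3": 6}

chosen = greedy_select(clusters, scores)
total = sum(scores[c] for c in chosen)
print(chosen, total)  # ['C1'] 10
```

Once C1 is taken, both C2 and C3 are rejected because they overlap it too much, even though C2 and C3 are compatible with each other.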

I am looking for another approach that gives the optimal solution, or at least a better one than the greedy approach described above.

Answer 1

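An answer to the unconstrained version of this problem is given in the following link: Selecting non-overlapping best quality clusters

You can add a new constraint to the model coded in that link, one that checks the overlap between selected clusters and limits it to the allowed threshold.

Here is the Python code for the problem above: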

from gurobipy import *
import string

# All data points: 'A' through 'F'
data_points = list(string.ascii_uppercase[:6])

# Clusters as a list of lists; each inner list holds the data points of one cluster
clusters = []
clusters.append(['A', 'B', 'C', 'D', 'E', 'F'])
clusters.append(['A', 'B', 'C', 'D'])
clusters.append(['D', 'E', 'F'])

# Membership matrix: num_data_points x num_clusters
matrix = {}
for dp in data_points:
    matrix[dp] = [0]*len(clusters)

# Set matrix[dp][i] = 1 if data point dp is present in cluster i
for i in range(0, len(clusters)):
    for dp in clusters[i]:
        matrix[dp][i] = 1

# Score of each cluster, in the same order as the columns of matrix
cost = [10, 6, 6]

# Gurobi MIP model
m = Model("Cluster selection optimization")
m.params.outputflag = 1
m.params.method = 2 # barrier algorithm (used for continuous models and the root relaxation of a MIP)

# Binary variable x[i]: 1 if cluster i is selected, 0 otherwise
x = m.addVars(len(clusters), vtype=GRB.BINARY, name='x')

# Objective: maximize cost[0]*x[0] + cost[1]*x[1] + ...
indices = range(0, len(clusters))
coef_x = dict()
obj = 0.0
for i in indices:
    coef_x[i] = cost[i]
    obj += coef_x[i] * x[i]
m.setObjective(obj, GRB.MAXIMIZE)

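# Generate constraints: the overlap of a pair of clusters counts against
# the limit only when both clusters are selected (x[i]*x[j] == 1)
threshold = 0.4 # 40% threshold set
count = 0
m_sum = [] # m_sum[i] = size of cluster i
for i in range(len(clusters)):
    m_sum.append(sum([matrix[k][i] for k in data_points]))
for i in range(len(clusters)):
    for j in range(i+1, len(clusters)):
        # overlap(i, j) * x[i] * x[j] <= threshold * size of the smaller cluster
        tmp = (sum([matrix[k][i]*matrix[k][j] for k in data_points])*x[i]*x[j] <= threshold*min(m_sum[i], m_sum[j]))
        m.addConstr(tmp, "C"+str(count))
        count += 1

# Optimize
m.optimize()
print("Optimized")

# Report the objective value and the indices of the selected clusters
print("Final Obj:", m.objVal)
for i in indices:
    if x[i].X > 0.5:
        print(i)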

The result and log output of the run above:

Parameter outputflag unchanged
   Value: 1  Min: 0  Max: 1  Default: 1
Changed value of parameter method to 2
   Prev: -1  Min: -1  Max: 5  Default: -1
Optimize a model with 0 rows, 3 columns and 0 nonzeros
Model has 3 quadratic constraints
Variable types: 0 continuous, 3 integer (3 binary)
Coefficient statistics:
  Matrix range     [0e+00, 0e+00]
  QMatrix range    [1e+00, 4e+00]
  Objective range  [6e+00, 1e+01]
  Bounds range     [1e+00, 1e+00]
  RHS range        [0e+00, 0e+00]
  QRHS range       [1e+00, 2e+00]
Found heuristic solution: objective -0.0000000
Modified 2 Q diagonals
Modified 2 Q diagonals
Presolve time: 0.00s
Presolved: 0 rows, 3 columns, 0 nonzeros
Variable types: 0 continuous, 3 integer (3 binary)
Presolve removed 0 rows and 3 columns
Presolve: All rows and columns removed

Root relaxation: objective 2.200000e+01, 0 iterations, 0.00 seconds

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

*    0     0               0      12.0000000   12.00000  0.00%     -    0s

Explored 0 nodes (2 simplex iterations) in 0.01 seconds
Thread count was 32 (of 32 available processors)

Solution count 2: 12 -0 

Optimal solution found (tolerance 1.00e-04)
Best objective 1.200000000000e+01, best bound 1.200000000000e+01, gap 0.0000%
Optimized
Final Obj: 12.0
1
2
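The solver's answer (clusters 1 and 2, i.e. C2 and C3, with objective 12) can be cross-checked without Gurobi on an instance this small. Because every x[i] is binary, a constraint of the form overlap * x[i] * x[j] <= limit only matters when both clusters are selected, so it amounts to forbidding every pair whose overlap exceeds the limit (x[i] + x[j] <= 1). The sketch below builds that set of forbidden pairs and enumerates all subsets. The names `conflicting_pairs` and `best_subset` are hypothetical, and the enumeration is exponential in the number of clusters, so it is meant only as a check on tiny inputs, not as a replacement for the MIP.

```python
from itertools import combinations


def conflicting_pairs(clusters, threshold=0.4):
    """Pairs whose overlap exceeds `threshold` of the smaller cluster's size."""
    return {
        (a, b)
        for a, b in combinations(list(clusters), 2)
        if len(clusters[a] & clusters[b]) > threshold * min(len(clusters[a]), len(clusters[b]))
    }


def best_subset(clusters, scores, threshold=0.4):
    """Exhaustively find the highest-scoring subset with no conflicting pair."""
    conflicts = conflicting_pairs(clusters, threshold)
    names = list(clusters)
    best, best_score = (), 0
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            # Skip subsets that contain a forbidden pair
            if any(pair in conflicts for pair in combinations(subset, 2)):
                continue
            total = sum(scores[n] for n in subset)
            if total > best_score:
                best, best_score = subset, total
    return best, best_score


clusters = {"C1": set("ABCDEF"), "C2": set("ABCD"), "C3": set("DEF")}
scores = {"C1": 10, "C2": 6, "C3": 6}

best, best_score = best_subset(clusters, scores)
print(best, best_score)  # ('C2', 'C3') 12
```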

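There are other ways to solve this as well, such as artificial intelligence methods (hill climbing, simulated annealing, etc.) and evolutionary optimization methods such as genetic algorithms (for example NSGA2, for which code is publicly available and which you can use after adapting it to your problem).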
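As a rough illustration of the simulated annealing option, here is a minimal sketch. Everything in it is an assumption made for illustration: the move (toggle one random cluster in or out), the linear cooling schedule, the parameter values, and the names `feasible` and `anneal`. It is a heuristic, so it returns the best feasible selection it happened to visit, with no guarantee of optimality.

```python
import math
import random
from itertools import combinations


def feasible(selection, clusters, threshold=0.4):
    """True if every selected pair overlaps by at most `threshold` of the smaller cluster."""
    return all(
        len(clusters[a] & clusters[b]) <= threshold * min(len(clusters[a]), len(clusters[b]))
        for a, b in combinations(sorted(selection), 2)
    )


def anneal(clusters, scores, threshold=0.4, steps=2000, start_temp=5.0, seed=0):
    rng = random.Random(seed)
    names = sorted(clusters)
    current, current_score = set(), 0
    best, best_score = set(), 0
    for step in range(steps):
        # Temperature decreases linearly towards (almost) zero
        temp = start_temp * (1 - step / steps) + 1e-9
        # Move: toggle one randomly chosen cluster in or out of the selection
        candidate = current ^ {rng.choice(names)}
        if not feasible(candidate, clusters, threshold):
            continue
        cand_score = sum(scores[n] for n in candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / temp)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = set(current), current_score
    return best, best_score


clusters = {"C1": set("ABCDEF"), "C2": set("ABCD"), "C3": set("DEF")}
scores = {"C1": 10, "C2": 6, "C3": 6}

best, best_score = anneal(clusters, scores)
print(sorted(best), best_score)
```

Accepting some score-decreasing moves is what lets the search drop C1 and later reach {C2, C3}, a step that a pure greedy or hill-climbing search cannot take from {C1}.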

Selecting overlapping clusters with a specific threshold
