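Say I have clustered my dataset and obtained 10 clusters. These clusters do not overlap. Now suppose I change some feature across all data points and cluster again, which gives me another 10 clusters. If I repeat this, say, 3 more times, I end up with 50 clusters. Each cluster has a score associated with it, computed from its constituent data points.
These 50 clusters now share data points. From these 50 clusters I want to select a set of clusters whose pairwise overlap stays within a given threshold, such that the total score of the selected clusters is as high as possible.
One way is a greedy approach: I sort the clusters by score from highest to lowest and pick the highest-scoring cluster. From there I keep picking clusters whose overlap with the already-selected clusters is within the threshold. It is fast, but it does not seem to give the optimal solution.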
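For reference, the greedy procedure described above could be sketched roughly as follows. The function names and the `max_frac` parameter are made up for illustration, and the overlap rule (shared points at most 40% of the smaller cluster) is one assumed reading of the threshold.

```python
def overlap_allowed(a, b, max_frac=0.4):
    """True if clusters a and b share at most max_frac of the smaller cluster."""
    shared = len(set(a) & set(b))
    return shared <= max_frac * min(len(a), len(b))


def greedy_select(clusters, scores, max_frac=0.4):
    """Pick clusters in descending score order, skipping any that overlap too much."""
    order = sorted(range(len(clusters)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(overlap_allowed(clusters[i], clusters[j], max_frac) for j in chosen):
            chosen.append(i)
    return chosen


# Hypothetical toy data: three clusters over the points A..F
clusters = [list("ABCDEF"), list("ABCD"), list("DEF")]
scores = [10, 6, 6]
picked = greedy_select(clusters, scores)
total = sum(scores[i] for i in picked)
print(picked, total)
```

Once the highest-scoring cluster is taken, everything that overlaps it too much is locked out, which is exactly why this can miss a better combination of smaller clusters.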
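Example: suppose I have 3 clusters with the following scores:
C1 = (A, B, C, D, E, F), score = 10
C2 = (A, B, C, D), score = 6
C3 = (D, E, F), score = 6
The allowed overlap is 1 element, or less than 40% of the size of the smaller cluster.
The greedy approach returns {C1} with a total score of 10, whereas a better choice is {C2, C3} with a total score of 6 + 6 = 12. Their only overlapping element is 'D', so the overlap is 1 element, and 1/size(C3) = 1/3 = 33.33% < 40%.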
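The example is small enough to check by brute force. The sketch below enumerates every subset of clusters and keeps the best feasible one; `feasible` and `brute_force` are illustrative names, and the overlap rule is again read as "shared points at most 40% of the smaller cluster". This only works for a handful of clusters, since the number of subsets grows as 2^n.

```python
from itertools import combinations


def feasible(subset, clusters, max_frac=0.4):
    """True if every pair in subset respects the overlap limit."""
    for i, j in combinations(subset, 2):
        a, b = set(clusters[i]), set(clusters[j])
        if len(a & b) > max_frac * min(len(a), len(b)):
            return False
    return True


def brute_force(clusters, scores, max_frac=0.4):
    """Try all subsets and return (best_score, best_subset)."""
    best_score, best_subset = 0, ()
    for size in range(1, len(clusters) + 1):
        for subset in combinations(range(len(clusters)), size):
            if feasible(subset, clusters, max_frac):
                score = sum(scores[i] for i in subset)
                if score > best_score:
                    best_score, best_subset = score, subset
    return best_score, best_subset


clusters = [list("ABCDEF"), list("ABCD"), list("DEF")]  # C1, C2, C3
scores = [10, 6, 6]
best_score, best_subset = brute_force(clusters, scores)
print(best_score, best_subset)  # indices are 0-based, so C2 and C3 are 1 and 2
```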
I am looking for another method that gives the optimal solution, or at least a better solution than the greedy approach described above.
Answer 0 (score: 1)
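An answer to the unconstrained version of this problem is given in the linked question: Selecting non-overlapping best quality clusters
You can add a new constraint to the model coded in that link, one that checks the overlap between selected clusters and limits it to the allowed threshold.
Here is the Python code for the problem above: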
from gurobipy import Model, GRB
import string

# List of all data points (subtomograms)
data_points = list(string.ascii_uppercase[:6])

# Clusters as a list of lists, where each inner list holds the subtomograms of one cluster
clusters = []
clusters.append(['A', 'B', 'C', 'D', 'E', 'F'])
clusters.append(['A', 'B', 'C', 'D'])
clusters.append(['D', 'E', 'F'])

# Create a matrix: num_subtomograms x num_clusters
matrix = {}
for dp in data_points:
    matrix[dp] = [0] * len(clusters)

# Set matrix[subtomogram][cluster_i] = 1 if the subtomogram is present in cluster_i
for i in range(len(clusters)):
    for dp in clusters[i]:
        matrix[dp][i] = 1

# Score of each cluster, in the same order as used in matrix
cost = [10, 6, 6]

# Gurobi MIP model
m = Model("Cluster selection optimization")
m.params.outputflag = 1
m.params.method = 2  # 2 = barrier; applies to continuous models and the MIP root relaxation

# Binary variable x[i] is 1 if the ith cluster is selected, 0 otherwise
x = m.addVars(len(clusters), vtype=GRB.BINARY, name='x')

# Objective function: cost[0]*x[0] + cost[1]*x[1] + ...
indices = range(len(clusters))
obj = 0.0
for i in indices:
    obj += cost[i] * x[i]
m.setObjective(obj, GRB.MAXIMIZE)

# Constraints: for every pair of clusters, the overlap counts only if both are
# selected, and must not exceed 40% of the smaller cluster's size
threshold = 0.4
count = 0

# m_sum[i] is the size of cluster i
m_sum = []
for i in range(len(clusters)):
    m_sum.append(sum(matrix[k][i] for k in data_points))

for i in range(len(clusters)):
    for j in range(i + 1, len(clusters)):
        overlap = sum(matrix[k][i] * matrix[k][j] for k in data_points)
        tmp = (overlap * x[i] * x[j] <= threshold * min(m_sum[i], m_sum[j]))
        m.addConstr(tmp, "C" + str(count))
        count += 1

# Optimize
m.optimize()
print("Optimized")

# Report the objective value and the (0-based) indices of the selected clusters
print("Final Obj:", m.objVal)
for i in indices:
    if x[i].X > 0.5:
        print(i)
The result and log output of the above run:
Parameter outputflag unchanged
Value: 1 Min: 0 Max: 1 Default: 1
Changed value of parameter method to 2
Prev: -1 Min: -1 Max: 5 Default: -1
Optimize a model with 0 rows, 3 columns and 0 nonzeros
Model has 3 quadratic constraints
Variable types: 0 continuous, 3 integer (3 binary)
Coefficient statistics:
Matrix range [0e+00, 0e+00]
QMatrix range [1e+00, 4e+00]
Objective range [6e+00, 1e+01]
Bounds range [1e+00, 1e+00]
RHS range [0e+00, 0e+00]
QRHS range [1e+00, 2e+00]
Found heuristic solution: objective -0.0000000
Modified 2 Q diagonals
Modified 2 Q diagonals
Presolve time: 0.00s
Presolved: 0 rows, 3 columns, 0 nonzeros
Variable types: 0 continuous, 3 integer (3 binary)
Presolve removed 0 rows and 3 columns
Presolve: All rows and columns removed
Root relaxation: objective 2.200000e+01, 0 iterations, 0.00 seconds
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
* 0 0 0 12.0000000 12.00000 0.00% - 0s
Explored 0 nodes (2 simplex iterations) in 0.01 seconds
Thread count was 32 (of 32 available processors)
Solution count 2: 12 -0
Optimal solution found (tolerance 1.00e-04)
Best objective 1.200000000000e+01, best bound 1.200000000000e+01, gap 0.0000%
Optimized
Final Obj: 12.0
1
2
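A side note on the model: the overlap between two given clusters is a fixed number, so each pairwise limit either never binds or simply forbids selecting both clusters. Seen that way, the problem is a maximum-weight independent set on a "conflict graph", and the quadratic constraints could equally be written as linear ones of the form x[i] + x[j] <= 1 for each conflicting pair. The sketch below illustrates that view without a solver, using a small exact branch-and-bound search; `conflict_pairs` and `best_selection` are illustrative names, not part of Gurobi, and the search is only meant for modest instance sizes.

```python
def conflict_pairs(clusters, max_frac=0.4):
    """Pairs (i, j), i < j, whose overlap exceeds max_frac of the smaller cluster."""
    sets = [set(c) for c in clusters]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if len(sets[i] & sets[j]) > max_frac * min(len(sets[i]), len(sets[j])):
                pairs.append((i, j))
    return pairs


def best_selection(scores, pairs):
    """Exact search: maximise total score with no conflicting pair selected."""
    n = len(scores)
    neighbours = [set() for _ in range(n)]
    for i, j in pairs:
        neighbours[i].add(j)
        neighbours[j].add(i)

    best = {"score": 0, "chosen": []}

    def search(i, chosen, score, blocked):
        if i == n:
            if score > best["score"]:
                best["score"], best["chosen"] = score, list(chosen)
            return
        # Bound: even taking every remaining positive score cannot beat the incumbent
        if score + sum(max(s, 0) for s in scores[i:]) <= best["score"]:
            return
        if i not in blocked:
            chosen.append(i)
            search(i + 1, chosen, score + scores[i], blocked | neighbours[i])
            chosen.pop()
        search(i + 1, chosen, score, blocked)

    search(0, [], 0, frozenset())
    return best["score"], best["chosen"]


clusters = [list("ABCDEF"), list("ABCD"), list("DEF")]
scores = [10, 6, 6]
pairs = conflict_pairs(clusters)
result = best_selection(scores, pairs)
print(pairs, result)
```

On the three-cluster example the conflicting pairs are (C1, C2) and (C1, C3), so the search ends up with clusters 1 and 2 (0-based), in line with the solver log above.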
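There are other ways to tackle this as well, such as AI search methods (hill climbing, simulated annealing, etc.) and evolutionary optimization methods such as genetic algorithms (for example, you could adapt NSGA2 to your problem; code for it is publicly available).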
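As a rough idea of what the simulated annealing route might look like, here is a minimal sketch. Everything in it (the function name, the cooling schedule, the default parameters, the "add a cluster and evict whatever conflicts with it" move) is an assumption chosen for illustration, not a tuned or recommended setup. Unlike the MIP model, it gives no optimality guarantee.

```python
import math
import random


def anneal_select(clusters, scores, max_frac=0.4, steps=2000,
                  t_start=5.0, t_end=0.01, seed=0):
    """Simulated annealing over feasible selections; returns (indices, score)."""
    rng = random.Random(seed)
    sets = [set(c) for c in clusters]
    n = len(sets)

    def conflicts(i, j):
        return len(sets[i] & sets[j]) > max_frac * min(len(sets[i]), len(sets[j]))

    current, current_score = set(), 0
    best, best_score = set(), 0
    for step in range(steps):
        # Geometric cooling from t_start down to t_end
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n)
        if i in current:
            candidate = current - {i}
        else:
            # Add i and evict everything that conflicts with it, so the
            # candidate selection always respects the overlap limit
            candidate = {j for j in current if not conflicts(i, j)} | {i}
        candidate_score = sum(scores[j] for j in candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / t)
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = set(current), current_score
    return sorted(best), best_score


clusters = [list("ABCDEF"), list("ABCD"), list("DEF")]
scores = [10, 6, 6]
picked, total = anneal_select(clusters, scores)
print(picked, total)
```

Because worse moves are occasionally accepted, the search can leave a locally good selection such as {C1} and reach combinations the greedy approach never considers. For 50 clusters the step count and temperatures would need experimenting.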