Question

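Hopefully this can be done in Python! I ran two clustering programs on the same data and now have two cluster files. I reformatted the files so that they look like this: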

Cluster 0:
Brucellaceae(10)
    Brucella(10)
        abortus(1)
        canis(1)
        ceti(1)
        inopinata(1)
        melitensis(1)
        microti(1)
        neotomae(1)
        ovis(1)
        pinnipedialis(1)
        suis(1)
Cluster 1:
    Streptomycetaceae(28)
        Streptomyces(28)
            achromogenes(1)
            albaduncus(1)
            anthocyanicus(1)

etc.

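These files contain bacterial species information. So I have the cluster number (Cluster 0), then right below it the 'family' (Brucellaceae) and the number of bacteria in that family (10). Under that is the genus found in that family (name followed by a count, Brucella(10)), and finally the species in each genus (abortus(1), etc.).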

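My question: I have 2 files formatted this way and want to write a program that finds the differences between them. The only catch is that the two programs cluster in different ways, so two clusters may be identical even if the actual "cluster number" differs (the contents of Cluster 1 in one file might match Cluster 43 in the other, with the cluster number being the only difference). So I need to ignore the cluster numbers and focus on the cluster contents.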

Is there any way to compare these two files and check for differences? Is it even possible? Any ideas would be greatly appreciated!

Answer 1

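Just to help out, since I see a lot of different answers in the comments, I'll give you a very, very simple script implementation that you can start from.

Note that this does not answer your full question, but it points you in one of the directions mentioned in the comments.

Normally, if you have no experience, I would tell you to go read up on Python (which I will do anyway; I'll add some links at the bottom of the answer).

On to the fun stuff! :)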

class Cluster(object):
  '''
  This is a class that will contain your information about the Clusters.
  '''
  def __init__(self, number):
    '''
    This is what some languages call a constructor, but it's not.
    This method initializes the properties with values from the method call.
    '''
    self.cluster_number = number
    self.family_name = None
    self.bacteria_name = None
    self.bacteria = []

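#This part below isn't a part of the class, this is the actual script.
with open('bacteria.txt', 'r') as file:
  cluster = None
  clusters = []
  for line in file:
    line = line.strip()  # drop the indentation and the trailing newline
    if not line:
      continue  # skip blank lines
    if line.startswith('Cluster'):
      # 'Cluster 0:' -> 0, so the object carries the real cluster number
      cluster = Cluster(int(line.split()[1].rstrip(':')))
      clusters.append(cluster)
    elif cluster is None:
      continue  # ignore anything before the first 'Cluster' line
    elif not cluster.family_name:
      cluster.family_name = line
    elif not cluster.bacteria_name:
      cluster.bacteria_name = line
    else:
      cluster.bacteria.append(line)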

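I wrote it dumb and overly simple because I don't have anything fancy here, just Python 2.7.2. You can copy this into a .py file and run it directly from the command line with python bacteria.py.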
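To get from parsing to the actual comparison, one possible next step is to turn each cluster into a frozenset of its lines, so the cluster number drops out and two files can be compared with plain set operations. This is only a sketch: read_clusters and the sample lists are made up for illustration, and it assumes an exact match (names and counts) is what you want. Note that it flattens the family/genus/species hierarchy inside each cluster.

```python
def read_clusters(lines):
    """Return a set of frozensets, one per cluster, ignoring cluster numbers."""
    clusters = []
    for line in lines:
        text = line.strip()
        if not text:
            continue                    # skip blank lines
        if text.startswith('Cluster'):
            clusters.append(set())      # a header starts a new cluster
        elif clusters:
            clusters[-1].add(text)      # e.g. 'Brucella(2)' or 'canis(1)'
    return {frozenset(c) for c in clusters}

# Hypothetical sample data; real input would come from open(...) on each file.
file_a = ['Cluster 0:', 'Brucellaceae(2)', '    Brucella(2)',
          '        abortus(1)', '        canis(1)',
          'Cluster 1:', 'Streptomycetaceae(1)', '    Streptomyces(1)',
          '        albaduncus(1)']
file_b = ['Cluster 7:', 'Streptomycetaceae(2)', '    Streptomyces(2)',
          '        albaduncus(1)', '        anthocyanicus(1)',
          'Cluster 43:', 'Brucellaceae(2)', '    Brucella(2)',
          '        abortus(1)', '        canis(1)']

a = read_clusters(file_a)
b = read_clusters(file_b)
same = a & b        # clusters with identical contents in both files
only_a = a - b      # clusters that appear only in the first file
only_b = b - a      # clusters that appear only in the second file

print('identical clusters:', len(same))
print('only in first file:', [sorted(c) for c in only_a])
print('only in second file:', [sorted(c) for c in only_b])
```

Here Cluster 0 of the first list and Cluster 43 of the second count as the same cluster, while the two Streptomycetaceae clusters are reported as different because their contents differ.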

Hope this helps a bit. If you have any questions, feel free to drop by our Python chat room! :)

Answer 2

Assuming:

file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

Is this what you need?

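def parse_file(open_file):
    # assumes one space of indentation per level, as in the sample above
    result = []
    levels = ['', '', '']

    for line in open_file:
        if not line.strip():
            continue  # skip blank lines
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            # a 'Cluster N:' header: reset the names and ignore the number
            levels = ['', '', '']
            continue
        item = line.strip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            # family.genus.species, e.g. 'giant.red.brick'
            result.append('.'.join(levels))
    return result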

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]

To see the differences:

for desc, items in differences:
    print(desc)
    print()
    for item in items:
        print('\t' + item)
    print()

This prints:

common elements

    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower

missing from file2

    tiny.green.dot
    giant.red.apple

missing from file1

    giant.red.tomato
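The comparison above flattens everything into one set per file, so it no longer says which cluster an item was in. If the grouping matters, one option is to keep a set per cluster and pair each cluster in the first file with the most similar cluster in the second, here scored with the Jaccard index (shared items divided by all items). This is a rough Python 3 sketch, not part of the code above: species_by_cluster, jaccard and best_matches are illustrative names, and the one-space-per-level indentation of the sample is assumed.

```python
def species_by_cluster(lines):
    """Map each cluster header to its set of 'family.genus.species' paths."""
    clusters, current, levels = {}, None, ['', '', '']
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith('Cluster'):
            current = clusters.setdefault(text.rstrip(':'), set())
            levels = ['', '', '']
            continue
        indent = len(line) - len(line.lstrip())   # 1, 2 or 3 in the sample
        levels[indent - 1] = text.split('(', 1)[0]
        if indent == 3 and current is not None:
            current.add('.'.join(levels))
    return clusters

def jaccard(a, b):
    """Shared items divided by all items; 1.0 means identical sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def best_matches(clusters1, clusters2):
    """For each cluster in clusters1, the most similar cluster in clusters2."""
    matches = {}
    for name1, items1 in clusters1.items():
        name2 = max(clusters2, key=lambda n: jaccard(items1, clusters2[n]))
        matches[name1] = (name2, jaccard(items1, clusters2[name2]))
    return matches

# Same toy data as above, written as explicit lists of lines.
file1 = ['Cluster 0:', ' giant(2)', '  red(2)', '   brick(1)', '   apple(1)',
         'Cluster 1:', ' tiny(3)', '  green(1)', '   dot(1)',
         '  blue(2)', '   flower(1)', '   candy(1)']
file2 = ['Cluster 18:', ' giant(2)', '  red(2)', '   brick(1)', '   tomato(1)',
         'Cluster 19:', ' tiny(2)', '  blue(2)', '   flower(1)', '   candy(1)']

matches = best_matches(species_by_cluster(file1), species_by_cluster(file2))
for name1, (name2, score) in sorted(matches.items()):
    print('%s ~ %s (Jaccard %.2f)' % (name1, name2, score))
```

A score of 1.0 would mean the two clusters hold exactly the same species; anything lower points at a cluster worth inspecting with the set differences shown earlier. The sketch assumes the second file has at least one cluster.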

Answer 3

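You will have to write some code to parse the files. If you ignore the clusters, you should be able to tell family, genus and species apart based on the indentation.

The simplest way is to define a named tuple: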

import collections
Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])

You can instantiate this object like this:

b = Bacterium('Brucellaceae', 'Brucella', 'canis')

Your parser should read the file line by line and set the family and genus. If it then finds a species, it should add a Bacterium to the list:

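with open('cluster0.txt', 'r') as infile:
    lines = infile.readlines()
family = genera = None
family_indent = genera_indent = None
bacteria = []
for line in lines:
    name = line.strip().split('(', 1)[0]
    indent = len(line) - len(line.lstrip())
    if not name:
        continue  # blank line
    if name.startswith('Cluster'):
        family_indent = genera_indent = None  # a new cluster starts here
    elif family_indent is None or indent <= family_indent:
        family, family_indent, genera_indent = name, indent, None  # set family
    elif genera_indent is None or indent <= genera_indent:
        genera, genera_indent = name, indent  # set genera
    else:  # indented deeper than the genus, so this is a bacterium
        bacteria.append(Bacterium(family, genera, name))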

Once you have a list of all the bacteria in each file or cluster, you can select from it like this:

s = [b for b in bacteria if b.family == 'Streptomycetaceae']
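Since named tuples are hashable and compare by value, two parsed lists can also be compared directly with set operations. A small sketch, where bacteria1 and bacteria2 are made-up stand-ins for the parse results of the two files:

```python
import collections

Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])

# Hypothetical parse results for the two files.
bacteria1 = [Bacterium('Brucellaceae', 'Brucella', 'canis'),
             Bacterium('Brucellaceae', 'Brucella', 'suis')]
bacteria2 = [Bacterium('Brucellaceae', 'Brucella', 'canis'),
             Bacterium('Brucellaceae', 'Brucella', 'ovis')]

in_both = set(bacteria1) & set(bacteria2)
only_in_1 = set(bacteria1) - set(bacteria2)
only_in_2 = set(bacteria2) - set(bacteria1)

for b in sorted(only_in_1):
    print('only in file 1: %s %s (%s)' % (b.genera, b.species, b.family))
for b in sorted(only_in_2):
    print('only in file 2: %s %s (%s)' % (b.genera, b.species, b.family))
```

Two Bacterium tuples are equal only if family, genus and species all match, so a species that moved to a different genus shows up in both difference sets.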

Answer 4

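After learning so much from Stack Overflow, I finally have a chance to give back! A different approach from the ones offered so far is to relabel the clusters to maximize alignment, after which the comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1 = [0,0,1,1,2,2] and another assigns L2 = [2,2,0,0,1,1], you would want these two labelings to be treated as equivalent, since L1 and L2 essentially partition the items into the same clusters. This approach relabels L2 to maximize alignment, and in the example above it results in L2 == L1.

I found this approach in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012.", and below is a Python implementation using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done: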

import numpy as np

def alignClusters(clstr1, clstr2):
    """Given 2 cluster assignments, this function will rename the second to
    maximize alignment of elements within each cluster. This method is
    described in Menéndez, Héctor D. A genetic approach to the graph and
    spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
    are consecutive integers starting with zero)

    INPUTS:
    clstr1 - The first clustering assignment
    clstr2 - The second clustering assignment

    OUTPUTS:
    clstr2_temp - The second clustering assignment with clusters renumbered to
    maximize alignment with the first clustering assignment """
    # accept plain lists as well as numpy arrays
    clstr1 = np.asarray(clstr1)
    clstr2 = np.asarray(clstr2)
    K = np.max(clstr1) + 1
    simdist = np.zeros((K, K))

    # simdist[i,j]: overlap of cluster i (first labeling) and cluster j
    # (second labeling), averaged over the sizes of the two clusters
    for i in range(K):
        for j in range(K):
            dcix = clstr1 == i
            dcjx = clstr2 == j
            dd = np.dot(dcix.astype(int), dcjx.astype(int))
            simdist[i,j] = (dd/np.sum(dcix!=0) + dd/np.sum(dcjx!=0))/2
    # greedily pair the most similar clusters, then rule out that row and column
    mask = np.zeros((K, K))
    for i in range(K):
        simdist_vec = np.reshape(simdist.T, (K**2, 1))
        I = np.argmax(simdist_vec)
        xy = np.unravel_index(I, simdist.shape, order='F')
        x = xy[0]
        y = xy[1]
        mask[x,y] = 1
        simdist[x,:] = 0
        simdist[:,y] = 0
    # mask[x,y] == 1 means: label y in clstr2 becomes label x
    swapJ, swapI = np.where(mask.T)
    clstr2_temp = np.copy(clstr2)
    for k in range(swapI.shape[0]):
        clstr2_temp[clstr2 == swapJ[k]] = swapI[k]
    return clstr2_temp
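alignClusters expects two label vectors over the same items in the same order, which the cluster files do not give you directly. One way to get there, assuming each file has already been parsed into a dict mapping an item (say, a species name) to its cluster number, is sketched below. label_vectors and the sample dicts are made up for illustration and use only the standard library.

```python
def label_vectors(assign1, assign2):
    """Turn two {item: cluster_id} dicts into aligned label lists.

    Only items present in both dicts are kept, in sorted order, and each
    clustering's ids are renumbered 0, 1, 2, ... in order of first appearance.
    """
    items = sorted(set(assign1) & set(assign2))

    def renumber(assign):
        ids = {}
        # setdefault hands out the next free integer for an unseen cluster id
        return [ids.setdefault(assign[item], len(ids)) for item in items]

    return items, renumber(assign1), renumber(assign2)

# Hypothetical parse results: species -> cluster number in each file.
assign1 = {'abortus': 0, 'canis': 0, 'achromogenes': 1, 'albaduncus': 1}
assign2 = {'abortus': 43, 'canis': 7, 'achromogenes': 7, 'albaduncus': 7}

items, labels1, labels2 = label_vectors(assign1, assign2)
print(items)
print(labels1)
print(labels2)
```

The two lists can then be passed to alignClusters, and positions where the aligned labels still differ are the items the two programs clustered differently. Renumbering in order of first appearance already makes two identical partitions produce identical lists; the alignment step matters when the partitions only partly agree.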

Answer 5

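Comparing two clusterings is not a trivial task, and reinventing the wheel is unlikely to succeed. Take a look at this package, which has many different clustering similarity measures and can compare dendrograms (the data structure you have).

The library is called CluSim and can be found at: https://github.com/Hoosier-Clusters/clusim/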
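To give a feel for what a clustering similarity measure computes, here is the Rand index written in plain Python. This is not CluSim's API, just an illustrative sketch with a made-up function name: it counts the fraction of item pairs that both clusterings treat the same way (together in both, or apart in both), so the actual label values do not matter.

```python
from itertools import combinations

def rand_index(labels1, labels2):
    """Fraction of item pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(labels1)), 2))
    agree = sum((labels1[i] == labels1[j]) == (labels2[i] == labels2[j])
                for i, j in pairs)
    return agree / len(pairs)

# Same partition under different label numbers.
same_partition = rand_index([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
# One item moved to another cluster.
one_moved = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
print(same_partition, one_moved)
```

The sketch needs at least two items and compares every pair, so it is only practical for small inputs; a dedicated package is the better choice for real data.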

How do I compare clusters?
