Forming clusters based on a condition?

Time: 2013-09-27 03:15:20

Tags: python python-2.7 python-3.x

Sample input file (the actual input file contains about 50,000 entries):

615 146 
615 180 
615 53  
615 42  
615 52  
615 52  
615 51  
615 45  
615 49
616 34
616 44
616 42
616 41
616 42
617 42
617 43
617 42
685 33
685 33
685 33
686 33
686 33
687 47
687 68
737 449
737 41
737 1138
738 46
738 53  

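I have to group rows by the value in the first column (column 0): all rows with 615 must go into one cluster, and that cluster must hold their column 1 values, i.e. 146, 180 ... 45, 49. When the first-column value changes, the cluster must end and a new one must start for the next run of identical values (616, 616, 616 ...), and so on.

The code I wrote is: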

from __future__ import division
from sys import exit
h = 0
historyjobs = []
targetjobs = []


def quickzh(zhlistsub,
            targetjobs=targetjobs, num=0, denom=0):
    li = []; ji = []
    j = 0
    for i in zhlistsub:
        x1 = targetjobs[j][0]

        x = targetjobs[i][0]

        num += x
        denom += 1
        if x1 >= 0.9 * (num / denom):  # to group all items with same value in column 0
            li.append(targetjobs[i][1])
        else:
            break
    return li


def filewr(listli):
    global h
    s = open("newout1", "a")
    if len(listli) != 0:
        h += 1
        s.write("cluster: %d" % h)
        s.write("\n")
        s.write(str(listli))
        s.write("\n\n")
    else:
        print "0"


def new(inputfile,
        historyjobs=historyjobs, targetjobs=targetjobs):
    zhlistsub = []; zhlist = []
    k = 0

    with open(inputfile, 'r') as f:
        for line in f:
            job = map(int, line.split())
            targetjobs.append(job)
        while True:
            if len(targetjobs) != 0:

                zhlistsub = [i for i, element in enumerate(targetjobs)]

                if zhlistsub:
                    listrun = quickzh(zhlistsub)
                    filewr(listrun)
                historyjobs.append(targetjobs.pop(0))
                k += 1
            else:
                break

new('newfinal1')

The output I get is:

 cluster: 1
 [146, 180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]

 cluster: 2
 [180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]

 cluster: 3
 [53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]
 ... and so on

But the output I need is:

  cluster: 1
  [146, 180, 53, 42, 52, 52, 51, 45, 49]
  cluster: 2
  [34, 44, 42, 41, 42]
  cluster: 3
  [42, 43, 42]
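  ... and so on

Can anyone suggest what changes I should make to get the desired result? That would be very helpful.

2 Answers:

Answer 0: (score: 1)

Try this. groupby takes care of creating the clusters, and all that is left to do is build the lists: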

import itertools as it
[[y[1] for y in x[1]] for x in it.groupby(data, key=lambda x:x[0])]

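The above assumes that data holds your input, already filtered and sorted by the first column (groupby only merges adjacent rows with equal keys). For the example in the question, it looks like this: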

data = [[615, 146], [615, 180], [615, 53] ... ]
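
A slightly fuller sketch of the same idea, which parses the text rows and numbers the clusters as in the desired output. The helper names parse_pairs and clusters_from_pairs are made up for illustration, and the sample holds only a few rows from the question:

```python
import itertools as it


def parse_pairs(lines):
    """Turn 'key value' text lines into [key, value] integer pairs."""
    return [[int(tok) for tok in line.split()] for line in lines if line.strip()]


def clusters_from_pairs(data):
    """Group consecutive rows that share column 0; keep only column 1."""
    return [[row[1] for row in group]
            for _, group in it.groupby(data, key=lambda row: row[0])]


# A few rows from the question's sample input.
sample = """615 146
615 180
615 53
616 34
616 44
617 42
""".splitlines()

clusters = clusters_from_pairs(parse_pairs(sample))
for number, values in enumerate(clusters, 1):
    print("cluster: %d" % number)
    print(values)
```

To run this on the real input, an open file object can be passed to parse_pairs in place of sample, since it only needs an iterable of lines.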

Answer 1: (score: 1)

I have not tested this answer, but it follows this concept:

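from collections import defaultdict

inputfile = 'newfinal1'  # input file name taken from the question

cluster = defaultdict(list)

with open(inputfile, 'r') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        clus, val = line.split()
        cluster[int(clus)].append(int(val))

# iterate over sorted keys, since dict order is not guaranteed on older Pythons
for clus in sorted(cluster):
    print("cluster %d" % clus)
    print(cluster[clus])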
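
One difference between the two answers is worth noting: groupby only merges adjacent rows, while a dictionary collects every row with the same key wherever it appears. For the sorted input in the question both should give the same clusters. A small sketch with made-up, deliberately unsorted rows shows where they diverge:

```python
from collections import defaultdict
from itertools import groupby

rows = [(615, 146), (616, 34), (615, 180)]  # made-up, deliberately unsorted

# groupby starts a new group every time the key changes
by_run = [[v for _, v in grp] for _, grp in groupby(rows, key=lambda r: r[0])]

# a dict gathers all values for a key, regardless of position
by_key = defaultdict(list)
for k, v in rows:
    by_key[k].append(v)

print(by_run)
print([by_key[k] for k in sorted(by_key)])
```

If the file might not be sorted by the first column, the dictionary approach (or sorting before groupby) is the safer choice.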