Sample input file (the actual input file contains about 50,000 entries):
615 146
615 180
615 53
615 42
615 52
615 52
615 51
615 45
615 49
616 34
616 44
616 42
616 41
616 42
617 42
617 43
617 42
685 33
685 33
685 33
686 33
686 33
687 47
687 68
737 449
737 41
737 1138
738 46
738 53
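I need to group rows by the value in the first column: for example, all the 615 rows must go into one cluster, and that cluster must contain the corresponding column 1 (second column) values, such as 146, 180, ..., 45, 49. The cluster must then end, and the next cluster is formed from the next run of identical values (616, 616, 616, ...), and so on.
The code I wrote is: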
from __future__ import division
from sys import exit
h = 0
historyjobs = []
targetjobs = []
def quickzh(zhlistsub,
            targetjobs=targetjobs,num=0,denom=0):
    li = [] ; ji = []
    j = 0
    for i in zhlistsub:
        x1 = targetjobs[j][0]
        x = targetjobs[i][0]
        num += x
        denom += 1
        if x1 >= 0.9 * (num/denom):#to group all items with same value in column 0
            li.append(targetjobs[i][1])
        else:
            break
    return li
def filewr(listli):
    global h
    s = open("newout1","a")
    if(len(listli) != 0):
        h += 1
        s.write("cluster: %d"%h)
        s.write("\n")
        s.write(str(listli))
        s.write("\n\n")
    else:
        print "0"
def new(inputfile,
        historyjobs=historyjobs,targetjobs=targetjobs):
    zhlistsub = [];zhlist = []
    k = 0
    with open(inputfile,'r') as f:
        for line in f:
            job = map(int,line.split())
            targetjobs.append(job)
    while True:
        if len(targetjobs) != 0:
            zhlistsub = [i for i, element in enumerate(targetjobs)]
            if zhlistsub:
                listrun = quickzh(zhlistsub)
                filewr(listrun)
            historyjobs.append(targetjobs.pop(0))
            k += 1
        else:
            break
new('newfinal1')
The output I get is:
cluster: 1
[146, 180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]
cluster: 2
[180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]
cluster: 3
[53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]
... and so on
But the output I need is:
cluster: 1
[146, 180, 53, 42, 52, 52, 51, 45, 49]
cluster: 2
[34, 44, 42, 41, 42]
cluster: 3
[42, 43, 42]
... and so on
Can anyone suggest what changes I should make to get the desired result? That would be very helpful.
Answer 0 (score: 1)
Try this: groupby takes care of creating the clusters, and all that is left to do is build the lists:
import itertools as it
[[y[1] for y in x[1]] for x in it.groupby(data, key=lambda x:x[0])]
The above assumes that data holds your input, already filtered and sorted by the first column. For the example in the question, it looks like this:
data = [[615, 146], [615, 180], [615, 53] ... ]
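To connect this to the input file in the question, here is a minimal sketch of the whole path from text lines to clusters. The helper names read_pairs and clusters_from_pairs are made up for illustration, and the sketch assumes every non-blank line holds two whitespace-separated integers and that equal first-column values are already adjacent, as in the sample input.

```python
import itertools as it

def read_pairs(lines):
    """Parse 'key value' lines into [key, value] integer pairs, skipping blank lines."""
    return [[int(tok) for tok in line.split()] for line in lines if line.strip()]

def clusters_from_pairs(data):
    """Collect column-1 values for each run of consecutive rows sharing column 0."""
    return [[row[1] for row in rows]
            for _, rows in it.groupby(data, key=lambda row: row[0])]

# A few rows in the same shape as the question's input (hypothetical subset).
sample = ["615 146", "615 180", "616 34", "616 44", "617 42"]

for n, values in enumerate(clusters_from_pairs(read_pairs(sample)), 1):
    print("cluster: %d" % n)
    print(values)
```

For a real file, an open file object can be passed to read_pairs in place of the sample list, since iterating over a file yields its lines. Note that groupby only merges adjacent rows, so unsorted input would need sorting by the first column beforehand.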
Answer 1 (score: 1)
I have not tested this answer, but it follows this concept:
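from collections import defaultdict

inputfile = 'newfinal1'  # file name taken from the question
cluster = defaultdict(list)
with open(inputfile, 'r') as f:
    for line in f:
        clus, val = line.split()
        # collect each column-1 value under its column-0 key
        cluster[int(clus)].append(int(val))
# iterate over (key, values) pairs in ascending key order
for clus, val in sorted(cluster.items()):
    print("cluster " + str(clus))
    print(str(val) + "\n")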
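One difference from the groupby approach may be worth seeing in code: a dictionary gathers every row with the same key no matter where it appears in the file, so the input does not have to be sorted. The sketch below is only an illustration under assumptions: the function name clusters_by_key is hypothetical, and keys and values are assumed to be integers. The keys are sorted explicitly because dictionary iteration order is not guaranteed on older Python versions, including Python 2, which the question uses.

```python
from collections import defaultdict

def clusters_by_key(lines):
    """Group column-1 values by column-0 key, regardless of row order."""
    cluster = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        clus, val = line.split()
        cluster[int(clus)].append(int(val))
    # Return (key, values) pairs in ascending key order.
    return [(key, cluster[key]) for key in sorted(cluster)]

# Rows deliberately interleaved to show that adjacency is not required.
unsorted_lines = ["616 34", "615 146", "616 44", "615 180"]
for key, values in clusters_by_key(unsorted_lines):
    print("cluster %d" % key)
    print(values)
```

The trade-off is that all clusters are held in memory until the file has been read, which should be fine for roughly 50,000 entries.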