使用bash或python(2.4.x)
我有一个文件 - 文件中大约有100行,文件结构如下。
aaaaa, 100
aaaab, 75
aaaac, 150
aaaad, 135
aaaae, 144
aaaaf, 12
aaaag, 5
aaaah, 34
aaaai, 11
aaaaj, 43
aaaak, 88
aaaal, 3
baaaa, 25
baaab, 33
baaac, 87
baaad, 111
baaae, 45
baaaf, 99
baaag, 71
baaah, 68
baaai, 168
baaaj, 21
baaak, 11
baaal, 47
caaaa, 59
caaab, 85
caaac, 77
caaad, 33
caaae, 44
caaaf, 16
caaag, 111
caaah, 141
caaai, 87
caaaj, 59
caaak, 89
caaal, 3
我想做的是将它分成12列,每列的传感器数量大致相同,每列的总和接近相同。
换句话说,如果我采用上面的列表并将其拆分为此类。
aaaaa 100 aaaab 75 baaab 33
aaaai 11 baaah 68 baaac 87
aaaak 88 caaaa 59 caaac 77
199 202 197
aaaah 34 baaaf 99 caaad 33
baaad 111 baaal 47 aaaac 150
aaaaj 43 caaae 44 caaaf 16
188 190 199
aaaag 5 aaaaf 12 baaaa 25
aaaad 135 caaai 87 caaag 111
caaaa 59 caaak 89 baaag 71
199 188 207
aaaae 144 baaaj 21 caaaj 59
aaaal 3 baaak 11 caaah 141
baaae 45 baaai 168 caaal 3
192 200 203
它产生12列相等的项目并且非常接近均值。
我可以手动完成,但我们最终需要这样做几次。我甚至不确定从哪里开始除了把它变成一个数组,计算数组中的项目并进行随机分割。仍然坚持价值平衡。
任何指针?
答案 0 :(得分:3)
如果您想要最佳解决方案,这对于大型输入来说并不会很有趣。你正在寻找一些符合CS中一些非常着名的难题的东西 - Knapsack,Bin Packing等。一些更简单,不太完美的解决方案可能足够接近。
这不准确但是,鉴于您的示例数据集,我设法通过一种非常简单的方法获得214,197,194,199,205,182,195,192,199,199,206,208的大小。它可能适用也可能不适用于实际数据。
方法是:
随着您接近最佳分区,解决方案会变得更加复杂。
答案 1 :(得分:1)
有趣的问题,我认为你很难找到最好的解决方案。您可以计算每次拆分的项目数和它们应具有的平均值。对整数上的项目进行排序并获取最大数字,同时该值仍然低于平均值,然后重复此过程,直到您只需要再添加一个项目,现在选择最小项目并尝试尽可能接近平均值(过度或低于无关紧要)。
如果除了最新版本之外的任何步骤(例如价值>平均值),请返回并选择下一个最大值。
答案 2 :(得分:1)
我写了两个非常简单的实现。第一个使用双端队列从右侧和左侧弹出(一旦列表被排序),将低值放置为高值。第二个是@Sean McSomething建议的那个。
这是代码(quick'n'dirty - 遗憾的是几条评论):
import math
import itertools
import collections
def sum_column(data):
return sum(zip(*data)[1], 0.0)
def split_groups(sensors):
sensors.sort(key=lambda item: item[1], reverse=True)
per_group = len(sensors) // 12
average = sum_column(sensors) / len(sensors)
data = collections.deque(sensors)
groups = [[] for i in xrange(12)]
cycle = itertools.cycle(groups)
try:
while True:
current = cycle.next()
if len(current) == per_group - 1:
if sum_column(current) < average:
current.append(data.popleft())
else:
current.append(data.pop())
continue
current.append(data.popleft())
current.append(data.pop())
except IndexError:
return groups
def split_groups2(sensors):
sensors.sort(key=lambda item: item[1], reverse=True)
groups = [[] for i in xrange(12)]
cycle = itertools.cycle(groups)
per_group = int(math.ceil(len(sensors) / 3.))
partitions = [sensors[i:i + per_group] for i in xrange(0, len(sensors)
per_group)]
medium, low = map(reversed, partitions[1:])
for sensor, value in itertools.chain(partitions[0], medium, low):
cycle.next().append((sensor, value))
return groups
def format_groups(result):
ret = []
for group in result:
tmp = []
tmp.append('\n'.join('{0} {1}'.format(k, v) for k, v in group))
tmp.append(' ' * 8 + str(int(sum_column(group))))
ret.append('\n'.join(tmp))
return '\n\n'.join(ret)
if __name__ == '__main__':
import sys
implementation = split_groups
if '--second' in sys.argv:
sys.argv.remove('--second')
implementation = split_groups2
with open(sys.argv[1]) as fobj:
sensors = []
for line in fobj:
sensor, value = line.strip().split(', ')
sensors.append((sensor, int(value)))
sys.stdout.write(format_groups(split_groups(sensors)))
sys.stdout.write('\n')
在一个要点中:
https://gist.github.com/2703965
我把简单的部分(格式化)留给了你。现在它只是垂直打印(而不是你所要求的orizontally)。这应该不会太难。
这是它可以达到的最佳效果(两种实现方式):
(max(sums) - min(sums)) / 2. = 16.0
这远不是这个例子,但它是一个开始。您可以从命令行使用文件名和可选的--second
开关启动它(以使用第二个实现)。我可以使用命令行解析器,但我习惯了argparse,这在Python 2.4中不存在。所以我只是去找那个尴尬的黑客。
示例运行:
$ python2 groupit.py filename.txt
baaai 168
caaal 3
aaaaj 43
214
aaaac 150
aaaal 3
caaae 44
197
aaaae 144
aaaag 5
baaae 45
194
caaah 141
baaak 11
baaal 47
199
aaaad 135
aaaai 11
caaaj 59
205
baaad 111
aaaaf 12
caaaa 59
182
caaag 111
caaaf 16
baaah 68
195
aaaaa 100
baaaj 21
baaag 71
192
baaaf 99
baaaa 25
aaaab 75
199
caaak 89
caaad 33
caaac 77
199
aaaak 88
baaab 33
caaab 85
206
baaac 87
aaaah 34
caaai 87
208
$ python2 groupit.py --second filename.txt
baaai 168
caaal 3
aaaaj 43
214
aaaac 150
aaaal 3
caaae 44
197
aaaae 144
aaaag 5
baaae 45
194
caaah 141
baaak 11
baaal 47
199
aaaad 135
aaaai 11
caaaj 59
205
baaad 111
aaaaf 12
caaaa 59
182
caaag 111
caaaf 16
baaah 68
195
aaaaa 100
baaaj 21
baaag 71
192
baaaf 99
baaaa 25
aaaab 75
199
caaak 89
caaad 33
caaac 77
199
aaaak 88
baaab 33
caaab 85
206
baaac 87
aaaah 34
caaai 87
208
通过问题中的示例,两个算法给出了相同的答案。如果您可以提供更多测试用例,我会尝试改进它们。我在Python 2.7上测试了脚本,因为我没有安装2.4 对不起,答案很长。
答案 3 :(得分:0)
这是我到目前为止所提出的。我认为它非常接近,没有真正高价值的东西抵消了一点 - 它很接近。
很高兴听到任何建议让它更加pythonic。
file:columnsplit.py
#!/usr/bin/python
import sys, operator
# usage
# columnsplit.py <filename> <#cols>
# columnsplit.py test.csv 12
#
#determine number of devices per column
def devicelisting(fulllist,percolumn):
devicelist=[]
fobj=open(fulllist,'r')
for line in fobj:
(key, val) = line.split(',')
devicelist.append((key,int(val)))
devicespercol=(len(devicelist)/int(percolumn))
return(devicelist,devicespercol)
def devicesplit(fulllist,numcolumns,roundnum):
if roundnum == 0:
devices=sorted(fulllist, key=lambda device: device[1], reverse=True)
devicestemp=devices
else:
devices=sorted(fulllist, key=lambda device: device[1])
devicestemp=devices
deviceslice=[]
for idx, val in zip(range(numcolumns), devices):
deviceslice.append(val)
devicestemp.remove(val)
return(deviceslice,devicestemp)
def makecolumns(roundnumber,percol):
column=[]
for i in range(percol):
exec('tempslice=deviceslice%s' % i)
column.append(tempslice[roundnumber])
return(column)
# what this is going to do is generate how many devices will fill each of the intended
# number of columns. What is left over will be run again against the lowest value of columns
if __name__ == '__main__':
tempslice=[]
devices,percol=devicelisting(sys.argv[1],sys.argv[2])
# devices is the devices/value as tuples nested in a list
# percol is going to be how many devices per column
# you can len(devices) to count how many devices we have
# prints out the device list in reverse.
# print sorted(devices, key=lambda x: x[1], reverse=True)
# what we will need to do here is split the device list into number of desired slices. i.e. if we want 12 columns
# and we have 108 devices there should be 9 slices of 12.
# this will leave a remaining slice - of less than 9 which will be added to the 12 columns in order of smallest column first
devicesleft=devices
numcolumns=int(sys.argv[2])
for i in range(percol):
sendcol,devicesleft=devicesplit(devicesleft,numcolumns,i)
exec('deviceslice%s=sendcol' % i)
# and finally create the columns
for i in range(0,numcolumns):
sendcol=makecolumns(i,percol)
exec('column%s=sendcol' % i)
# add the left over devices
j=numcolumns
# sort remaining reverse.
devices=sorted(devicesleft, key=lambda device: device[1], reverse=True)
for i in range(len(devices)):
j-=1
exec('column%s.append(devices[i])' % j)
# prints out the resulting columns
for i in range(0,numcolumns):
exec('tempcol=column%s' % i)
print tempcol
print sum([pair[1] for pair in tempcol])
我跑过的测试文件。
file:test44a.csv
SQCIEOEO,1272
HIKTXYZH,281
JZHRZXKX,5793
UBGTOLUX,147
WBVYFNBN,9
VMHTKHBU,32
GILGFWDA,1334
YKUMWOKT,2066
PFSVTUIP,51
GPJRWKMD,673
TYJZUNZS,27
XTFUHPNX,2102
VFSPABFG,65
ROYOZKRS,189
IARDNRVL,587
LBFSQTQL,973
ZJBZKGFB,21301
UEPUOHMW,20
HEAVWVGH,0
XMANFQZE,719
ZADKGIMB,82
NCVBJIYR,27
NYMJUSQR,20646
EQFKHEOH,2050
ERRLAENN,19
HIPRQNIE,12557
MVNHODYT,20
UEDBIRIN,14
JAZJEMXL,28
UMDLALPN,36
GCUUGTNA,0
XRCGIKTR,12
KSBPEYBZ,20657
LELLPAYW,43792
DTRKMFLK,73
WNQEXJWI,41
CYXHXYHI,10
CSUSTTOX,120
NFHZLSJH,23
FAMDKJLM,25
HIUEHBNJ,261
UIBNCQKP,40
WSPHKYOQ,30025
ZBUJKFWR,0
OQWVSKFM,49
SHZUXKKU,21
CZBMYQDX,45
RXGBCCTR,17
SPMLASXS,15
ZWNXGXRI,59
WTVUJZSB,22
WYDZBWQU,19100
MDFMVCFV,6133
ZSSGQJPM,25
CKHMJZOG,85
YRFZOWTB,28
AYNWBSRA,14
LJGBTVOW,13110
GWJPWXWU,16
PCUDYNEY,179
MSVNLMOX,62
WUYPPNMW,2285
KVLGTIBI,11
KWMIKQHW,11
JDKUPYRM,1851
DARXQYDY,68
UUPXIDEP,139
SKQZMTFY,4377
ZEPOWAEA,189
BWXRVAPP,167
VFMDIRTA,561
BKANEGMD,2122
LBRICWID,1775
TGVOGLDC,3650
QQGZHAAJ,81
KAXPHJSS,122
LKAOHISA,32
ONOVZSYQ,41
IEPQEPZP,62
QWEXGXQS,0
IQGPZYQO,15
MEJLXIBG,10
MRWRHWHX,10
TMVAJLSS,57
BYIAXYOJ,173
DYUAGWGT,248
ODLVZSST,21
EOTOZLHA,6476
KPBHOQQR,30
OLSVIYOW,539
CZSCSLVX,17
ZPMYBTZL,11
IATWRKOF,12507
WGBEFQBH,41
PUJIFEFE,382
TSDULCGU,9070
DARUKFAG,209
MBLRRNYH,250
IIQNNWSG,25
OWBZYIUC,1808
ILXTRXZD,2012
ZLVRZUYH,269
CPVPLOWZ,108
KYZJGTMO,635
EJHWGHZG,25
TUXTOWBR,11
LXGXLCWW,2313
AVFHPRWT,915
AEPHMPNF,32
KLZZHAQT,56
XWQJZNFA,611
JKHYCDSC,1455
运行它的命令:python columnsplit.py test44a 12(12是所需列的数量)。
示例输出列,其值为第一列。
1) 45577 [('LELLPAYW', 43792), ('HEAVWVGH', 0), ('XRCGIKTR', 12), ('ODLVZSST', 21), ('VMHTKHBU', 32), ('TMVAJLSS', 57), ('KAXPHJSS', 122), ('ZLVRZUYH', 269), ('SQCIEOEO', 1272)]
2) 31906 [('WSPHKYOQ', 30025), ('GCUUGTNA', 0), ('UEDBIRIN', 14), ('WTVUJZSB', 22), ('LKAOHISA', 32), ('ZWNXGXRI', 59), ('UUPXIDEP', 139), ('HIKTXYZH', 281), ('GILGFWDA', 1334)]
3) 23416 [('ZJBZKGFB', 21301), ('ZBUJKFWR', 0), ('AYNWBSRA', 14), ('NFHZLSJH', 23), ('AEPHMPNF', 32), ('MSVNLMOX', 62), ('UBGTOLUX', 147), ('PUJIFEFE', 382), ('JKHYCDSC', 1455)]
4) 23276 [('KSBPEYBZ', 20657), ('QWEXGXQS', 0), ('SPMLASXS', 15), ('FAMDKJLM', 25), ('UMDLALPN', 36), ('IEPQEPZP', 62), ('BWXRVAPP', 167), ('OLSVIYOW', 539), ('LBRICWID', 1775)]
5) 23342 [('NYMJUSQR', 20646), ('WBVYFNBN', 9), ('IQGPZYQO', 15), ('ZSSGQJPM', 25), ('UIBNCQKP', 40), ('VFSPABFG', 65), ('BYIAXYOJ', 173), ('VFMDIRTA', 561), ('OWBZYIUC', 1808)]
6) 21877 [('WYDZBWQU', 19100), ('CYXHXYHI', 10), ('GWJPWXWU', 16), ('IIQNNWSG', 25), ('WNQEXJWI', 41), ('DARXQYDY', 68), ('PCUDYNEY', 179), ('IARDNRVL', 587), ('JDKUPYRM', 1851)]
7) 16088 [('LJGBTVOW', 13110), ('MEJLXIBG', 10), ('RXGBCCTR', 17), ('EJHWGHZG', 25), ('ONOVZSYQ', 41), ('DTRKMFLK', 73), ('ROYOZKRS', 189), ('XWQJZNFA', 611), ('ILXTRXZD', 2012)]
8) 15607 [('HIPRQNIE', 12557), ('MRWRHWHX', 10), ('CZSCSLVX', 17), ('TYJZUNZS', 27), ('WGBEFQBH', 41), ('QQGZHAAJ', 81), ('ZEPOWAEA', 189), ('KYZJGTMO', 635), ('EQFKHEOH', 2050)]
9) 17952 [('IATWRKOF', 12507), ('KVLGTIBI', 11), ('ERRLAENN', 19), ('NCVBJIYR', 27), ('CZBMYQDX', 45), ('ZADKGIMB', 82), ('DARUKFAG', 209), ('GPJRWKMD', 673), ('YKUMWOKT', 2066), ('LXGXLCWW', 2313)]
10) 15982 [('TSDULCGU', 9070), ('KWMIKQHW', 11), ('UEPUOHMW', 20), ('JAZJEMXL', 28), ('OQWVSKFM', 49), ('CKHMJZOG', 85), ('DYUAGWGT', 248), ('XMANFQZE', 719), ('XTFUHPNX', 2102), ('TGVOGLDC', 3650)]
11) 14358 [('EOTOZLHA', 6476), ('ZPMYBTZL', 11), ('MVNHODYT', 20), ('YRFZOWTB', 28), ('PFSVTUIP', 51), ('CPVPLOWZ', 108), ('MBLRRNYH', 250), ('AVFHPRWT', 915), ('BKANEGMD', 2122), ('SKQZMTFY', 4377)]
12) 15683 [('MDFMVCFV', 6133), ('TUXTOWBR', 11), ('SHZUXKKU', 21), ('KPBHOQQR', 30), ('KLZZHAQT', 56), ('CSUSTTOX', 120), ('HIUEHBNJ', 261), ('LBFSQTQL', 973), ('WUYPPNMW', 2285), ('JZHRZXKX', 5793)]