Split a file into equal parts by value

Time: 2012-05-15 15:59:33

Tags: python bash

Using bash or Python (2.4.x)

I have a file with roughly 100 lines, structured as follows.

aaaaa,  100
aaaab,  75
aaaac,  150
aaaad,  135
aaaae,  144
aaaaf,  12
aaaag,  5
aaaah,  34
aaaai,  11
aaaaj,  43
aaaak,  88
aaaal,  3
baaaa,  25
baaab,  33
baaac,  87
baaad,  111
baaae,  45
baaaf,  99
baaag,  71
baaah,  68
baaai,  168
baaaj,  21
baaak,  11
baaal,  47
caaaa,  59
caaab,  85
caaac,  77
caaad,  33
caaae,  44
caaaf,  16
caaag,  111
caaah,  141
caaai,  87
caaaj,  59
caaak,  89
caaal,  3

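What I want to do is split it into 12 columns, with roughly the same number of sensors in each column and with the column sums as close to equal as possible.

In other words, take the list above and split it up like this.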

aaaaa   100     aaaab   75      baaab   33
aaaai   11      baaah   68      baaac   87
aaaak   88      caaaa   59      caaac   77
       199             202              197

aaaah   34      baaaf   99      caaad   33
baaad   111     baaal   47      aaaac   150
aaaaj   43      caaae   44      caaaf   16
       188             190              199

aaaag   5       aaaaf   12      baaaa   25
aaaad   135     caaai   87      caaag   111
caaaa   59      caaak   89      baaag   71
       199             188              207

aaaae   144     baaaj   21      caaaj   59
aaaal   3       baaak   11      caaah   141
baaae   45      baaai   168     caaal   3
       192              200              203

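This produces 12 columns with an equal number of items, each summing to very nearly the mean.

I could do this by hand, but we will end up having to do it several times. I'm not even sure where to start, other than turning it into an array, counting the items in the array and doing a random split. That still leaves me stuck on balancing the values.

Any pointers?

4 answers:

Answer 0: (score: 3)

If you want an optimal solution, this won't be fun for large inputs. What you're looking for falls under some very well-known hard problems in CS: Knapsack, Bin Packing, and the like. A simpler, less-than-perfect solution may be close enough.

This isn't exact, but given your sample data set I managed to get column sums of 214, 197, 194, 199, 205, 182, 195, 192, 199, 199, 206, 208 with a very simple approach. It may or may not work for your real data.

The approach is: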

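  1. Sort the list by size.
  2. Split the list into 3 parts: high, medium and low.
  3. Put each member of the high part into its own set.
  4. Reverse the medium and low lists.
  5. Put them (in that reversed order) into the existing sets.

Solutions get more complicated as you get closer to the optimal partition.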
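
The steps above could be sketched roughly like this. It is only an illustration: the name `deal_high_medium_low` is made up, items are assumed to be `(name, value)` pairs, and the neat one-item-per-part result assumes each third holds exactly as many items as there are groups (as with 36 items and 12 groups).

```python
def deal_high_medium_low(items, num_groups):
    """Deal (name, value) pairs into num_groups lists: high, then reversed medium/low."""
    ordered = sorted(items, key=lambda item: item[1], reverse=True)
    third = -(-len(ordered) // 3)  # ceiling division, so "high" gets any extra item
    high = ordered[:third]
    medium = ordered[third:2 * third]
    low = ordered[2 * third:]
    groups = [[] for _ in range(num_groups)]
    # High values go out largest first; medium and low go out smallest first,
    # so the group holding the largest value also receives the smallest ones.
    sequence = high + medium[::-1] + low[::-1]
    for index, item in enumerate(sequence):
        groups[index % num_groups].append(item)
    return groups
```

Calling `deal_high_medium_low(sensors, 12)` on the 36 sample rows should hand each group one high, one medium and one low item.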

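Answer 1: (score: 1)

Interesting problem; I think you'll have a hard time finding the best solution. What you could do is calculate the number of items each split should have and the average sum each should reach. Sort the items by their integer value and take the largest number while the running sum still stays below the average, then repeat this until you only need to add one more item; at that point choose the smallest item and try to get as close to the average as possible (over or under doesn't matter).

If any step other than the last one fails (e.g. value > average), go back and choose the next largest value.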
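
A rough sketch of that idea, under a few assumptions of mine: `greedy_fill` is a made-up name, items are `(name, value)` pairs with unique names, there are at least as many items as groups, and I read the final step as "pick whichever remaining item lands closest to the average". Scanning from the largest item downwards stands in for "go back and choose the next largest value". It is written against a modern Python (it uses `next()` with a default and `min(key=...)`), so it would need small changes for 2.4.

```python
def greedy_fill(items, num_groups):
    """Greedy sketch: fill each group towards the average group sum."""
    remaining = sorted(items, key=lambda item: item[1], reverse=True)
    per_group = len(remaining) // num_groups
    target = sum(value for _, value in remaining) / float(num_groups)
    groups = []
    for _ in range(num_groups):
        group, total = [], 0
        # All slots but the last: largest remaining value that keeps the sum
        # below the target, falling back to the smallest item if none fits.
        for _ in range(per_group - 1):
            pick = next((item for item in remaining if total + item[1] < target),
                        remaining[-1])
            remaining.remove(pick)
            group.append(pick)
            total += pick[1]
        # Last slot: the remaining item that brings the sum closest to the target.
        pick = min(remaining, key=lambda item: abs(target - total - item[1]))
        remaining.remove(pick)
        group.append(pick)
        groups.append(group)
    # Anything still in `remaining` is left over from an uneven split.
    return groups, remaining
```

Because each group is filled greedily, the later groups only get whatever is left, so they may end up further from the average than the early ones.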

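Answer 2: (score: 1)

I wrote two very simple implementations. The first uses a deque and pops from both the left and the right (once the list is sorted), pairing low values with high ones. The second is the one @Sean McSomething suggested.

Here is the code (quick'n'dirty, and sadly with few comments):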

import math
import itertools
import collections


def sum_column(data):
    return sum(zip(*data)[1], 0.0)


def split_groups(sensors):
    sensors.sort(key=lambda item: item[1], reverse=True)
    per_group = len(sensors) // 12
    average = sum_column(sensors) / len(sensors)
    data = collections.deque(sensors)
    groups = [[] for i in xrange(12)]
    cycle = itertools.cycle(groups)
    try:
        while True:
            current = cycle.next()
            if len(current) == per_group - 1:
                if sum_column(current) < average:
                    current.append(data.popleft())
                else:
                    current.append(data.pop())
                continue
            current.append(data.popleft())
            current.append(data.pop())
    except IndexError:
        return groups


def split_groups2(sensors):
    sensors.sort(key=lambda item: item[1], reverse=True)
    groups = [[] for i in xrange(12)]
    cycle = itertools.cycle(groups)
    per_group = int(math.ceil(len(sensors) / 3.))
    partitions = [sensors[i:i + per_group] for i in xrange(0, len(sensors),
                                                           per_group)]
    medium, low = map(reversed, partitions[1:])
    for sensor, value in itertools.chain(partitions[0], medium, low):
        cycle.next().append((sensor, value))
    return groups


def format_groups(result):
    ret = []
    for group in result:
        tmp = []
        tmp.append('\n'.join('{0}   {1}'.format(k, v) for k, v in group))
        tmp.append(' ' * 8 + str(int(sum_column(group))))
        ret.append('\n'.join(tmp))
    return '\n\n'.join(ret)


if __name__ == '__main__':
    import sys

    implementation = split_groups
    if '--second' in sys.argv:
        sys.argv.remove('--second')
        implementation = split_groups2

    with open(sys.argv[1]) as fobj:
        sensors = []
        for line in fobj:
            sensor, value = line.strip().split(',  ')
            sensors.append((sensor, int(value)))
        sys.stdout.write(format_groups(implementation(sensors)))
        sys.stdout.write('\n')

As a gist:
https://gist.github.com/2703965

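I left the easy part (the formatting) to you. For now it just prints the groups vertically (rather than horizontally, as you asked). That shouldn't be too hard to change.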
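
For the horizontal layout, something along these lines might do. It is a sketch only: `format_horizontal`, `per_row` and `width` are names I made up, and it assumes each group is a list of `(name, int_value)` tuples as produced by the functions above.

```python
def format_horizontal(groups, per_row=3, width=16):
    """Lay groups of (name, value) pairs out side by side, per_row groups at a time."""
    blocks = []
    for start in range(0, len(groups), per_row):
        row = groups[start:start + per_row]
        depth = max([len(group) for group in row])
        lines = []
        for i in range(depth):
            cells = []
            for group in row:
                if i < len(group):
                    cell = '%s   %s' % (group[i][0], group[i][1])
                    cells.append(cell.ljust(width))
                else:
                    cells.append(' ' * width)  # this group has run out of items
            lines.append(''.join(cells).rstrip())
        # One line of totals underneath each row of groups.
        totals = [(' ' * 4 + str(sum([v for _, v in group]))).ljust(width)
                  for group in row]
        lines.append(''.join(totals).rstrip())
        blocks.append('\n'.join(lines))
    return '\n\n'.join(blocks)
```

With 12 groups and `per_row=3` this gives four blocks of three columns each, similar to the layout in the question.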

Here is the best it manages (with both implementations):

(max(sums) - min(sums)) / 2. = 16.0

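That's far from the example, but it's a start. You can launch it from the command line with a filename and the optional --second switch (to use the second implementation). I could have used a command-line parser, but I'm used to argparse, which doesn't exist in Python 2.4, so I just went with that awkward hack.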
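
For what it's worth, `optparse` has been in the standard library since Python 2.3, so it would be an option on 2.4. A minimal sketch, where `parse_args` is a hypothetical helper and not part of the script above:

```python
from optparse import OptionParser


def parse_args(argv):
    """Return (use_second, filename) parsed from a list of arguments."""
    parser = OptionParser(usage='%prog [--second] FILENAME')
    parser.add_option('--second', action='store_true', dest='second',
                      default=False, help='use the second implementation')
    options, args = parser.parse_args(argv)
    if len(args) != 1:
        # parser.error() prints the usage message and exits.
        parser.error('expected exactly one filename')
    return options.second, args[0]
```

It would be called as `parse_args(sys.argv[1:])` in place of the `'--second' in sys.argv` check.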

Sample run:

$ python2 groupit.py filename.txt
baaai   168
caaal   3
aaaaj   43
        214

aaaac   150
aaaal   3
caaae   44
        197

aaaae   144
aaaag   5
baaae   45
        194

caaah   141
baaak   11
baaal   47
        199

aaaad   135
aaaai   11
caaaj   59
        205

baaad   111
aaaaf   12
caaaa   59
        182

caaag   111
caaaf   16
baaah   68
        195

aaaaa   100
baaaj   21
baaag   71
        192

baaaf   99
baaaa   25
aaaab   75
        199

caaak   89
caaad   33
caaac   77
        199

aaaak   88
baaab   33
caaab   85
        206

baaac   87
aaaah   34
caaai   87
        208
$ python2 groupit.py --second filename.txt
baaai   168
caaal   3
aaaaj   43
        214

aaaac   150
aaaal   3
caaae   44
        197

aaaae   144
aaaag   5
baaae   45
        194

caaah   141
baaak   11
baaal   47
        199

aaaad   135
aaaai   11
caaaj   59
        205

baaad   111
aaaaf   12
caaaa   59
        182

caaag   111
caaaf   16
baaah   68
        195

aaaaa   100
baaaj   21
baaag   71
        192

baaaf   99
baaaa   25
aaaab   75
        199

caaak   89
caaad   33
caaac   77
        199

aaaak   88
baaab   33
caaab   85
        206

baaac   87
aaaah   34
caaai   87
        208

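With the example from the question, both algorithms give the same answer. If you can provide more test cases, I'll try to improve them. I tested the script on Python 2.7 because I don't have 2.4 installed. Sorry for the long answer.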

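Answer 3: (score: 0)

Here is what I've come up with so far. I think it gets pretty close: apart from the really high values throwing it off a little, it's close.

Happy to hear any suggestions for making it more pythonic.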

file:columnsplit.py

#!/usr/bin/python
import sys, operator

# usage
# columnsplit.py <filename> <#cols>
# columnsplit.py test.csv 12
#

#determine number of devices per column
def devicelisting(fulllist,percolumn):
  devicelist=[]
  fobj=open(fulllist,'r')
  for line in fobj:
    (key, val) = line.split(',')
    devicelist.append((key,int(val)))
  devicespercol=(len(devicelist)/int(percolumn))
  return(devicelist,devicespercol)

def devicesplit(fulllist,numcolumns,roundnum):
  if roundnum == 0:
    devices=sorted(fulllist, key=lambda device: device[1], reverse=True)
    devicestemp=devices
  else:
    devices=sorted(fulllist, key=lambda device: device[1])
    devicestemp=devices
  deviceslice=[]
  for idx, val in zip(range(numcolumns), devices):
    deviceslice.append(val)
    devicestemp.remove(val)
  return(deviceslice,devicestemp)

def makecolumns(roundnumber,percol):
  column=[]
  for i in range(percol):
    exec('tempslice=deviceslice%s' % i)
    column.append(tempslice[roundnumber])
  return(column)

# what this is going to do is generate how many devices will fill each of the intended
# number of columns.  What is left over will be run again against the lowest value of columns

if __name__ == '__main__':
  tempslice=[]
  devices,percol=devicelisting(sys.argv[1],sys.argv[2])
  # devices is the devices/value as tuples nested in a list
  # percol is going to be how many devices per column
  # you can len(devices) to count how many devices we have

  # prints out the device list in reverse.
  # print sorted(devices, key=lambda x: x[1], reverse=True)

  # what we will need to do here is split the device list into number of desired slices.  i.e. if we want 12 columns
  # and we have 108 devices there should be 9 slices of 12.
  # this will leave a remaining slice - of less than 9 which will be added to the 12 columns in order of smallest column first

  devicesleft=devices
  numcolumns=int(sys.argv[2])
  for i in range(percol):
    sendcol,devicesleft=devicesplit(devicesleft,numcolumns,i)
    exec('deviceslice%s=sendcol' % i)
# and finally create the columns
  for i in range(0,numcolumns):
    sendcol=makecolumns(i,percol)
    exec('column%s=sendcol' % i)

  # add the left over devices
  j=numcolumns
  # sort remaining reverse.
  devices=sorted(devicesleft, key=lambda device: device[1], reverse=True)
  for i in range(len(devices)):
    j-=1
    exec('column%s.append(devices[i])' % j)

  # prints out the resulting columns
  for i in range(0,numcolumns):
    exec('tempcol=column%s' % i)
    print tempcol
    print sum([pair[1] for pair in tempcol])
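
One possible way to drop the `exec` calls is to keep the columns in a list of lists. This is only a sketch that tries to follow the same rounds as the script above (largest values first, then the smallest remaining ones, then the leftovers handed out starting from the last column); the name `deal_columns` is made up, and it sticks to constructs that I believe exist in Python 2.4 as well.

```python
def deal_columns(devices, numcolumns):
    """Deal (name, value) pairs into numcolumns lists, one item per column per round."""
    percol = len(devices) // numcolumns
    columns = [[] for _ in range(numcolumns)]
    # Round 0 hands out the largest values, one per column.
    remaining = sorted(devices, key=lambda device: device[1], reverse=True)
    for roundnum in range(percol):
        if roundnum == 1:
            # From round 1 on, hand out the smallest remaining values instead.
            remaining.sort(key=lambda device: device[1])
        dealt, remaining = remaining[:numcolumns], remaining[numcolumns:]
        for column, device in zip(columns, dealt):
            column.append(device)
    # Leftovers: largest first, starting at the last column and moving backwards.
    leftovers = sorted(remaining, key=lambda device: device[1], reverse=True)
    for offset, device in enumerate(leftovers):
        columns[numcolumns - 1 - offset].append(device)
    return columns
```

The printing loop then becomes a plain `for column in columns:` over the returned list.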

The test file I ran it against:

file:test44a.csv

SQCIEOEO,1272
HIKTXYZH,281
JZHRZXKX,5793
UBGTOLUX,147
WBVYFNBN,9
VMHTKHBU,32
GILGFWDA,1334
YKUMWOKT,2066
PFSVTUIP,51
GPJRWKMD,673
TYJZUNZS,27
XTFUHPNX,2102
VFSPABFG,65
ROYOZKRS,189
IARDNRVL,587
LBFSQTQL,973
ZJBZKGFB,21301
UEPUOHMW,20
HEAVWVGH,0
XMANFQZE,719
ZADKGIMB,82
NCVBJIYR,27
NYMJUSQR,20646
EQFKHEOH,2050
ERRLAENN,19
HIPRQNIE,12557
MVNHODYT,20
UEDBIRIN,14
JAZJEMXL,28
UMDLALPN,36
GCUUGTNA,0
XRCGIKTR,12
KSBPEYBZ,20657
LELLPAYW,43792
DTRKMFLK,73
WNQEXJWI,41
CYXHXYHI,10
CSUSTTOX,120
NFHZLSJH,23
FAMDKJLM,25
HIUEHBNJ,261
UIBNCQKP,40
WSPHKYOQ,30025
ZBUJKFWR,0
OQWVSKFM,49
SHZUXKKU,21
CZBMYQDX,45
RXGBCCTR,17
SPMLASXS,15
ZWNXGXRI,59
WTVUJZSB,22
WYDZBWQU,19100
MDFMVCFV,6133
ZSSGQJPM,25
CKHMJZOG,85
YRFZOWTB,28
AYNWBSRA,14
LJGBTVOW,13110
GWJPWXWU,16
PCUDYNEY,179
MSVNLMOX,62
WUYPPNMW,2285
KVLGTIBI,11
KWMIKQHW,11
JDKUPYRM,1851
DARXQYDY,68
UUPXIDEP,139
SKQZMTFY,4377
ZEPOWAEA,189
BWXRVAPP,167
VFMDIRTA,561
BKANEGMD,2122
LBRICWID,1775
TGVOGLDC,3650
QQGZHAAJ,81
KAXPHJSS,122
LKAOHISA,32
ONOVZSYQ,41
IEPQEPZP,62
QWEXGXQS,0
IQGPZYQO,15
MEJLXIBG,10
MRWRHWHX,10
TMVAJLSS,57
BYIAXYOJ,173
DYUAGWGT,248
ODLVZSST,21
EOTOZLHA,6476
KPBHOQQR,30
OLSVIYOW,539
CZSCSLVX,17
ZPMYBTZL,11
IATWRKOF,12507
WGBEFQBH,41
PUJIFEFE,382
TSDULCGU,9070
DARUKFAG,209
MBLRRNYH,250
IIQNNWSG,25
OWBZYIUC,1808
ILXTRXZD,2012
ZLVRZUYH,269
CPVPLOWZ,108
KYZJGTMO,635
EJHWGHZG,25
TUXTOWBR,11
LXGXLCWW,2313
AVFHPRWT,915
AEPHMPNF,32
KLZZHAQT,56
XWQJZNFA,611
JKHYCDSC,1455

Command to run it: python columnsplit.py test44a 12 (12 is the number of columns wanted).

Sample output columns, each preceded by its total.

1) 45577 [('LELLPAYW', 43792), ('HEAVWVGH', 0), ('XRCGIKTR', 12), ('ODLVZSST', 21), ('VMHTKHBU', 32), ('TMVAJLSS', 57), ('KAXPHJSS', 122), ('ZLVRZUYH', 269), ('SQCIEOEO', 1272)]

2) 31906 [('WSPHKYOQ', 30025), ('GCUUGTNA', 0), ('UEDBIRIN', 14), ('WTVUJZSB', 22), ('LKAOHISA', 32), ('ZWNXGXRI', 59), ('UUPXIDEP', 139), ('HIKTXYZH', 281), ('GILGFWDA', 1334)]

3) 23416 [('ZJBZKGFB', 21301), ('ZBUJKFWR', 0), ('AYNWBSRA', 14), ('NFHZLSJH', 23), ('AEPHMPNF', 32), ('MSVNLMOX', 62), ('UBGTOLUX', 147), ('PUJIFEFE', 382), ('JKHYCDSC', 1455)]

4) 23276 [('KSBPEYBZ', 20657), ('QWEXGXQS', 0), ('SPMLASXS', 15), ('FAMDKJLM', 25), ('UMDLALPN', 36), ('IEPQEPZP', 62), ('BWXRVAPP', 167), ('OLSVIYOW', 539), ('LBRICWID', 1775)]

5) 23342 [('NYMJUSQR', 20646), ('WBVYFNBN', 9), ('IQGPZYQO', 15), ('ZSSGQJPM', 25), ('UIBNCQKP', 40), ('VFSPABFG', 65), ('BYIAXYOJ', 173), ('VFMDIRTA', 561), ('OWBZYIUC', 1808)]

6) 21877 [('WYDZBWQU', 19100), ('CYXHXYHI', 10), ('GWJPWXWU', 16), ('IIQNNWSG', 25), ('WNQEXJWI', 41), ('DARXQYDY', 68), ('PCUDYNEY', 179), ('IARDNRVL', 587), ('JDKUPYRM', 1851)]

7) 16088 [('LJGBTVOW', 13110), ('MEJLXIBG', 10), ('RXGBCCTR', 17), ('EJHWGHZG', 25), ('ONOVZSYQ', 41), ('DTRKMFLK', 73), ('ROYOZKRS', 189), ('XWQJZNFA', 611), ('ILXTRXZD', 2012)]

8) 15607 [('HIPRQNIE', 12557), ('MRWRHWHX', 10), ('CZSCSLVX', 17), ('TYJZUNZS', 27), ('WGBEFQBH', 41), ('QQGZHAAJ', 81), ('ZEPOWAEA', 189), ('KYZJGTMO', 635), ('EQFKHEOH', 2050)]

9) 17952 [('IATWRKOF', 12507), ('KVLGTIBI', 11), ('ERRLAENN', 19), ('NCVBJIYR', 27), ('CZBMYQDX', 45), ('ZADKGIMB', 82), ('DARUKFAG', 209), ('GPJRWKMD', 673), ('YKUMWOKT', 2066), ('LXGXLCWW', 2313)]

10) 15982 [('TSDULCGU', 9070), ('KWMIKQHW', 11), ('UEPUOHMW', 20), ('JAZJEMXL', 28), ('OQWVSKFM', 49), ('CKHMJZOG', 85), ('DYUAGWGT', 248), ('XMANFQZE', 719), ('XTFUHPNX', 2102), ('TGVOGLDC', 3650)]

11) 14358 [('EOTOZLHA', 6476), ('ZPMYBTZL', 11), ('MVNHODYT', 20), ('YRFZOWTB', 28), ('PFSVTUIP', 51), ('CPVPLOWZ', 108), ('MBLRRNYH', 250), ('AVFHPRWT', 915), ('BKANEGMD', 2122), ('SKQZMTFY', 4377)]

12) 15683 [('MDFMVCFV', 6133), ('TUXTOWBR', 11), ('SHZUXKKU', 21), ('KPBHOQQR', 30), ('KLZZHAQT', 56), ('CSUSTTOX', 120), ('HIUEHBNJ', 261), ('LBFSQTQL', 973), ('WUYPPNMW', 2285), ('JZHRZXKX', 5793)]