Select one row per category based on a number

Time: 2015-03-12 18:13:49

Tags: python

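I have a list with 5 columns. Column 5 holds a number and column 1 holds a group identifier. There are 500 rows but only 24 groups.

What I want is to select just one row per group identifier: the row with the smallest number in column 5.

For example: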

sheet= """ 
cmn1\tcmn2\tcmn3\tcmn4\tcmn5
rob\t45\tfoo\tbar\t0.0001
Steve\t32\tfoo\tspam\t0.01
rob\t45\tbar\tfoo\t0.0000001
Steve\t32\tfoo\tbar\t0.1"""

This is the desired result:

cmn1\tcmn2\tcmn3\tcmn4\tcmn5
Steve\t32\tfoo\tspam\t0.01
rob\t45\tbar\tfoo\t0.0000001

I have the fields of each row in a list, but I am stuck on how to select the row with the smallest number in parts[4]:

# split the sheet into lines, then each line into its tab-separated fields
lines = sheet.strip().split("\n")

for part in lines:
    parts = part.split("\t")
    print(parts[0], parts[1], parts[2], parts[3], parts[4])

3 answers:

Answer 0 (score: 2)

sheet= """ cmn1 cmn2 cmn3 cmn4 cmn5
rob  45   foo  bar  0.0001
Steve 32  foo  spam 0.01
rob   45  bar  foo  0.0000001
Steve 32  foo  bar  0.1"""

from collections import defaultdict

d = defaultdict(list)
spl = sheet.splitlines()
header = spl[0]
# iterate over all lines except header
for line in spl[1:]:
    # split once on whitespace using name as the key 
    name = line.split(None,1)[0]
    # append each line to our list of values
    d[name].append(line)

# get min of each line in our values based on the last float value
for v in d.values():
    print(min(v,key=lambda x: float(x.split()[-1])))

Steve 32  foo  spam 0.01
rob   45  bar  foo  0.0000001

If order matters, you can use an OrderedDict and do the comparison as you go:

from collections import OrderedDict

d = OrderedDict()
spl = sheet.splitlines()
header = spl[0]
for line in spl[1:]:
    # unpack five elements after splitting
    # using name as key and f to cast to float and compare
    name, _, _, _, f = line.split()
    # if key exists compare float value to current float value
    # keeping or replacing the values based on the outcome
    if name in d and float(d[name].split()[-1]) > float(f):
        d[name] = line
    # else if first time seeing name just add it
    elif name not in d:
        d[name] = line

print(header)
for v in d.values():
    print(v)

cmn1 cmn2 cmn3 cmn4 cmn5
rob   45  bar  foo  0.0000001
Steve 32  foo  spam 0.01

With your edited (tab-separated) lines, you can see that the output is not altered; each line is exactly as it was in the original:

for v in d.values():
    print(repr(v))

'rob\t45\tbar\tfoo\t0.0000001'
'Steve\t32\tfoo\tspam\t0.01'
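
If the data really is tab-separated, as in the question, the same single-pass idea could also be sketched with the csv module. This is only an illustration under that assumption: the function name min_rows_by_group and its parameters are made up for this example, and it relies on plain dicts preserving insertion order (Python 3.7+).

```python
import csv
import io

def min_rows_by_group(text, key_col=0, value_col=4, delimiter="\t"):
    """Hypothetical helper: keep, for each key_col value, the row whose value_col is smallest."""
    reader = csv.reader(io.StringIO(text.strip()), delimiter=delimiter)
    header = next(reader)
    best = {}  # insertion-ordered on Python 3.7+, so groups stay in first-seen order
    for row in reader:
        if not row:  # skip blank lines
            continue
        key = row[key_col]
        # keep the first row seen for a group, or replace it with a smaller one
        if key not in best or float(row[value_col]) < float(best[key][value_col]):
            best[key] = row
    return header, list(best.values())

sheet = (
    "cmn1\tcmn2\tcmn3\tcmn4\tcmn5\n"
    "rob\t45\tfoo\tbar\t0.0001\n"
    "Steve\t32\tfoo\tspam\t0.01\n"
    "rob\t45\tbar\tfoo\t0.0000001\n"
    "Steve\t32\tfoo\tbar\t0.1"
)

header, rows = min_rows_by_group(sheet)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

Because the fields are kept as strings and only converted to float for the comparison, the number in column 5 is printed exactly as it appeared in the input.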

Answer 1 (score: 1)

You can use itertools.groupby to group your split lines by the first item, then use the min function with an appropriate key to select the desired row:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> s=sorted((line.split() for line in sheet.strip().split('\n')[1:]),key=itemgetter(0))
>>> [' '.join(min(g,key=lambda x:float(x[4]))) for _,g in groupby(s,itemgetter(0))]
['Steve 32 foo spam 0.01', 'rob 45 bar foo 0.0000001']
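
The sort step matters here because groupby only groups consecutive items with equal keys. As a rough sketch of the same approach written as a reusable function (the name min_per_group and its parameters are hypothetical, chosen only for this example):

```python
from itertools import groupby
from operator import itemgetter

def min_per_group(lines, key_index=0, value_index=4):
    """Hypothetical helper: return one split row per group, the one with the smallest value."""
    # groupby only merges adjacent equal keys, so sort by the group key first
    rows = sorted(
        (line.split() for line in lines if line.strip()),
        key=itemgetter(key_index),
    )
    return [
        min(group, key=lambda row: float(row[value_index]))
        for _, group in groupby(rows, key=itemgetter(key_index))
    ]

data = [
    "rob\t45\tfoo\tbar\t0.0001",
    "Steve\t32\tfoo\tspam\t0.01",
    "rob\t45\tbar\tfoo\t0.0000001",
    "Steve\t32\tfoo\tbar\t0.1",
]

result = min_per_group(data)
for row in result:
    print("\t".join(row))
```

The groups come out in sorted key order rather than input order, so "Steve" is listed before "rob" (uppercase letters sort before lowercase ones).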

Answer 2 (score: 0)

You can use a dictionary to store all the rows for each unique value in column 1:

sheet= """cmn1\tcmn2\tcmn3\tcmn4\tcmn5
rob\t45\tfoo\tbar\t0.0001
Steve\t32\tfoo\tspam\t0.01
rob\t45\tbar\tfoo\t0.0000001
Steve\t32\tfoo\tbar\t0.1"""

grouped = {}
for line in sheet.split('\n')[1:]:
  parts = line.split('\t')
  print (line)
  # Parse the numbers into numerical types
  typed = (parts[0], int(parts[1]), parts[2], parts[3], float(parts[4]))
  #Add the typed list of values into a list stored in our dict
  if parts[0] in grouped:
    grouped[parts[0]].append(typed) 
  else:
    grouped[parts[0]] = [typed]

#Now you can go through all the keys in the dict and select the smallest  
smallest_per_group = []
for key in grouped:
  lines = grouped[key]
  # using the 'key' parameter tells Python to give us the line with the smallest 5th column
  smallest = min(lines, key=lambda x:x[4])
  smallest_per_group.append(smallest)
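
One thing to watch when writing the result back out: once column 5 has been parsed with float, printing it gives 1e-07 rather than 0.0000001. A possible variation, sketched here under the assumption that the original text should be preserved, stores the original line next to the parsed number (the names entries and result are illustrative only):

```python
sheet = (
    "cmn1\tcmn2\tcmn3\tcmn4\tcmn5\n"
    "rob\t45\tfoo\tbar\t0.0001\n"
    "Steve\t32\tfoo\tspam\t0.01\n"
    "rob\t45\tbar\tfoo\t0.0000001\n"
    "Steve\t32\tfoo\tbar\t0.1"
)

lines = sheet.split("\n")
header = lines[0]

grouped = {}
for line in lines[1:]:
    parts = line.split("\t")
    # store (parsed number, original line) so the line can be printed unchanged later
    grouped.setdefault(parts[0], []).append((float(parts[4]), line))

# for each group pick the entry with the smallest number and keep its original line
result = [min(entries, key=lambda entry: entry[0])[1] for entries in grouped.values()]

print(header)
for line in result:
    print(line)
```

dict.setdefault replaces the if/else used to create the list the first time a group is seen.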