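I have a table with 5 columns: column 5 holds a number and column 1 holds a group identifier. There are 500 rows but only 24 groups.
For each group identifier, I want to select only the one row with the smallest number in column 5.
E.g.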
sheet= """
cmn1\tcmn2\tcmn3\tcmn4\tcmn5
rob\t45\tfoo\tbar\t0.0001
Steve\t32\tfoo\tspam\t0.01
rob\t45\tbar\tfoo\t0.0000001
Steve\t32\tfoo\tbar\t0.1"""
This is the desired result:
cmn1\tcmn2\tcmn3\tcmn4\tcmn5
Steve\t32\tfoo\tspam\t.01
rob\t45\tbar\tfoo\t0.0000001
I have the fields of each row in a list, but I am stuck on how to select the row with the smallest number in parts[4]:
for line in sheet:
    line = sheet.strip().split("\n")
    parts = []
    for part in line:
        parts = []
        parts = part.split("\t")
        print parts [0], parts [1], parts[2], parts[3], parts[4]
Answer 0 (score: 2)
sheet= """ cmn1 cmn2 cmn3 cmn4 cmn5
rob 45 foo bar 0.0001
Steve 32 foo spam 0.01
rob 45 bar foo 0.0000001
Steve 32 foo bar 0.1"""
from collections import defaultdict

d = defaultdict(list)
spl = sheet.splitlines()
header = spl[0]
# iterate over all lines except header
for line in spl[1:]:
    # split once on whitespace using name as the key
    name = line.split(None, 1)[0]
    # append each line to our list of values
    d[name].append(line)
# get min of each line in our values based on the last float value
for v in d.values():
    print(min(v, key=lambda x: float(x.split()[-1])))
Steve 32 foo spam 0.01
rob 45 bar foo 0.0000001
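The snippet above splits on any whitespace. If a field in the real file could contain spaces, splitting on the tab character only may be safer. Here is a minimal sketch of the same idea, assuming the data is tab-delimited as in the question; the function name `smallest_per_group` is made up for illustration:

```python
from collections import defaultdict

# Sample data from the question, tab-separated.
sheet = (
    "cmn1\tcmn2\tcmn3\tcmn4\tcmn5\n"
    "rob\t45\tfoo\tbar\t0.0001\n"
    "Steve\t32\tfoo\tspam\t0.01\n"
    "rob\t45\tbar\tfoo\t0.0000001\n"
    "Steve\t32\tfoo\tbar\t0.1"
)


def smallest_per_group(text):
    """Hypothetical helper: return (header, rows), keeping one row per
    column-1 value, namely the row with the smallest column-5 number."""
    header, *lines = text.strip().splitlines()
    groups = defaultdict(list)
    for line in lines:
        # Split on tabs only, so spaces inside a field are preserved.
        fields = line.split("\t")
        groups[fields[0]].append(line)
    rows = [
        # Compare rows by the last tab-separated field, parsed as a float.
        min(group, key=lambda row: float(row.rsplit("\t", 1)[-1]))
        for group in groups.values()
    ]
    return header, rows


header, rows = smallest_per_group(sheet)
print(header)
for row in rows:
    print(row)
```

The rows are returned unchanged, so the tabs in the selected lines survive.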
If the order matters, you can use an OrderedDict and do the comparison as you go:
from collections import OrderedDict

d = OrderedDict()
spl = sheet.splitlines()
header = spl[0]
for line in spl[1:]:
    # unpack five elements after splitting
    # using name as key and f to cast to float and compare
    name, _, _, _, f = line.split()
    # if key exists compare float value to current float value
    # keeping or replacing the values based on the outcome
    if name in d and float(d[name].split()[-1]) > float(f):
        d[name] = line
    # else if first time seeing name just add it
    elif name not in d:
        d[name] = line
print(header)
for v in d.values():
    print(v)
cmn1 cmn2 cmn3 cmn4 cmn5
rob 45 bar foo 0.0000001
Steve 32 foo spam 0.01
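With the lines from your edited question, you can see that the output is not altered; each line is exactly as it was in the original:
for v in d.values():
    print(repr(v))
'rob\t45\tbar\tfoo\t0.0000001'
'Steve\t32\tfoo\tspam\t0.01'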
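The loop above re-splits the stored line on every comparison. One possible variation is to keep the parsed number next to the line so each row is parsed only once. This is a sketch rather than the answer's own code: the name `min_rows_single_pass` is hypothetical, and it relies on plain dicts preserving insertion order, which holds in Python 3.7 and later.

```python
def min_rows_single_pass(lines):
    """Hypothetical helper: for each first-column value, keep the line
    whose last column is smallest, in order of first appearance."""
    best = {}  # name -> (value, line); insertion-ordered in Python 3.7+
    for line in lines:
        fields = line.split("\t")
        name, value = fields[0], float(fields[-1])
        # Replace the stored line only when this value is strictly smaller.
        if name not in best or value < best[name][0]:
            best[name] = (value, line)
    return [line for _, line in best.values()]


data = [
    "rob\t45\tfoo\tbar\t0.0001",
    "Steve\t32\tfoo\tspam\t0.01",
    "rob\t45\tbar\tfoo\t0.0000001",
    "Steve\t32\tfoo\tbar\t0.1",
]
result = min_rows_single_pass(data)
for line in result:
    print(repr(line))
```

Only one row per group is held in memory at a time, which for 500 rows and 24 groups makes little practical difference but keeps the logic short.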
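Answer 1 (score: 1)
You can use itertools.groupby to group your split lines by their first item, then use the min function with a suitable key to pick the wanted row from each group: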
>>> from itertools import groupby
>>> from operator import itemgetter
>>> s=sorted((line.split() for line in sheet.strip().split('\n')[1:]),key=itemgetter(0))
>>> [' '.join(min(g,key=lambda x:float(x[4]))) for _,g in groupby(s,itemgetter(0))]
['Steve 32 foo spam 0.01', 'rob 45 bar foo 0.0000001']
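The sorted call matters here: groupby only merges consecutive items that share a key. A small sketch of the difference, using the question's sample rows already split into lists (the variable names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ["rob", "45", "foo", "bar", "0.0001"],
    ["Steve", "32", "foo", "spam", "0.01"],
    ["rob", "45", "bar", "foo", "0.0000001"],
    ["Steve", "32", "foo", "bar", "0.1"],
]

first = itemgetter(0)

# Without sorting, groupby starts a new group each time the key changes,
# so each name appears as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, first)]

# After sorting by the same key, each name forms exactly one group.
sorted_keys = [key for key, _ in groupby(sorted(rows, key=first), first)]

print(unsorted_keys)  # ['rob', 'Steve', 'rob', 'Steve']
print(sorted_keys)    # ['Steve', 'rob']
```

Sorting also explains why 'Steve' comes before 'rob' in the result: uppercase letters sort before lowercase ones in the default string ordering.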
Answer 2 (score: 0)
You can use a dictionary to store all the rows for each unique column-1 value:
sheet= """cmn1\tcmn2\tcmn3\tcmn4\tcmn5
rob\t45\tfoo\tbar\t0.0001
Steve\t32\tfoo\tspam\t0.01
rob\t45\tbar\tfoo\t0.0000001
Steve\t32\tfoo\tbar\t0.1"""
grouped = {}
for line in sheet.split('\n')[1:]:
    parts = line.split('\t')
    print(line)
    # Parse the numbers into numerical types
    typed = (parts[0], int(parts[1]), parts[2], parts[3], float(parts[4]))
    # Add the typed tuple of values to a list stored in our dict
    if parts[0] in grouped:
        grouped[parts[0]].append(typed)
    else:
        grouped[parts[0]] = [typed]
# Now you can go through all the keys in the dict and select the smallest
smallest_per_group = []
for key in grouped:
    lines = grouped[key]
    # using the 'key' parameter tells Python to give us the line with the smallest 5th column
    smallest = min(lines, key=lambda x: x[4])
    smallest_per_group.append(smallest)
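Since the data is tab-separated, the standard csv module could also do the parsing and write the result back out. This is a sketch of an alternative rather than part of the answer above; it assumes the column names cmn1 and cmn5 from the sample and reads from an in-memory string instead of a real file:

```python
import csv
import io

sheet = (
    "cmn1\tcmn2\tcmn3\tcmn4\tcmn5\n"
    "rob\t45\tfoo\tbar\t0.0001\n"
    "Steve\t32\tfoo\tspam\t0.01\n"
    "rob\t45\tbar\tfoo\t0.0000001\n"
    "Steve\t32\tfoo\tbar\t0.1\n"
)

# DictReader takes the first line as field names.
reader = csv.DictReader(io.StringIO(sheet), delimiter="\t")
smallest = {}
for row in reader:
    name = row["cmn1"]
    # Keep the row with the smallest cmn5 seen so far for this name.
    if name not in smallest or float(row["cmn5"]) < float(smallest[name]["cmn5"]):
        smallest[name] = row

# Write the selected rows back out as tab-separated text.
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(smallest.values())
print(out.getvalue(), end="")
```

The values stay as strings throughout, so a number such as 0.0000001 is written back exactly as it was read. For a real file, the io.StringIO objects would be replaced by files opened with newline="".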