I have a nested list, for example:
[[5117, 1556658900, u'29.3'], [5117, 1556659200, u'29.2'], [5117, 1556659500, u'29.0'],
[67097, 1556658900, u'28.61'], [67097, 1556659200, u'28.5'], [67097, 1556659500, u'28.44'],
[69370, 1556658900, u'30.0'], [69370, 1556659200, u'29.90'], [69370, 1556659500, u'29.94']]
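I want to return a modified nested list that contains one entry per unique identifier (the identifier is the first integer of each element). The entry to keep should be the one with the largest second value.
For example, I want to return: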
[[5117, 1556659500, u'29.0'], [67097, 1556659500, u'28.44'],[69370, 1556659500, u'29.94']]
Can this be done with itertools or in some other way?
The maximum of the second value may not always be the last entry in an identifier's group.
Answer 0 (score: 3)
In this case I would prefer a simple, straightforward solution, even though it is not the most efficient:
from random import randint

# data = [[5117, 1556658900, u'29.3'], [5117, 1556659200, u'29.2'], [5117, 1556659500, u'29.0'],
#         [67097, 1556658900, u'28.61'], [67097, 1556659200, u'28.5'], [67097, 1556659500, u'28.44'],
#         [69370, 1556658900, u'30.0'], [69370, 1556659200, u'29.90'], [69370, 1556659500, u'29.94']]
data = [[randint(0, 1000), randint(0, 10000), str(randint(0, 100))] for _ in range(1000000)]


def max_recs(recs, identifier, value):
    # identifier and value are the indices of the id and of the value to maximise
    results = {}
    for rec in recs:
        if rec[identifier] not in results or rec[value] > results[rec[identifier]][value]:
            results[rec[identifier]] = rec
    return list(results.values())


def max_recs_fixed(recs):
    # same as max_recs, with the indices hard-coded to 0 (id) and 1 (value)
    results = {}
    for rec in recs:
        if rec[0] not in results or rec[1] > results[rec[0]][1]:
            results[rec[0]] = rec
    return list(results.values())


print(max_recs(data, 0, 1))
print(max_recs_fixed(data))
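Although this makes many temporary references to elements of the original list and later overwrites them with elements that have a higher value, I don't think that is an efficiency problem.
The main cost is in the many repeated dictionary lookups, which is hard to avoid given the nature of the problem. But you are relying on an implementation that is very efficient in Python itself.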
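If you don't need to be able to tell the function which indices to use for identifier and value, you will find that max_recs_fixed is a bit faster than max_recs. Profiling with one million randomly generated records, it was about 5% faster on average.
Since the OP seems to prefer simplicity, this is the minimal version:
data = [[5117, 1556658900, u'29.3'], [5117, 1556659200, u'29.2'], [5117, 1556659500, u'29.0'],
        [67097, 1556658900, u'28.61'], [67097, 1556659200, u'28.5'], [67097, 1556659500, u'28.44'],
        [69370, 1556658900, u'30.0'], [69370, 1556659200, u'29.90'], [69370, 1556659500, u'29.94']]
results = {}
for rec in data:
    if rec[0] not in results or rec[1] > results[rec[0]][1]:
        results[rec[0]] = rec
print(list(results.values()))
Since there are many quite different answers and speed matters to some extent, here is some code that compares them fairly; you can use cProfile or a similar tool to compare performance. The timings I got from running them under cProfile follow the code: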
from itertools import groupby
from operator import itemgetter
from random import randint
from collections import defaultdict

# generate a random set of 1000000 items matching the example format
data = [[randint(0, 1000), randint(0, 10000), str(randint(0, 100))] for _ in range(1000000)]


def max_recs(recs):
    results = {}
    for rec in recs:
        if rec[0] not in results or rec[1] > results[rec[0]][1]:
            results[rec[0]] = rec
    return list(results.values())


def convert(lst):
    biggest = defaultdict(int)
    for ident, value, _ in lst:
        if value > biggest[ident]:
            biggest[ident] = value
    return list(filter(lambda l: l[1] == biggest[l[0]], lst))


def process_list(l):
    d = {}
    for item in l:
        key = item[0]
        if key in d:
            if item[1] > d[key][1]:
                d[key] = item
        else:
            d[key] = item
    return list(d.values())


def naive(l):
    temp = []
    temp2 = []
    li = sorted(l, key=lambda x: x[1], reverse=True)
    for i in li:
        if i[0] not in temp:
            temp2.append(i)
            temp.append(i[0])
    return temp2


def another(recs):
    return [
        max(g, key=itemgetter(1))
        for k, g in groupby(sorted(recs), key=itemgetter(0))
    ]


max_recs_res = max_recs(data)
convert_res = convert(data)
process_list_res = process_list(data)
naive_res = naive(data)
another_res = another(data)


def cs(result):
    # return a set of id, value combinations of a result, for comparison
    return {(i, v) for i, v, _ in result}


# check that all results have the same id, value combinations (they do)
assert cs(max_recs_res) == cs(convert_res) == cs(process_list_res) == cs(naive_res) == cs(another_res)
# check that all results have the same number of solutions (convert_res includes *duplicate* id, val combinations!)
assert len(max_recs_res) == len(process_list_res) == len(naive_res) == len(another_res)  # == len(convert_res)
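+-----------+----------+---------+--------------+--------+---------+----------------+
| Records   | max_recs | convert | process_list | naive  | another | notes          |
+-----------+----------+---------+--------------+--------+---------+----------------+
| 1000x100  | 14ms     | 49ms    | 13ms         | 110ms  | 83ms    |                |
| 1x10000   | 2ms      | 4ms     | 2ms          | 80ms   | 9ms     | 3x rounded avg |
| 1x1000000 | 216ms    | 400ms   | 258ms        | 7234ms | 2416ms  | 3x rounded avg |
+-----------+----------+---------+--------------+--------+---------+----------------+
Running with fewer than 1000 records did not give meaningful values at all: timed individually the results varied a lot, so the times of 1000 runs on 100 records each were added together. Overall, though, they all run quite fast at that size and execute almost instantly.
For the larger data sets the results are very clear: most algorithms scale linearly with the size of the data set, except naive and another, which fall far behind as the data grows. Both sort the input, which is O(n log n) work, and naive also tests membership in a list for every record, so this is not truly exponential but clearly worse. (If anyone wants to analyse this and provide the exact orders, be my guest.)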
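To get a rough feel for the scaling yourself, something like the following could work. It is only a sketch: it uses timeit rather than the cProfile runs mentioned above, the helper name time_at_sizes and the chosen sizes are made up for illustration, and the absolute numbers will differ per machine.

```python
import timeit
from random import randint


def max_recs(recs):
    # single pass, dictionary keyed by identifier
    results = {}
    for rec in recs:
        if rec[0] not in results or rec[1] > results[rec[0]][1]:
            results[rec[0]] = rec
    return list(results.values())


def naive(recs):
    # sort by value (descending), keep the first record seen per identifier
    seen_ids, kept = [], []
    for rec in sorted(recs, key=lambda r: r[1], reverse=True):
        if rec[0] not in seen_ids:
            kept.append(rec)
            seen_ids.append(rec[0])
    return kept


def time_at_sizes(func, sizes, repeats=3):
    """Return {size: best wall-clock seconds over `repeats` runs} for func."""
    timings = {}
    for n in sizes:
        data = [[randint(0, 1000), randint(0, 10000), str(randint(0, 100))]
                for _ in range(n)]
        timings[n] = min(timeit.repeat(lambda: func(data), number=1, repeat=repeats))
    return timings


sizes = [1000, 10000, 50000]  # hypothetical sizes, kept small so this runs quickly
for func in (max_recs, naive):
    print(func.__name__, time_at_sizes(func, sizes))
```

Taking the minimum over a few repeats is one common way to reduce noise; comparing how the times grow from one size to the next says more than any single number.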
Answer 1 (score: 2)
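The desired result will be in d.values(), with d built like this:
d = {}
for r in rows:  # rows is the input list of [identifier, timestamp, value] records
    # look up the current best record by identifier (r[0]);
    # the (0, 0, 0) default lets the first record for an identifier win (assumes positive values)
    if r[1] > d.get(r[0], (0, 0, 0))[1]:
        d[r[0]] = r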
Answer 2 (score: 1)
One solution is to iterate over the whole list once, using a dictionary to keep track of the highest entry for each identifier.
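A minimal sketch of that idea, not necessarily the exact code this answer had in mind: the function name highest_per_id is made up, and records are assumed to be [identifier, timestamp, value] lists as in the question. The sample is deliberately unordered to show that the maximum does not have to be the last entry of a group.

```python
def highest_per_id(records):
    """Return, for each identifier (index 0), the record with the largest index-1 value."""
    best = {}  # identifier -> best record seen so far
    for record in records:
        key = record[0]
        if key not in best or record[1] > best[key][1]:
            best[key] = record
    return list(best.values())


sample = [[5117, 1556659500, '29.0'], [67097, 1556658900, '28.61'],
          [5117, 1556658900, '29.3'], [67097, 1556659200, '28.5'],
          [5117, 1556659200, '29.2']]
print(highest_per_id(sample))
```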
Answer 3 (score: 1)
My approach is to build a mapping of the maximum value for each identifier and then filter the list using that mapping:
from collections import defaultdict


def convert(lst):
    # first pass: largest second value per identifier
    biggest = defaultdict(int)
    for ident, value, _ in lst:
        if value > biggest[ident]:
            biggest[ident] = value
    # second pass: keep the records that carry that largest value
    return [l for l in lst if l[1] == biggest[l[0]]]


lst = [[5117, 1556658900, u'29.3'], [5117, 1556659200, u'29.2'], [5117, 1556659500, u'29.0'],
       [67097, 1556658900, u'28.61'], [67097, 1556659200, u'28.5'], [67097, 1556659500, u'28.44'],
       [69370, 1556658900, u'30.0'], [69370, 1556659200, u'29.90'], [69370, 1556659500, u'29.94']]
print(convert(lst))
# output: [[5117, 1556659500, '29.0'], [67097, 1556659500, '28.44'], [69370, 1556659500, '29.94']]
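One thing to be aware of with the filtering approach: if two records for the same identifier share the maximum value, both are kept. The sketch below shows that and one possible way to keep only the first of them; convert_unique and the tied sample data are made up for illustration.

```python
from collections import defaultdict


def convert(lst):
    biggest = defaultdict(int)
    for ident, value, _ in lst:
        if value > biggest[ident]:
            biggest[ident] = value
    return [l for l in lst if l[1] == biggest[l[0]]]


def convert_unique(lst):
    """Like convert, but keep only the first record per identifier."""
    seen = set()
    unique = []
    for rec in convert(lst):
        if rec[0] not in seen:
            seen.add(rec[0])
            unique.append(rec)
    return unique


tied = [[1, 50, 'a'], [1, 50, 'b'], [2, 10, 'c']]
print(convert(tied))         # [[1, 50, 'a'], [1, 50, 'b'], [2, 10, 'c']]
print(convert_unique(tied))  # [[1, 50, 'a'], [2, 10, 'c']]
```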
After some thought, I rewrote the code above:
from itertools import groupby

lst = [[5117, 1556658900, u'29.3'], [5117, 1556659200, u'29.2'], [5117, 1556659500, u'29.0'],
       [67097, 1556658900, u'28.61'], [67097, 1556659200, u'28.5'], [67097, 1556659500, u'28.44'],
       [69370, 1556658900, u'30.0'], [69370, 1556659200, u'29.90'], [69370, 1556659500, u'29.94']]
# groupby only groups adjacent records, so records with the same identifier must be next to each other
new_list = [max(g, key=lambda l: l[1]) for _, g in groupby(lst, key=lambda l: l[0])]
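Because itertools.groupby only groups consecutive items with equal keys, the one-liner relies on the list already being grouped by identifier. If that is not guaranteed, sorting by identifier first is one option. A small sketch, with max_per_group and the shuffled sample as made-up names:

```python
from itertools import groupby
from operator import itemgetter


def max_per_group(records):
    # sort by identifier so that equal identifiers become adjacent for groupby
    ordered = sorted(records, key=itemgetter(0))
    return [max(group, key=itemgetter(1))
            for _, group in groupby(ordered, key=itemgetter(0))]


shuffled = [[67097, 1556659200, '28.5'], [5117, 1556658900, '29.3'],
            [67097, 1556659500, '28.44'], [5117, 1556659500, '29.0'],
            [5117, 1556659200, '29.2']]
print(max_per_group(shuffled))
# [[5117, 1556659500, '29.0'], [67097, 1556659500, '28.44']]
```

The sort adds O(n log n) work, so for large inputs a single dictionary pass is likely to be cheaper.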
Answer 4 (score: 0)
This may be naive:
li = [[5117, 1556658900, u'29.3'], [5117, 1556659200, u'29.2'], [5117, 1556659500, u'29.0'],
      [67097, 1556658900, u'28.61'], [67097, 1556659200, u'28.5'], [67097, 1556659500, u'28.44'],
      [69370, 1556658900, u'30.0'], [69370, 1556659200, u'29.90'], [69370, 1556659500, u'29.94']]
temp = []   # identifiers already seen
temp2 = []  # records kept
li = sorted(li, key=lambda x: x[1], reverse=True)  # largest second value first
for i in li:
    if i[0] not in temp:
        temp2.append(i)
        temp.append(i[0])
print(temp2)
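But you can modify it to make it efficient.
Edit:
I would say this is not the best answer, so the OP should pick one of the others (so that other people get help from it in the future).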
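As for making it more efficient, one possible modification (a sketch only, with the made-up name newest_per_id) is to track the identifiers already seen in a set instead of a list, so that each membership test takes roughly constant time. The sort still costs O(n log n).

```python
def newest_per_id(records):
    """Sort by the second value (descending) and keep the first record per identifier."""
    seen = set()  # identifiers already kept; set lookups avoid scanning a list
    kept = []
    for rec in sorted(records, key=lambda r: r[1], reverse=True):
        if rec[0] not in seen:
            seen.add(rec[0])
            kept.append(rec)
    return kept


li = [[5117, 1556658900, '29.3'], [5117, 1556659200, '29.2'], [5117, 1556659500, '29.0'],
      [67097, 1556658900, '28.61'], [67097, 1556659200, '28.5'], [67097, 1556659500, '28.44'],
      [69370, 1556658900, '30.0'], [69370, 1556659200, '29.90'], [69370, 1556659500, '29.94']]
print(newest_per_id(li))
```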