I have a large 2D list of the form [[name, lower_limit, upper_limit], ...]. I want to merge the items in the list that share the same name.
For example, convert
a = [['a1', 1, 10],['a2', -1, 20],['a1', 0, 8], ['a2', 0, 1]]
to
[['a1', 0, 10], ['a2', -1, 20]]
That is, merge the items that have the same name, taking their smallest lower limit and largest upper limit as the lower and upper limits of the merged item.
Answer 0 (score: 3)
L = [['a1', 1, 10],['a2', -1, 20],['a1', 0, 8], ['a2', 0, 1]]
d = {}
for name, low, high in L:
    if name not in d:
        # first time this name is seen: start from its own limits
        d[name] = [low, high]
        continue
    # otherwise widen the stored limits as needed
    if low<d[name][0]: d[name][0] = low
    if high>d[name][1]: d[name][1] = high
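The merged limits end up in the dict d rather than in the list-of-lists shape from the question; for the sample input, d is {'a1': [0, 10], 'a2': [-1, 20]}. If that list shape is needed, a one-line follow-up sketch converts it back:

merged = [[name, low, high] for name, (low, high) in d.items()]
# merged == [['a1', 0, 10], ['a2', -1, 20]]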
Answer 1 (score: 3)
I'm not sure what the most Pythonic way is, but here is how to do it with itertools.groupby():
from itertools import groupby
from operator import itemgetter
a = [['a1', 1, 10],['a2', -1, 20],['a1', 0, 8], ['a2', 0, 1]]
merged = []
for k, g in groupby(sorted(a), key=itemgetter(0)):
    _, low_limits, high_limits = zip(*g)
    merged.append([k, min(low_limits), max(high_limits)])
This sorts the outer list and groups it by the key (the first element), then iterates over the groups, simply taking the minimum of the lower limits and the maximum of the upper limits in each one.
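As a quick sanity check, the sample input above should produce:

print(merged)
# [['a1', 0, 10], ['a2', -1, 20]]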
Edit: cleaned up per @JaredGoguen's suggestion.
Second edit: Since the OP seems concerned about performance, I will say that, in my opinion, if you have so many of these entries that performance becomes an issue, you may want to consider using numpy or pandas for a task like this rather than this groupby approach, which isn't going to scale as well.
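A minimal sketch of the numpy idea might look like the following (an illustration only; it is not included in the profiling below and assumes the limits are numeric):

import numpy as np

def merge_numpy(a):
    # Split the columns into arrays, then take a per-name min/max with boolean masks.
    names = np.array([row[0] for row in a])
    lows = np.array([row[1] for row in a])
    highs = np.array([row[2] for row in a])
    merged = []
    for name in np.unique(names):
        mask = names == name
        merged.append([name, lows[mask].min(), highs[mask].max()])
    return merged

# merge_numpy(a) -> [['a1', 0, 10], ['a2', -1, 20]] for the sample input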
I did some profiling:
import numpy as np
import pandas as pd
from itertools import groupby
from operator import itemgetter
def merge_groupby(a):
    merged = []
    for k, g in groupby(sorted(a), key=itemgetter(0)):
        _, low_limits, high_limits = zip(*g)
        merged.append([k, min(low_limits), max(high_limits)])
    return merged
def merge_g4dget(a):
    d = {}
    for name, low, high in a:
        if name not in d:
            d[name] = [low, high]
            continue
        if low<d[name][0]: d[name][0] = low
        if high>d[name][1]: d[name][1] = high
def merge_pandas(a):
    df = pd.DataFrame(a).set_index(0)
    ndf = df.groupby(level=0).agg({1: np.min, 2:np.max})
    return [[k, v[1], v[2]] for k, v in ndf.iterrows()]
if __name__ == "__main__":
    # Construct a large array of these things
    keys = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
    N = 1000000
    get_randint = lambda: np.random.randint(-50, 50)
    large_array = [[np.random.choice(keys), get_randint(), get_randint()]
                   for x in range(N)]
Then, in an IPython shell:
In [1]: run -i groupby_demo.py
In [2]: %load_ext line_profiler
In [3]: %lprun -f merge_groupby merge_groupby(large_array)
Timer unit: 1e-06 s
Total time: 7.01214 s
File: groupby_demo.py
Function: merge_groupby at line 7
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     7                                           def merge_groupby(a):
     8         1            4      4.0      0.0      merged = []
     9         7      4328680 618382.9     61.7      for k, g in groupby(sorted(a), key=itemgetter(0)):
    10         6      2555118 425853.0     36.4          _, low_limits, high_limits = zip(*g)
    11         6       128342  21390.3      1.8          merged.append([k, min(low_limits), max(high_limits)])
    12
    13         1            1      1.0      0.0      return merged
In [4]: %lprun -f merge_g4dget merge_g4dget(large_array)
Timer unit: 1e-06 s
Total time: 2.84788 s
File: groupby_demo.py
Function: merge_g4dget at line 15
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           def merge_g4dget(a):
    16         1            5      5.0      0.0      d = {}
    17   1000001       579263      0.6     20.3      for name, low, high in a:
    18   1000000       668371      0.7     23.5          if name not in d:
    19         6           11      1.8      0.0              d[name] = [low, high]
    20         6            5      0.8      0.0              continue
    21    999994       828477      0.8     29.1          if low<d[name][0]: d[name][0] = low
    22    999994       771750      0.8     27.1          if high>d[name][1]: d[name][1] = high
In [5]: %lprun -f merge_pandas merge_pandas(large_array)
Timer unit: 1e-06 s
Total time: 0.662813 s
File: groupby_demo.py
Function: merge_pandas at line 24
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    24                                           def merge_pandas(a):
    25         1       568868 568868.0     85.8      df = pd.DataFrame(a).set_index(0)
    26         1        92455  92455.0     13.9      ndf = df.groupby(level=0).agg({1: np.min, 2:np.max})
    27         1         1490   1490.0      0.2      return [[k, v[1], v[2]] for k, v in ndf.iterrows()]
From that, it looks like using pandas is the fastest, and the lion's share of the work is actually done in the initial construction of the pandas DataFrame (which would be a fixed cost anyway if you were working with DataFrames or numpy arrays rather than lists of lists in the first place).
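To illustrate that point: if the rows already live in a DataFrame, the merge itself reduces to a single groupby/agg call. A minimal sketch, with column names of my own choosing and the small sample list a from the question:

import pandas as pd

a = [['a1', 1, 10], ['a2', -1, 20], ['a1', 0, 8], ['a2', 0, 1]]
df = pd.DataFrame(a, columns=['name', 'low', 'high'])
# One row per name, with the widened limits
merged_df = df.groupby('name').agg({'low': 'min', 'high': 'max'}).reset_index()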
Note that, for whatever reason, this doesn't agree with the %timeit results:
In [6]: %timeit merge_pandas(large_array)
1 loops, best of 3: 619 ms per loop
In [7]: %timeit merge_g4dget(large_array)
1 loops, best of 3: 396 ms per loop
I'm not sure why, but apparently there is some difference in how the calls are made or something. Either way, if you have other work to do on this data that pandas handles better, you might as well use it.
Answer 2 (score: 1)
This is my attempt at cleaning up Paul's code (feel free to copy it, and I will delete this answer). It seems relatively readable to me:
from itertools import groupby
from operator import itemgetter
a = [['a1', 1, 10], ['a2', -1, 20], ['a1', 0, 8], ['a2', 0, 1]]
merged = []
for key, groups in groupby(sorted(a), key=itemgetter(0)):
    _, lowers, uppers = zip(*groups)
    merged.append([key, min(lowers), max(uppers)])
However, since we know that we only expect each key to occur once, I don't see any harm in using a dictionary:
merged = {}
for key, groups in groupby(sorted(a), key=itemgetter(0)):
    _, lowers, uppers = zip(*groups)
    merged[key] = (min(lowers), max(uppers))
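With the sample input, merged ends up as {'a1': (0, 10), 'a2': (-1, 20)}, so a quick usage example is simply a dict lookup:

merged['a1']   # (0, 10)
merged['a2']   # (-1, 20)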
Answer 3 (score: 0)
Here is one way:
a = [['a1', 1, 10],['a2', -1, 20],['a1', 0, 8], ['a2', 0, 1]]
b = [[key,min(el[1] for el in a if el[0] == key),max(el[2] for el in a if el[0] == key)] for key in set([el[0] for el in a])]
The outer list comprehension builds a set of the keys; the inner generator expressions then use the built-in min/max functions to pair each key with the smallest value among its lower limits and the largest value among its upper limits.
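Expanded into an ordinary loop (just a readability sketch of the same idea), it looks like this; note that each key rescans the whole list, so it does more work than the dict or groupby answers on a large input:

b = []
for key in set(el[0] for el in a):
    lows = [el[1] for el in a if el[0] == key]
    highs = [el[2] for el in a if el[0] == key]
    b.append([key, min(lows), max(highs)])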