I have a list of x, y coordinates, and I need to sort it by the x coordinate, then by the y coordinate when the x coordinates are equal, and eliminate duplicate coordinates. For example, if the list is:
[[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0],
[349.9, 486.6], [450.0, 313.3]]
I need to rearrange it into:
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6],
[500.0, 400.0]]
(one copy of [450.0, 313.3] removed)
Answer 0 (score: 5)
That is the normal sorting order for a list of lists anyway. Use a dict to deduplicate:
>>> L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
>>> sorted({tuple(x): x for x in L}.values())
[[300.0, 400.0],
[349.9, 486.6],
[350.0, 313.3],
[450.0, 313.3],
[450.0, 486.6],
[500.0, 400.0]]
Answer 1 (score: 2)
Either way, we can use groupby for the deduplication:
>>> import itertools
>>> [k for k, g in itertools.groupby(sorted(data))]
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
Some timings:
>>> from timeit import timeit
>>> from itertools import groupby
>>> from toolz import unique
>>> import numpy as np  # just to create a large example
>>> a = np.random.randint(0, 215, (10000, 2)).tolist()
>>> len([k for k, g in groupby(sorted(a))])
8977 # ~ 10% duplicates
>>>
>>> timeit("[k for k, g in groupby(sorted(a))]", globals=globals(), number=1000)
6.1627248489967315
>>> timeit("sorted({tuple(x): x for x in a}.values())", globals=globals(), number=1000)
6.654527607999626
>>> timeit("sorted(unique(a, key=tuple))", globals=globals(), number=1000)
7.198703720991034
>>> timeit("np.unique(a, axis=0).tolist()", globals=globals(), number=1000)
8.848866895001265
Answer 2 (score: 2)
What you want can easily be done with numpy's unique function:
import numpy as np
u = np.unique(data, axis=0)  # or np.unique(data, axis=0).tolist()
If you are really worried that the array is not sorted by column, then in addition to the above, run np.lexsort():
u = u[np.lexsort((u[:,1], u[:,0]))]
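As a quick sanity check on the question's sample list (a sketch; data is the sample from the question):

```python
import numpy as np

data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
        [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

# np.unique with axis=0 drops duplicate rows and returns them sorted
# lexicographically (by x, then y), which is exactly the desired order here
u = np.unique(data, axis=0)
print(u.tolist())
# → [[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
```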
The results are different when the sample data is more random:
In [1]: import numpy as np
In [2]: from toolz import unique
In [3]: data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
...: [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
...:
In [4]: L = 100000 * data
In [5]: npL = np.array(L)
In [6]: %timeit sorted(unique(L, key=tuple))
125 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit sorted({tuple(x): x for x in L}.values())
139 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit np.unique(L, axis=0)
732 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %timeit np.unique(npL, axis=0)
584 ms ± 8.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @user3483203 solution:
In [57]: %timeit lex(np.asarray(L))
227 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [58]: %timeit lex(npL)
76.2 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answer 3 (score: 2)
We can use np.lexsort and some masking:
import numpy as np

def lex(arr):
    tmp = arr[np.lexsort(arr.T), :]                                # sort rows lexicographically
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]  # drop rows equal to their predecessor
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]         # re-sort by x, then y
L = np.array(L)
lex(L)
# Output:
[[300. 400. ]
[349.9 486.6]
[350. 313.3]
[450. 313.3]
[450. 486.6]
[500. 400. ]]
Functions
import itertools
import numpy as np
from toolz import unique

def chrisz(arr):
    tmp = arr[np.lexsort(arr.T), :]
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

def pp(data):
    return [k for k, g in itertools.groupby(sorted(data))]

def gazer(data):
    return np.unique(data, axis=0)

def wim(L):
    return sorted({tuple(x): x for x in L}.values())

def jpp(L):
    return sorted(unique(L, key=tuple))
Setup
import pandas as pd
import matplotlib.pyplot as plt
from timeit import timeit

res = pd.DataFrame(
    index=['chrisz', 'pp', 'gazer', 'wim', 'jpp'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        npL = np.random.randint(1, 1000, (c, 2)) + np.random.choice(np.random.random(1000), (c, 2))
        L = npL.tolist()
        stmt = '{}(npL)'.format(f) if f in {'chrisz', 'gazer'} else '{}(L)'.format(f)
        setp = 'from __main__ import L, npL, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Validation
npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))
L = npL.tolist()
chrisz(npL).tolist() == pp(L) == gazer(npL).tolist() == wim(L) == jpp(L)
True
Answer 4 (score: 0)
Here is one way using sorted and toolz.unique:
from toolz import unique
res = sorted(unique(L, key=tuple))
print(res)
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
[450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
Note: toolz.unique is also available via the standard library docs as the itertools unique_everseen recipe. The tuple conversion is necessary because the algorithm checks uniqueness by hashing with a set.
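For reference, here is a minimal sketch of that recipe (simplified from the itertools docs; the full recipe also handles unhashable elements):

```python
def unique_everseen(iterable, key=None):
    # Yield elements in first-seen order, skipping any whose key was seen before
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

L = [[450.0, 486.6], [450.0, 313.3], [450.0, 313.3]]
print(sorted(unique_everseen(L, key=tuple)))
# → [[450.0, 313.3], [450.0, 486.6]]
```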
Performance with a set here seems slightly better than with a dict, but as always, you should test on your own data.
L = L*100000
%timeit sorted(unique(L, key=tuple)) # 223 ms
%timeit sorted({tuple(x): x for x in L}.values()) # 243 ms
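If the result doesn't need to keep the inner lists as lists, a plain set of tuples is an even simpler variant (a sketch, not from the answer; note the inner elements come back as tuples):

```python
L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [450.0, 313.3]]

# Hashing the tuples deduplicates; sorted() then orders by x, breaking ties by y
res = sorted(set(map(tuple, L)))
print(res)
# → [(450.0, 313.3), (450.0, 486.6), (500.0, 400.0)]
```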
I suspect this is because unique is lazy and, since sorted does not copy the input data again, your memory overhead is lower.