I have a list of x, y coordinates, and I need to sort it by the x coordinate, then by the y coordinate when the x coordinates are equal, and eliminate duplicate coordinates. For example, if the list is:
[[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0],
[349.9, 486.6], [450.0, 313.3]]
I need to rearrange it into:
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6],
[500.0, 400.0]]
(one copy of [450.0, 313.3] removed)
Answer 0 (score: 5)
That is the normal sorting order for a list of lists anyway. Use a dict to deduplicate:
>>> L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
>>> sorted({tuple(x): x for x in L}.values())
[[300.0, 400.0],
[349.9, 486.6],
[350.0, 313.3],
[450.0, 313.3],
[450.0, 486.6],
[500.0, 400.0]]
Answer 1 (score: 2)
Either way, we can use groupby for the deduplication:
>>> import itertools
>>> [k for k, g in itertools.groupby(sorted(data))]
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
Some timings:
>>> from timeit import timeit
>>> from itertools import groupby
>>> from toolz import unique
>>> import numpy as np  # just to create a large example
>>> a = np.random.randint(0, 215, (10000, 2)).tolist()
>>> len([k for k, g in groupby(sorted(a))])
8977 # ~ 10% duplicates
>>>
>>> timeit("[k for k, g in groupby(sorted(a))]", globals=globals(), number=1000)
6.1627248489967315
>>> timeit("sorted({tuple(x): x for x in a}.values())", globals=globals(), number=1000)
6.654527607999626
>>> timeit("sorted(unique(a, key=tuple))", globals=globals(), number=1000)
7.198703720991034
>>> timeit("np.unique(a, axis=0).tolist()", globals=globals(), number=1000)
8.848866895001265
Answer 2 (score: 2)
What you want can easily be done with numpy's unique function:
import numpy as np
u = np.unique(data, axis=0)  # or np.unique(data, axis=0).tolist()
If you are really worried that the array is not sorted by column, then in addition to the above, run np.lexsort():
u = u[np.lexsort((u[:,1], u[:,0]))]
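As a quick sanity check on the question's sample list (a sketch; data is the sample from the question):

```python
import numpy as np

data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
        [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

# np.unique with axis=0 drops duplicate rows and returns them sorted
# lexicographically (by x, then y), which is exactly the desired order here
u = np.unique(data, axis=0)
print(u.tolist())
# → [[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
```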
The results are different when the sample data is more random:
In [1]: import numpy as np
In [2]: from toolz import unique
In [3]: data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
...: [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
...:
In [4]: L = 100000 * data
In [5]: npL = np.array(L)
In [6]: %timeit sorted(unique(L, key=tuple))
125 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit sorted({tuple(x): x for x in L}.values())
139 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit np.unique(L, axis=0)
732 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %timeit np.unique(npL, axis=0)
584 ms ± 8.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @user3483203 solution:
In [57]: %timeit lex(np.asarray(L))
227 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [58]: %timeit lex(npL)
76.2 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answer 3 (score: 2)
We can use np.lexsort and some masking:
import numpy as np

def lex(arr):
    tmp = arr[np.lexsort(arr.T), :]                                # sort rows lexicographically
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]  # drop rows equal to their predecessor
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]         # re-sort by x, then y
L = np.array(L)
lex(L)
# Output:
[[300. 400. ]
[349.9 486.6]
[350. 313.3]
[450. 313.3]
[450. 486.6]
[500. 400. ]]
Functions
import itertools
import numpy as np
from toolz import unique

def chrisz(arr):
    tmp = arr[np.lexsort(arr.T), :]
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

def pp(data):
    return [k for k, g in itertools.groupby(sorted(data))]

def gazer(data):
    return np.unique(data, axis=0)

def wim(L):
    return sorted({tuple(x): x for x in L}.values())

def jpp(L):
    return sorted(unique(L, key=tuple))
Setup
import pandas as pd
import matplotlib.pyplot as plt
from timeit import timeit

res = pd.DataFrame(
    index=['chrisz', 'pp', 'gazer', 'wim', 'jpp'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        npL = np.random.randint(1, 1000, (c, 2)) + np.random.choice(np.random.random(1000), (c, 2))
        L = npL.tolist()
        stmt = '{}(npL)'.format(f) if f in {'chrisz', 'gazer'} else '{}(L)'.format(f)
        setp = 'from __main__ import L, npL, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Validation
npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))
L = npL.tolist()
chrisz(npL).tolist() == pp(L) == gazer(npL).tolist() == wim(L) == jpp(L)
True
Answer 4 (score: 0)
Here is one way using sorted and toolz.unique:
from toolz import unique
res = sorted(unique(L, key=tuple))
print(res)
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
[450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
Note: toolz.unique is also available via the standard library docs as the itertools unique_everseen recipe. The tuple conversion is necessary because the algorithm checks uniqueness by hashing with a set.
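For reference, here is a minimal sketch of that recipe (simplified from the itertools docs; the full recipe also handles unhashable elements):

```python
def unique_everseen(iterable, key=None):
    # Yield elements in first-seen order, skipping any whose key was seen before
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

L = [[450.0, 486.6], [450.0, 313.3], [450.0, 313.3]]
print(sorted(unique_everseen(L, key=tuple)))
# → [[450.0, 313.3], [450.0, 486.6]]
```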
Performance with a set here seems slightly better than with a dict, but as always, you should test on your own data.
L = L*100000
%timeit sorted(unique(L, key=tuple)) # 223 ms
%timeit sorted({tuple(x): x for x in L}.values()) # 243 ms
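If the result doesn't need to keep the inner lists as lists, a plain set of tuples is an even simpler variant (a sketch, not from the answer; note the inner elements come back as tuples):

```python
L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [450.0, 313.3]]

# Hashing the tuples deduplicates; sorted() then orders by x, breaking ties by y
res = sorted(set(map(tuple, L)))
print(res)
# → [(450.0, 313.3), (450.0, 486.6), (500.0, 400.0)]
```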
I suspect this is because unique is lazy and, since sorted does not copy the input data again, your memory overhead is lower.