I have two lex-sorted arrays.
In [2]: a = np.array([1,1,1,2,2,3,5,6,6])
In [3]: b = np.array([10,20,30,5,10,100,10,30,40])
In [4]: ind = np.lexsort((b, a)) # sorts elements first by a and then by b
In [5]: print(a[ind])
[1 1 1 2 2 3 5 6 6]
In [7]: print(b[ind])
[ 10 20 30 5 10 100 10 30 40]
I want to do a binary search for (2,7) and (5,150), expecting (4,7) as the answer.
In [6]: np.lexsearchsorted((a,b), ([2, 5], [7,150]))
We have the searchsorted function, but it only works on 1D arrays.
Answer 0: (score: 1)
Edit: edited to reflect the comments.
from math import floor

def comp_leq(t1, t2):
    # Return 1 if t1 <= t2 in lexicographic order, 0 otherwise
    if (t1[0] > t2[0]) or ((t1[0] == t2[0]) and (t1[1] > t2[1])):
        return 0
    else:
        return 1

def bin_search(L, item):
    x = L[:]
    # Halve the candidate list until a single element is left
    while len(x) > 1:
        index = int(floor(len(x) / 2) - 1)
        # Check item
        if comp_leq(x[index], item):
            x = x[index+1:]
        else:
            x = x[:index+1]
    out = L.index(x[0])
    # If greater than all
    if item >= L[-1]:
        return len(L)
    else:
        return out

def lexsearch(a, b, items):
    # list() so the zipped pairs can be sliced and indexed on Python 3
    z = list(zip(a, b))
    # tuple() so each item can be compared with the tuples in z
    return [bin_search(z, tuple(item)) for item in items]

if __name__ == '__main__':
    a = [1,1,1,2,2,3,5,6,6]
    b = [10,20,30,5,10,100,10,30,40]
    print(lexsearch(a, b, ([2,7],[5,150])))  # prints [4, 7]
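As a point of comparison, Python tuples already compare lexicographically, so the standard library's bisect module can run the same search on the zipped pairs without a hand-written comparison. This is only a minimal sketch: the name lexsearch_bisect is made up for illustration, and bisect_right is chosen on the assumption that, like the function above, the caller wants the position after any equal entries.

```python
from bisect import bisect_right

def lexsearch_bisect(a, b, items):
    # Hypothetical helper: tuples compare element by element,
    # i.e. first by a and then by b
    pairs = list(zip(a, b))
    # bisect_right returns the index of the first pair greater than the item
    return [bisect_right(pairs, tuple(item)) for item in items]

a = [1, 1, 1, 2, 2, 3, 5, 6, 6]
b = [10, 20, 30, 5, 10, 100, 10, 30, 40]
result = lexsearch_bisect(a, b, ([2, 7], [5, 150]))
print(result)  # [4, 7]
```

Using bisect_left instead would give the position before any equal entries, which matches the default side='left' of np.searchsorted.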
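Answer 1: (score: 1)
This code seems to do it for a set of (exactly) 2 lexsorted arrays.
You might be able to make it faster if you build a set of values[-1] and then create a dictionary with the boundaries for each of them.
I haven't checked any cases other than the posted one, so please verify that it isn't bugged.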
def lexsearchsorted_2(arrays, values, side='left'):
    assert len(arrays) == 2
    # the input must already be lexsorted
    assert (np.lexsort(arrays) == np.arange(len(arrays[0]))).all()
    # here it will be faster to work on all equal values in 'values[-1]' in one time
    boundries_l = np.searchsorted(arrays[-1], values[-1], side='left')
    boundries_r = np.searchsorted(arrays[-1], values[-1], side='right')
    # a recursive definition here will make it work for more than 2 lexsorted arrays
    return tuple([boundries_l[i] +
                  np.searchsorted(arrays[-2][boundries_l[i]:boundries_r[i]],
                                  values[-2][i],
                                  side=side)
                  for i in range(len(boundries_l))])
Usage:
import numpy as np
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
lexsearchsorted_2((b, a), ([7,150], [2, 5])) # return (4, 7)
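The speed-up suggested in this answer (computing the boundaries once per distinct value of values[-1]) could look roughly like the sketch below. It is an untimed illustration, not the answerer's code: the name lexsearchsorted_2_cached is hypothetical, and np.unique with return_inverse stands in for the Python set and dictionary.

```python
import numpy as np

def lexsearchsorted_2_cached(arrays, values, side='left'):
    # Hypothetical variant: same argument order as lexsearchsorted_2,
    # i.e. (secondary, primary) for both arrays and values
    secondary, primary = np.asarray(arrays[0]), np.asarray(arrays[1])
    sec_vals, prim_vals = np.asarray(values[0]), np.asarray(values[1])
    # Distinct primary keys, plus the index of each query's key in uniq
    uniq, inverse = np.unique(prim_vals, return_inverse=True)
    # Boundaries are computed once per distinct primary key
    left = np.searchsorted(primary, uniq, side='left')
    right = np.searchsorted(primary, uniq, side='right')
    out = np.empty(len(prim_vals), dtype=np.intp)
    for i, k in enumerate(inverse):
        lo, hi = left[k], right[k]
        # Search the secondary key only inside the block of equal primary keys
        out[i] = lo + np.searchsorted(secondary[lo:hi], sec_vals[i], side=side)
    return out

a = np.array([1, 1, 1, 2, 2, 3, 5, 6, 6])
b = np.array([10, 20, 30, 5, 10, 100, 10, 30, 40])
print(lexsearchsorted_2_cached((b, a), ([7, 150], [2, 5])))  # [4 7]
```

Whether this is actually faster depends on how many queries share a primary key; with mostly distinct keys the extra np.unique call may cost more than it saves.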
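Answer 2: (score: 0)
I ran into the same issue and came up with a different solution. You can treat the multi-column data as single entries by using a structured data type. A structured data type lets you use argsort/sort on the data (instead of lexsort, although lexsort appears to be faster at this stage) and then use the standard searchsorted. Here is an example: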
import numpy as np
from itertools import repeat

# Set up our input data
# Every row is an entry, every column what we want to sort by
# Unlike lexsort, this takes columns in decreasing priority, not increasing
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
data = np.transpose([a,b])

# Sort the data
data = data[np.lexsort(data.T[::-1])]

# Convert to a structured data-type
# list() is needed on Python 3, where zip returns an iterator; empty field names get numpy's default names
dt = np.dtype(list(zip(repeat(''), repeat(data.dtype, data.shape[1])))) # the structured dtype
data = np.ascontiguousarray(data).view(dt).squeeze(-1) # the dtype change leaves a trailing 1 dimension, ascontiguousarray is required for the dtype change
# You can also first convert to the structured data-type with the two lines above then use data.sort()/data.argsort()/np.sort(data)

# Search the data
values = np.array([(2,7),(5,150)], dtype=dt) # note: when using structured data types the rows must be a tuple
pos = np.searchsorted(data, values)

# pos is [4 7] in this example, exactly what you would want
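This works for any number of columns, uses the built-in numpy functions, keeps the columns in "logical" order (decreasing priority), and should be quite fast.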
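The example mentions that you can also convert to the structured data type first and sort afterwards. A minimal sketch of that route, assuming unsorted input and using the made-up field names 'a' and 'b' (the answer itself relies on numpy's default names), shows that sorting the structured array gives the same order as lexsort:

```python
import numpy as np

# Unsorted input, same pairs as in the question but shuffled
a = np.array([5, 1, 2, 1, 6, 2, 3, 1, 6])
b = np.array([10, 30, 5, 10, 40, 10, 100, 20, 30])

# Hypothetical field names, listed in decreasing priority
dt = np.dtype([('a', a.dtype), ('b', b.dtype)])
data = np.empty(len(a), dtype=dt)
data['a'] = a
data['b'] = b

# Sort by field 'a' first and break ties with field 'b'
data_sorted = np.sort(data, order=['a', 'b'])

# The lexsort equivalent takes the keys in increasing priority
ind = np.lexsort((b, a))
print(data_sorted['a'])
print(data_sorted['b'])
```

Both routes end with the rows ordered as in the question, so either can feed the searchsorted call.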
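I compared the two numpy-based methods time-wise.
#1 is the recursive method from @j0ker5 (the version below extends his example with his suggestion of recursion and works with any number of lexsorted rows)
#2 is my structured-array method
They both take the same inputs, basically like searchsorted, except that a and v are laid out as for lexsort.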
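For readers unfamiliar with the lexsort layout, here is a small illustration with made-up data: each row is one key, rows are in increasing priority, so the last row is the primary sort key, and each column is one entry.

```python
import numpy as np

# Rows in increasing priority: the last row is the primary sort key
a = np.array([[30, 10, 5, 20, 10],    # secondary key
              [ 1,  2, 2,  1,  1]])   # primary key
sorted_idx = np.lexsort(a)            # column order that sorts the entries
a_sorted = a[:, sorted_idx]
print(sorted_idx)
print(a_sorted)
```

The values to search for follow the same layout, with one column per query.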
import numpy as np

def lexsearch1(a, v, side='left', sorter=None):
    def _recurse(a, v):
        if a.shape[1] == 0: return 0
        if a.shape[0] == 1: return a.squeeze(0).searchsorted(v.squeeze(0), side)
        bl = np.searchsorted(a[-1,:], v[-1], side='left')
        br = np.searchsorted(a[-1,:], v[-1], side='right')
        return bl + _recurse(a[:-1,bl:br], v[:-1])
    a,v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    if sorter is not None: a = a[:,sorter]
    bl = np.searchsorted(a[-1,:], v[-1,:], side='left')
    br = np.searchsorted(a[-1,:], v[-1,:], side='right')
    for i in range(len(bl)): bl[i] += _recurse(a[:-1,bl[i]:br[i]], v[:-1,i])
    return bl

def lexsearch2(a, v, side='left', sorter=None):
    from itertools import repeat
    a,v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    # list() is needed on Python 3, where zip returns an iterator
    a_dt = np.dtype(list(zip(repeat(''), repeat(a.dtype, a.shape[0]))))
    v_dt = np.dtype(list(zip(a_dt.names, repeat(v.dtype, a.shape[0]))))
    # Flip the rows into decreasing priority and store each column as one contiguous record;
    # recent numpy only allows a size-changing view along a contiguous last axis
    a = np.ascontiguousarray(a[::-1,:].T).view(a_dt).squeeze(-1)
    v = np.ascontiguousarray(v[::-1,:].T).view(v_dt).squeeze(-1)
    return a.searchsorted(v, side, sorter).ravel()

a = np.random.randint(100, size=(2,10000)) # Values to sort, rows in increasing priority
v = np.random.randint(100, size=(2,10000)) # Values to search for, rows in increasing priority
sorted_idx = np.lexsort(a)
a_sorted = a[:,sorted_idx]
Timing results (in IPython):
# 2 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 33.4 ms per loop
%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14 ms per loop
# 10 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 103 ms per loop
%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14.7 ms per loop
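Overall, the structured-array method is faster, and it can be made even faster if you design it to work with flipped and transposed versions of a and v. Its advantage grows with the number of rows/keys: it barely slows down when going from 2 rows to 10 rows.
I did not notice any significant timing difference between using a_sorted and using a with sorter=sorted_idx, so I left those timings out for clarity.
I believe a really fast method could be written in Cython, but this is about as fast as it gets in pure Python and numpy.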
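Since the answers in this thread were only checked against a handful of inputs, a slow reference implementation is handy for validating any of them on random data. The following is a hedged sketch of such a check, not part of the benchmark: lexsearch_bruteforce is a hypothetical name, it takes a and v in the lexsort layout (rows in increasing priority, one column per entry), and it simply counts the entries that sort before each query.

```python
import numpy as np

def lexsearch_bruteforce(a, v, side='left'):
    # Hypothetical O(n*m) reference, for testing only
    a, v = np.asarray(a), np.asarray(v)
    if v.ndim == 1:
        v = v[:, np.newaxis]
    # Flip the rows so the primary key comes first, then turn each column
    # into a tuple; Python tuples compare lexicographically
    keys = [tuple(col) for col in a[::-1].T.tolist()]
    queries = [tuple(col) for col in v[::-1].T.tolist()]
    if side == 'left':
        # number of entries strictly smaller than the query
        return np.array([sum(k < q for k in keys) for q in queries])
    # side == 'right': number of entries smaller than or equal to the query
    return np.array([sum(k <= q for k in keys) for q in queries])

a_small = np.array([[10, 20, 30, 5, 10, 100, 10, 30, 40],
                    [1, 1, 1, 2, 2, 3, 5, 6, 6]])
v_small = np.array([[7, 150],
                    [2, 5]])
print(lexsearch_bruteforce(a_small, v_small))  # [4 7]
```

Comparing its output with a faster function on a few thousand random columns is a cheap way to catch off-by-one errors in the side handling.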