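I'd like to quickly find the indices of the unique values that lie between two values (epoch times in this case), returning every value between minVal and maxVal exactly once (no value returned twice). A simplified example follows: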
import numpy as np
minVal = 198000
maxVal = 230000
uniqueExample = np.arange(300, dtype=float) # this is how it is expected to exist
# this is how it actually exists, with a small run of repeated values randomly interspersed
example = np.insert(uniqueExample, 200, np.arange(200,210.))*1000 # *1000 to differentiate from the indices
# now begin process of isolating
mask = (example < maxVal) & (example > minVal)
idx = np.argwhere(mask).squeeze()
array([199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211,
212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
238, 239])
# this was
if len(set(example[idx])) != len(example[idx]):
    dupes = np.array([x for n, x in enumerate(example[idx]) if x in example[idx][:n]]).squeeze()
    idx = np.delete(idx, np.nonzero(np.in1d(example[idx], dupes).squeeze()[::2]))
array([199, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,
222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
235, 236, 237, 238, 239])
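This works fine when retrieving O(100) indices, but it is slow for larger datasets of O(100,000)+ (and sometimes it does not seem to remove all the duplicates). I came up with a few alternatives, but they still seem slow. I'm hoping someone can explain why they are slow, or suggest a better/faster approach. Speed is the issue.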
import time
# define testing function for test functions below
def timing(f, n, a):
    print(f.__name__,)
    r = range(n)
    t1 = time.perf_counter()
    for i in r:
        # call the function twice per iteration
        f(a[0],a[1],a[2])
        f(a[0],a[1],a[2])
    t2 = time.perf_counter()
    print(round(t2-t1, 3))
def gettimeBase(example, minVal, maxVal):
    # this is target (speed and simplicity), but returns duplicates
    mask = (example >= minVal) & (example < maxVal)
    idx = np.argwhere(mask).squeeze()
    return idx
## now ones that don't return duplicates
def gettime1(example, minVal, maxVal):
    mask = (example >= minVal) & (example < maxVal)
    idx = np.argwhere(mask).squeeze()
    if np.size(idx) == 0:
        idx = None
    if len(set(example[idx])) !=len(example[idx]):
        ## when there are duplicate times on the server
        times, idxUnique = np.unique(example, return_index=True)
        mask2 = (times >= minVal) & (times < maxVal)
        idx2 = np.argwhere(mask2).squeeze()
        idx = idxUnique[idx2].squeeze()
    assert (sorted(set(example[idx])) == example[idx]).all(), 'Data Still have duplicate times'
    return idx
def gettime2(example, minVal, maxVal):
    if len(set(example)) != len(example):
        ## when there are duplicate times on the server
        times, idxUnique = np.unique(example, return_index=True)
        mask2 = (times >= minVal) & (times < maxVal)
        idx2 = np.argwhere(mask2).squeeze()
        idx = idxUnique[idx2].squeeze()
    else:
        mask = (example >= minVal) & (example < maxVal)
        idx = np.argwhere(mask).squeeze()
    if np.size(idx) == 0:
        return None
    assert (sorted(set(example[idx])) == example[idx]).all(), 'Data Still have duplicate times'
    return idx
testdata = (example, minVal, maxVal)
testfuncs = gettimeBase, gettime1, gettime2
for f in testfuncs:
    timing(f, 100, testdata)
The test results (Python 3) are:
gettimeBase 0.127
gettime1 35.103
gettime2 74.953
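Answer 0 (score: 2)
Option 1
numpy.unique
This option is fast, but it returns the index of the first occurrence of each duplicated value, whereas in your question you appear to be keeping the index of the last occurrence. This means the indices returned by this method will not match your expected output, but the values they point to will be the same.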
vals, indices = np.unique(example[mask], return_index=True)
indices + np.argmax(mask)
array([199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 220, 221,
222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
235, 236, 237, 238, 239], dtype=int64)
Here is the caveat I mentioned:
desired = np.array([199, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,
222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
235, 236, 237, 238, 239])
np.array_equal(indices + np.argmax(mask), desired)
# False
np.array_equal(example[indices + np.argmax(mask)], example[desired])
# True
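Note that `indices + np.argmax(mask)` relies on the selected values forming one contiguous run in `example`, which holds for the time-ordered data in the question. As a minimal sketch for data where that might not hold, the positions of the `True` entries can be looked up explicitly with `np.flatnonzero`. The helper name `first_unique_in_range` is made up for this illustration and is not part of the original answer.

```python
import numpy as np

def first_unique_in_range(arr, mn, mx):
    # Hypothetical helper: index (into arr) of the first occurrence
    # of each unique value strictly between mn and mx.
    mask = (arr > mn) & (arr < mx)
    positions = np.flatnonzero(mask)  # indices into arr where mask is True
    _, first = np.unique(arr[positions], return_index=True)
    return positions[first]           # ordered by value, not by position

example = np.insert(np.arange(300, dtype=float), 200, np.arange(200, 210.)) * 1000
result = first_unique_in_range(example, 198000, 230000)
```

On the question's data this should give the same indices as Option 1; the result is ordered by value, so for unsorted input it may not be monotonically increasing.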
Option 2
numpy.unique + numpy.flip
f = np.flip(example[mask])
vals, indices = np.unique(f, return_index=True)
final = f.shape[0] - 1 - indices
final + np.argmax(mask)
array([199, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,
222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
235, 236, 237, 238, 239], dtype=int64)
This does capture the last occurrence, but it adds more overhead:
np.array_equal(final + idx[0], desired)
# True
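The same last-occurrence idea can be sketched without assuming the masked region is contiguous. This is only an illustration: the helper name `last_unique_in_range` is hypothetical, and a reversed slice is used here in place of `np.flip`.

```python
import numpy as np

def last_unique_in_range(arr, mn, mx):
    # Hypothetical helper: index (into arr) of the last occurrence
    # of each unique value strictly between mn and mx.
    mask = (arr > mn) & (arr < mx)
    positions = np.flatnonzero(mask)
    # In the reversed selection, np.unique's "first occurrence"
    # is the last occurrence in the original order.
    _, first_in_reversed = np.unique(arr[positions][::-1], return_index=True)
    return positions[positions.size - 1 - first_in_reversed]

example = np.insert(np.arange(300, dtype=float), 200, np.arange(200, 210.)) * 1000
result = last_unique_in_range(example, 198000, 230000)
```

On the question's data this should reproduce the `desired` index array shown earlier.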
Performance (I included some of the setup cost)
def chris1(arr, mn, mx):
    mask = (arr < mx) & (arr > mn)
    vals, indices = np.unique(arr[mask], return_index=True)
    return indices + np.argmax(mask)
def chris2(arr, mn, mx):
    mask = (arr < mx) & (arr > mn)
    f = np.flip(arr[mask])
    vals, indices = np.unique(f, return_index=True)
    final = f.shape[0] - 1 - indices
    return final + np.argmax(mask)
def sbfrf(arr, mn, mx):
    # the question's original approach, using the arr argument throughout
    mask = (arr < mx) & (arr > mn)
    idx = np.argwhere(mask).squeeze()
    if len(set(arr[idx])) != len(arr[idx]):
        dupes = np.array([x for n, x in enumerate(arr[idx]) if x in arr[idx][:n]]).squeeze()
        idx = np.delete(idx, np.nonzero(np.in1d(arr[idx], dupes).squeeze()[::2]))
    return idx
In [225]: %timeit chris1(example, 198_000, 230_000)
29.6 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [226]: %timeit chris2(example, 198_000, 230_000)
36.5 µs ± 98.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [227]: %timeit sbfrf(example, 198_000, 230_000)
463 µs ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
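The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module; the repeat and loop counts below are arbitrary choices for illustration, and the numbers it prints will depend on the machine.

```python
import timeit
import numpy as np

def chris1(arr, mn, mx):
    mask = (arr < mx) & (arr > mn)
    vals, indices = np.unique(arr[mask], return_index=True)
    return indices + np.argmax(mask)

example = np.insert(np.arange(300, dtype=float), 200, np.arange(200, 210.)) * 1000

# 5 runs of 1000 calls each; report the best run as time per call
runs = timeit.repeat(lambda: chris1(example, 198_000, 230_000), repeat=5, number=1000)
per_call_us = min(runs) / 1000 * 1e6
print(f"chris1: {per_call_us:.1f} µs per call")
```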