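Is there any built-in operation in NumPy that returns the length of each string in an array?
I don't think any of the NumPy string operations does this; is that correct?
I can do it with a for loop, but maybe there is something more efficient?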
import numpy as np
arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')
sizes = []
for i in arr:
    sizes.append(len(i))
print(sizes)
[5, 3, 3, 10]
Answer 0 (score: 12)
You can use NumPy's vectorize. It is much faster.
mylen = np.vectorize(len)
print(mylen(arr))
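As a minimal sketch of the same idea, the wrapper can also be given an explicit output type through the otypes argument, so that NumPy does not need to infer the result dtype from the first element. The choice of np.int64 here is an assumption for illustration, not something the answer specifies. Note that np.vectorize still calls len once per element in Python.

```python
import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')

# otypes pins the result dtype up front (np.int64 is an illustrative choice).
mylen = np.vectorize(len, otypes=[np.int64])

sizes = mylen(arr)
print(sizes.tolist())  # [5, 3, 3, 10]
```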
Answer 1 (score: 5)
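Here is a comparison of several methods.
Observations:
The viewcast approaches, which rely on argmin, are the fastest by a wide margin from about 1000 elements upward.
map beats the list comprehension.
np.frompyfunc and, to a lesser extent, np.vectorize perform better than their reputation suggests.
Timings are in milliseconds per call (timeit with number=10, scaled by 100).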
method ↓↓ size →→                   |     10|    100|   1000|  10000| 100000|1000000
------------------------------------+-------+-------+-------+-------+-------+-------
np.char.str_len                     |  0.005|  0.036|  0.313|  3.170| 30.698|309.058
list comprehension                  |  0.005|  0.029|  0.283|  2.812| 29.588|273.618
list comprehension after .tolist()  |  0.002|  0.011|  0.109|  1.155| 12.888|133.759
map                                 |  0.002|  0.008|  0.074|  0.825|  9.386|103.074
np.frompyfunc                       |  0.004|  0.010|  0.081|  0.892|  7.985| 81.841
np.vectorize                        |  0.024|  0.030|  0.115|  1.070| 11.557|124.228
viewcast after zero padding         |  0.005|  0.006|  0.034|  0.298|  3.379| 35.487
viewcast                            |  0.010|  0.011|  0.037|  0.280|  2.886| 32.954
Code:
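import numpy as np

flist = []  # registry of (name, function) pairs to benchmark

def timeme(name):
    # Decorator factory: registers the decorated function under `name`.
    def wrap_gen(f):
        flist.append((name, f))
        return f
    return wrap_gen
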
@timeme("np.char.str_len")
def np_char():
    return np.char.str_len(A)

@timeme("list comprehension")
def lst_cmp():
    return [len(a) for a in A]

@timeme("list comprehension after .tolist()")
def lst_cmp_opt():
    return [len(a) for a in A.tolist()]

@timeme("map")
def map_():
    return list(map(len, A.tolist()))

@timeme("np.frompyfunc")
def np_fpf():
    return np.frompyfunc(len, 1, 1)(A)

@timeme("np.vectorize")
def np_vect():
    return np.vectorize(len)(A)

@timeme("viewcast after zero padding")
def np_zt():
    # Pad every string with one extra NUL so argmin always finds a zero.
    N = A.dtype.itemsize//4
    return A.astype(f'U{N+1}').view(np.uint32).reshape(-1, N+1).argmin(1)

@timeme("viewcast")
def np_view():
    v = A.view(np.uint32).reshape(A.size, -1)
    l = np.argmin(v, 1)
    # Rows without any NUL are full-width strings; set their length explicitly.
    l[v[np.arange(len(v)), l] > 0] = v.shape[-1]
    return l

# Sanity check on 1,000,000 random words: every method must agree with the last one.
A = np.random.choice(
    "Blindtext do not use the quick brown fox jumps over the lazy dog".split(),
    1000000)
for _, f in flist[:-1]:
    assert (f()==flist[-1][1]()).all()

from timeit import timeit

# L holds the table column by column: separators, method names, then one column per size.
L = ['|+' + len(flist)*'|',
     [f"{'method ↓↓ size →→':36s}", 36*'-']
     + [f"{name:36s}" for name, f in flist]]
for N in (10, 100, 1000, 10000, 100000, 1000000):
    A = np.random.choice("Blindtext do not use the quick brown fox jumps"
                         " over the lazy dog".split(), N)
    L.append([f"{N:>7d}", 7*'-']
             + [f"{timeit(f, number=10)*100:7.3f}" for name, f in flist])
# Transpose the columns into rows and print each row with its separator.
for sep, *line in zip(*L):
    print(*line, sep=sep)
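To make the viewcast idea easier to follow, here is a small step-by-step sketch on a four-word array. It assumes a contiguous unicode ('U') array whose strings contain no embedded NUL characters; the variable names codes, lengths and full are made up for this illustration.

```python
import numpy as np

A = np.array(['fox', 'jumps', 'do', 'quick'])  # fixed-width unicode, 5 chars per item

# Each 'U' character takes 4 bytes, so a uint32 view yields one code point
# per column, with 0 used as padding after the end of each string.
codes = A.view(np.uint32).reshape(A.size, -1)

# 0 is the smallest possible value, so argmin returns the index of the
# first padding cell, which equals the string length.
lengths = codes.argmin(axis=1)

# Strings that fill the whole width have no padding; fix those rows up.
full = codes.min(axis=1) > 0
lengths[full] = codes.shape[1]

print(lengths.tolist())  # [3, 5, 2, 5]
```

The speed presumably comes from doing all the work in a single vectorized argmin instead of calling len once per element.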
Answer 2 (score: 4)
For me, this would be the way to go:
sizes = [len(i) for i in arr]
Answer 3 (score: 1)
Use str_len from NumPy:
sizes = np.char.str_len(arr)
str_len documentation: https://numpy.org/devdocs/reference/generated/numpy.char.str_len.html
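A short usage sketch, showing that np.char.str_len accepts both the byte-string array from the question and an ordinary unicode array. The variable names are illustrative only.

```python
import numpy as np

byte_arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')
text_arr = np.array(['Hello', 'foo', 'and', 'whatsoever'])

# str_len returns an integer array with one length per element.
byte_lengths = np.char.str_len(byte_arr)
text_lengths = np.char.str_len(text_arr)

print(byte_lengths.tolist())  # [5, 3, 3, 10]
print(text_lengths.tolist())  # [5, 3, 3, 10]
```

This is the most readable option, although in the benchmark from Answer 1 it was not the fastest one.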