I'm trying to optimize some code. I've managed to optimize most of my project using Numpy and Numba, but there is some string-processing code left over that I can't optimize with those tools, so I'd like to try optimizing that part with Cython.

The code here takes a run-length-encoded string (a letter, optionally followed by a number indicating how many times that letter is repeated) and expands it; for example, "A3B" expands to "AAAB". The expanded string is then converted into an array of 0s and 1s by matching each letter to a sequence of 0s and 1s via a dictionary lookup.

Is it possible to optimize this code using Cython?
import numpy as np
import re

vector_list = ["A22gA5BA35QA17gACA3QA7gA9IAAgEIA3wA3gCAAME@EACRHAQAAQBACIRAADQAIA3wAQEE}rm@QfpT}/Mp-.n?",
               "A64IA13CA5RA13wAABA5EAECA5EA4CEgEAABGCAAgAyAABolBCA3WA4GADkBOA?QQgCIECmth.n?"]

_base64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz@}]^+-*/?,._"
_bin2base64 = {"{:06b}".format(i): base64char for i, base64char in enumerate(_base64chars)}
_base642bin = {v: k for k, v in _bin2base64.items()}
_n_vector_ranks_only = np.arange(1023, -1, -1)

def _decompress_get(data):
    for match in re.finditer(r"(?P<char>.)((?P<count>\d+))?", data):
        if not match.group("count"):
            yield match.group("char")
        else:
            yield match.group("char") * int(match.group("count"))

def _n_apply_weights(vector):
    return np.multiply(vector, _n_vector_ranks_only)

def n_decompress(compressed_vector):
    decompressed_b64 = "".join(_decompress_get(compressed_vector))
    vectorized = "".join(_base642bin[c] for c in decompressed_b64)[:-2]
    as_binary = np.fromiter(vectorized, int)
    return as_binary

def test(x, y):
    if len(x) != 1024:
        x = n_decompress(x)
    vector_a = _n_apply_weights(x)
    if len(y) != 1024:
        y = n_decompress(y)
    vector_b = _n_apply_weights(y)
    maxPQ = np.sum(np.maximum(vector_a, vector_b))
    return np.sum(np.minimum(vector_a, vector_b)) / maxPQ

v1 = vector_list[0]
v2 = vector_list[1]
print(test(v1, v2))
Answer 0 (score: 0)
You can get a pretty good speed-up on the second part of the problem (the part you're doing through a dictionary lookup) using Numpy alone. I've replaced the dictionary lookup with indexing into a Numpy array.

I generate the Numpy array at the start. The trick is to realise that letters can be converted into the underlying numbers that represent them using ord; for an ASCII string these are always between 0 and 127:
_base642bin_array = np.zeros((128,), dtype=np.uint8)
for i in range(len(_base64chars)):
    _base642bin_array[ord(_base64chars[i])] = i
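A quick sanity check (my addition, not strictly necessary) that the array lookup agrees with the original dictionary for every character:

for ch in _base64chars:
    # the array stores the integer code; the dictionary stores its 6-bit string
    assert "{:06b}".format(int(_base642bin_array[ord(ch)])) == _base642bin[ch]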
I then do the conversion into 1s and 0s in an n_decompress2 function, using built-in Numpy functions:
def n_decompress2(compressed_vector):
    # encode is for Python 3: str -> bytes
    decompressed_b64 = "".join(_decompress_get(compressed_vector)).encode()
    # view the byte string as the underlying numeric data
    # (np.frombuffer is the non-deprecated replacement for np.fromstring)
    decompressed_b64 = np.frombuffer(decompressed_b64, dtype=np.uint8)
    # conversion done by Numpy indexing rather than a dictionary lookup
    vectorized = _base642bin_array[decompressed_b64]
    # expand each value into a 2D array of 1s and 0s (8 bits per byte)
    as_binary = np.unpackbits(vectorized[:, np.newaxis], axis=1)
    # remove the two high bits you don't care about (always 0 for 6-bit values)
    as_binary = as_binary[:, 2:]
    # reshape to 1D (and chop off two at the end, as in the original)
    return as_binary.ravel()[:-2]
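Since the two implementations should be interchangeable, a quick check that they produce identical output on both test vectors:

for v in vector_list:
    assert np.array_equal(n_decompress(v), n_decompress2(v))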
This gives me a 2.4x speed-up over your version (note that I haven't changed _decompress_get at all, so both timings include your _decompress_get), just from using Numpy: no Cython or Numba, and I suspect they won't help much with this part. I think the main advantage is that indexing into an array with an array of numbers is fast compared to a per-character dictionary lookup.
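To see that claim in isolation, here's a rough sketch using the names defined above. The two calls don't produce identical outputs (the dictionary builds the bit string, the array lookup produces numeric codes); they just exercise the same mapping step, and the numbers will vary by machine:

import timeit

s = "ABCD" * 10000
codes = np.frombuffer(s.encode(), dtype=np.uint8)

# per-character dictionary lookup vs. one vectorised fancy-indexing call
t_dict = timeit.timeit(lambda: "".join(_base642bin[c] for c in s), number=100)
t_array = timeit.timeit(lambda: _base642bin_array[codes], number=100)
print(t_dict, t_array)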
_decompress_get could probably be improved using Cython, but it's a significantly harder problem...
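If you do want to attack _decompress_get without leaving pure Python first, one untested sketch (the _decompress_sub name is mine) is to let re.sub expand the runs in a single pass instead of joining a generator:

def _decompress_sub(data):
    # Expand each "<char><digits>" run in one pass; characters with no
    # trailing count don't match the pattern and are left as-is.
    return re.sub(r"(.)(\d+)", lambda m: m.group(1) * int(m.group(2)), data)

Whether that actually beats the generator version would need profiling on your data.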