Cython string support

Date: 2019-01-17 16:54:58

Tags: python numpy cython

I'm trying to optimize some code. I've managed to optimize most of my project using Numpy and Numba, but there is some string-processing code left that I can't optimize with those tools. I'd therefore like to try optimizing that part with Cython.

The code here takes a run-length-encoded string (a letter, optionally followed by a number indicating how many times that letter repeats) and expands it. The expanded string is then converted into an array of 0s and 1s by mapping each letter to a sequence of 0s and 1s via a dictionary lookup.
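For reference, a minimal sketch of what the run-length expansion step does on a tiny input (the helper name `expand` is illustrative, not part of the code below):

```python
import re

def expand(data):
    # Each match is one character, optionally followed by a decimal repeat count.
    out = []
    for match in re.finditer(r"(?P<char>.)(?P<count>\d+)?", data):
        count = match.group("count")
        out.append(match.group("char") * (int(count) if count else 1))
    return "".join(out)

print(expand("A3B"))   # AAAB
print(expand("gA5Q"))  # gAAAAAQ
```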

Is it possible to optimize this code with Cython?

import numpy as np
import re

vector_list = ["A22gA5BA35QA17gACA3QA7gA9IAAgEIA3wA3gCAAME@EACRHAQAAQBACIRAADQAIA3wAQEE}rm@QfpT}/Mp-.n?",
                "A64IA13CA5RA13wAABA5EAECA5EA4CEgEAABGCAAgAyAABolBCA3WA4GADkBOA?QQgCIECmth.n?"]


_base64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz@}]^+-*/?,._"
_bin2base64 = {"{:06b}".format(i): base64char for i, base64char in enumerate(_base64chars)}
_base642bin = {v: k for k, v in _bin2base64.items()}

_n_vector_ranks_only = np.arange(1023,-1,-1)


def _decompress_get(data):
    for match in re.finditer(r"(?P<char>.)((?P<count>\d+))?", data):
        if not match.group("count"): yield match.group("char")
        else: yield match.group("char") * int(match.group("count"))


def _n_apply_weights(vector):
    return np.multiply(vector, _n_vector_ranks_only)

def n_decompress(compressed_vector):
    decompressed_b64 = "".join(_decompress_get(compressed_vector))
    vectorized = "".join(_base642bin[c] for c in decompressed_b64)[:-2]
    as_binary = np.fromiter(vectorized, int)
    return as_binary


def test(x, y):
    if len(x) != 1024:
        x = n_decompress(x)
    vector_a = _n_apply_weights(x)
    if len(y) != 1024:
        y = n_decompress(y)
    vector_b = _n_apply_weights(y)
    maxPQ = np.sum(np.maximum(vector_a, vector_b))
    return np.sum(np.minimum(vector_a, vector_b))/maxPQ

v1 = vector_list[0]
v2 = vector_list[1]
print(test(v1, v2))

1 answer:

Answer 0 (score: 0)

You can get a pretty good speed-up on the second part of the problem (which you're doing through a dictionary lookup) using Numpy alone. I've replaced the dictionary lookup by indexing into a Numpy array.

I generate the Numpy array at the start. One trick is to realise that letters can be converted into the underlying number that represents them using ord. For an ASCII string this is always between 0 and 127:

_base642bin_array = np.zeros((128,),dtype=np.uint8)
for i in range(len(_base64chars)):
    _base642bin_array[ord(_base64chars[i])] = i
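As a quick sanity check (a self-contained sketch, assuming the `_base64chars` string from the question), the array maps a character's ASCII code straight to its 6-bit value, matching what the dictionary lookup would produce:

```python
import numpy as np

_base64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz@}]^+-*/?,._"

# Lookup table: ASCII code -> position in the base64 alphabet.
_base642bin_array = np.zeros((128,), dtype=np.uint8)
for i, c in enumerate(_base64chars):
    _base642bin_array[ord(c)] = i

print(_base642bin_array[ord("A")])  # 0
print(_base642bin_array[ord("B")])  # 1
print(_base642bin_array[ord("z")])  # 51
```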

I do the conversion into 1s and 0s in the n_decompress function, using a built-in numpy function.

def n_decompress2(compressed_vector):
    # encode is for Python 3: str -> bytes
    decompressed_b64 = "".join(_decompress_get(compressed_vector)).encode()
    # byte string into the underlying numeric data
    decompressed_b64 = np.frombuffer(decompressed_b64,dtype=np.uint8)
    # conversion done by numpy indexing rather than dictionary lookup
    vectorized = _base642bin_array[decompressed_b64]
    # convert to a 2D array of 1s and 0s
    as_binary = np.unpackbits(vectorized[:,np.newaxis],axis=1)
    # remove the two digits you don't care about (always 0) from binary array
    as_binary = as_binary[:,2:]
    # reshape to 1D (and chop off two at the end)
    return as_binary.ravel()[:-2]
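To check the rewrite against the original, here is a self-contained sketch that runs both decoders on the first sample vector and compares the results (it uses `np.frombuffer`, since `np.fromstring` is deprecated for binary input):

```python
import re
import numpy as np

_base64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz@}]^+-*/?,._"
_bin2base64 = {"{:06b}".format(i): c for i, c in enumerate(_base64chars)}
_base642bin = {v: k for k, v in _bin2base64.items()}

_base642bin_array = np.zeros((128,), dtype=np.uint8)
for i, c in enumerate(_base64chars):
    _base642bin_array[ord(c)] = i

def _decompress_get(data):
    for match in re.finditer(r"(?P<char>.)(?P<count>\d+)?", data):
        if not match.group("count"):
            yield match.group("char")
        else:
            yield match.group("char") * int(match.group("count"))

def n_decompress(compressed_vector):
    # Original version: dictionary lookup, one character at a time.
    decompressed_b64 = "".join(_decompress_get(compressed_vector))
    vectorized = "".join(_base642bin[c] for c in decompressed_b64)[:-2]
    return np.fromiter(vectorized, int)

def n_decompress2(compressed_vector):
    # Numpy version: array indexing plus unpackbits.
    decompressed_b64 = "".join(_decompress_get(compressed_vector)).encode()
    codes = np.frombuffer(decompressed_b64, dtype=np.uint8)
    vectorized = _base642bin_array[codes]
    as_binary = np.unpackbits(vectorized[:, np.newaxis], axis=1)[:, 2:]
    return as_binary.ravel()[:-2]

v = "A22gA5BA35QA17gACA3QA7gA9IAAgEIA3wA3gCAAME@EACRHAQAAQBACIRAADQAIA3wAQEE}rm@QfpT}/Mp-.n?"
print(np.array_equal(n_decompress(v), n_decompress2(v)))  # True
```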

This gives me a 2.4x speed-up over your version (note that I haven't changed _decompress_get at all, so both timings include your _decompress_get) just from using Numpy (no Cython/Numba, and I suspect they won't help too much). I think the main advantage is that indexing into an array with numbers is fast compared to a dictionary lookup.


_decompress_get probably could be improved using Cython but it's a significantly harder problem...
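As one possible direction (a sketch, not benchmarked, and assuming a digit only ever appears as a repeat count), the regex can be replaced with an explicit character scan; this is the kind of tight loop that Cython compiles well once the string and indices are typed:

```python
def _decompress_get_scan(data):
    # Manual scan: one character, optionally followed by decimal digits.
    i, n = 0, len(data)
    while i < n:
        char = data[i]
        i += 1
        count = 0
        while i < n and data[i].isdigit():
            count = count * 10 + int(data[i])
            i += 1
        yield char * (count if count else 1)

print("".join(_decompress_get_scan("A3Bg5")))  # AAABggggg
```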