Question

I'm trying to speed up a piece of code that generates all possible splits of a string.

splits('foo') -> [('f', 'oo'), ('fo', 'o'), ('foo', '')]

The code for this in Python is very simple:

def splits(text):
    return [(text[:i + 1], text[i + 1:])
            for i in range(len(text))]

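Is there a way to speed this up with Cython or some other means? For context, the larger purpose of this code is to find the split of a string with the highest probability.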

Answer 1

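This isn't the sort of problem that Cython tends to help with much. The code relies on slicing, which ends up running at largely the same speed in Cython as in pure Python (i.e. actually pretty good).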

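Timing with timeit on a 100-character bytes string (b'0'*100) over 10000 iterations, I get:

Your code as written - 0.37s
Your code as written, but compiled in Cython - 0.21s
Your code with the line cdef int i added, compiled with Cython - 0.20s (a small improvement; it matters more for longer strings)
Your code with cdef int i and the parameter typed as bytes text - 0.28s (i.e. worse).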
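
The exact benchmarking script isn't shown, so the following is only a minimal sketch of how such a measurement could be set up for the pure-Python version. The helper name time_splits is made up for illustration, and absolute timings will differ from machine to machine.

```python
import timeit


def splits(text):
    # Pure-Python reference implementation from the question.
    return [(text[:i + 1], text[i + 1:])
            for i in range(len(text))]


def time_splits(func, length=100, number=10000):
    """Hypothetical helper: total seconds for `number` calls of `func`
    on a bytes string of `length` characters (b'0' * length)."""
    data = b'0' * length
    return timeit.timeit(lambda: func(data), number=number)


if __name__ == "__main__":
    # Prints the total time for 10000 calls; the value depends on the machine.
    print(f"pure Python: {time_splits(splits):.2f}s")
```

A compiled Cython variant could be timed the same way by importing it and passing the function to time_splits.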

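The best speed comes from using the Python C API directly (see the code below) - 0.11s. I chose to do this in Cython for convenience (while calling the API functions myself), but you could write very similar code directly in C, with a bit more manual error checking. I've written this for the Python 3 API, assuming you're using bytes objects (i.e. PyBytes rather than PyString), so if you're using Python 2, or Unicode with Python 3, you'll have to change it slightly.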

from cpython cimport *
cdef extern from "Python.h":
    # This isn't included in the cpython definitions
    # using PyObject* rather than object lets us control refcounting
    PyObject* Py_BuildValue(const char*,...) except NULL

def split(text):
   cdef Py_ssize_t l,i
   cdef char* s

   # Cython automatically checks the return value and raises an error if 
   # these fail. This provides a type-check on text
   PyBytes_AsStringAndSize(text,&s,&l)
   output = PyList_New(l)

   for i in range(l):
       # PyList_SET_ITEM steals a reference
       # the casting is necessary to ensure that Cython doesn't
       # decref the result of Py_BuildValue
       PyList_SET_ITEM(output,i,
                       <object>Py_BuildValue('y#y#',s,i+1,s+i+1,l-(i+1)))
   return output

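If you don't want to go all the way to the C API, a version that preallocates the list with output = [None]*len(text) and fills it in a for loop rather than a list comprehension is slightly more efficient than your original version - 0.18s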
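
The preallocated version isn't spelled out above, so here is a rough sketch of what it might look like. The name splits_prealloc is made up for illustration, and the 0.18s figure presumably refers to this kind of loop compiled with Cython (where i could additionally be declared with cdef int i), not to running it as plain Python.

```python
def splits_prealloc(text):
    # Allocate the output list once, up front, instead of growing it.
    n = len(text)
    output = [None] * n
    for i in range(n):
        # Same slicing as the original list comprehension.
        output[i] = (text[:i + 1], text[i + 1:])
    return output
```

It returns the same result as the original, e.g. splits_prealloc('foo') gives [('f', 'oo'), ('fo', 'o'), ('foo', '')].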

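In summary, just compiling it in Cython gives a decent speed-up (a bit less than 2x), and setting the type of i helps a little. That's about all you can achieve with Cython used conventionally. To get full speed you basically need to use the Python C API directly. That gets you a speed-up of a bit under 4x (0.37s down to 0.11s), which I think is pretty decent.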
