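In some Python code I'm writing, I need to count the occurrences of any of a set of characters in a string. In other words, I need the total number of occurrences of the characters [c1, c2, c3, ..., cn] in a string.
In C, a function called strpbrk() can be used for this, and on x86 processors it often uses special instructions to make it faster.
In Python I have written the following code, but it is the slowest part of my application.

    haystack = <query string>
    gc_characters = 0
    for c in ['c', 'C', 'g', 'G']:
        gc_characters += haystack.count(c)

Is there a faster way?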
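Answer 0 (score: 3)
I may have gone a bit overboard here, but the tl;dr is that the original code is actually faster than (EDIT: macOS's) strpbrk(), although some strpbrk() implementations may be faster!
str.count() uses this bundle of strange and beautiful magic in its guts, so it's no wonder it is fast.
The full code is available at https://github.com/akx/so55822235
These approaches are all pure Python, including OP's original version: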
    from collections import Counter

    def gc_characters_original(haystack):
        gc_characters = 0
        for c in ("c", "C", "g", "G"):
            gc_characters += haystack.count(c)
        return gc_characters

    def gc_characters_counter(haystack):
        counter = Counter(haystack)
        return sum(counter.get(c, 0) for c in ["c", "C", "g", "G"])

    def gc_characters_manual(haystack):
        gc_characters = 0
        for x in haystack:
            if x in ("c", "C", "g", "G"):
                gc_characters += 1
        return gc_characters

    def gc_characters_iters(haystack):
        gc_characters = haystack.count("c") + haystack.count("C") + haystack.count("g") + haystack.count("G")
        return gc_characters
Cython, using strpbrk():

    from libc.string cimport strpbrk

    cdef int _count(char* s, char *key):
        assert s is not NULL, "byte string value is NULL"
        cdef int n = 0
        cdef char* pch = strpbrk(s, key)
        while pch is not NULL:
            n += 1
            pch = strpbrk(pch + 1, key)
        return n

    def count(s, key):
        return _count(s, key)

    ...

    def gc_characters_cython(haystack_bytes):
        return charcount_cython.count(haystack_bytes, b"cCgG")
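For readers who do not know strpbrk(): it returns a pointer to the first character of a string that matches any character of a key set, or NULL if there is none. The counting loop above can be sketched roughly in pure Python as follows. The names strpbrk_index and count_with_strpbrk are hypothetical and not part of the linked repository, and the sketch only illustrates the logic; it is not meant to be fast.

```python
def strpbrk_index(s, key, start=0):
    """Index of the first character of s[start:] found in key, or -1.

    Hypothetical stand-in for C's strpbrk(), which returns a pointer
    (or NULL) instead of an index (or -1).
    """
    for i in range(start, len(s)):
        if s[i] in key:
            return i
    return -1


def count_with_strpbrk(s, key):
    """Count characters of s that occur in key by repeated searching."""
    n = 0
    pos = strpbrk_index(s, key)
    while pos != -1:
        n += 1
        # Resume the search just past the previous match.
        pos = strpbrk_index(s, key, pos + 1)
    return n


print(count_with_strpbrk("ACGTacgtNN", "cCgG"))  # 4
```

Each match restarts the search one position later, so the whole string is still scanned only once in total.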
C extension, using strpbrk():

    #define PY_SSIZE_T_CLEAN
    #include <Python.h>
    #include <string.h>

    /* Count the characters of str that occur in key. */
    static unsigned int count(const char *str, const char *key) {
      unsigned int n = 0;
      char *pch = strpbrk(str, key);
      while (pch != NULL) {
        n++;
        pch = strpbrk(pch + 1, key);
      }
      return n;
    }

    static PyObject *charcount_count(PyObject *self, PyObject *args) {
      const char *str, *key;
      Py_ssize_t strl, keyl;
      /* "s#" accepts both str (as UTF-8) and bytes-like objects. */
      if (!PyArg_ParseTuple(args, "s#s#", &str, &strl, &key, &keyl)) {
        PyErr_SetString(PyExc_RuntimeError, "invalid arguments");
        return NULL;
      }
      int n = count(str, key);
      return PyLong_FromLong(n);
    }

    static PyMethodDef CharCountMethods[] = {
        {"count", charcount_count, METH_VARARGS,
         "Count the total occurrences of any s2 characters in s1"},
        {NULL, NULL, 0, NULL},
    };

    static struct PyModuleDef spammodule = {PyModuleDef_HEAD_INIT, "charcount",
                                            NULL, -1, CharCountMethods};

    PyMODINIT_FUNC PyInit_charcount(void) { return PyModule_Create(&spammodule); }

    ...

    def gc_characters_cext_b(haystack_bytes):
        return charcount.count(haystack_bytes, b"cCgG")

    def gc_characters_cext_u(haystack):
        return charcount.count(haystack, "cCgG")
On my Mac, counting cCgG in a one-million-character string of random letters, i.e.

    import random
    import string
    import timeit

    haystack = "".join(random.choice(string.ascii_letters) for x in range(1_000_000))
    haystack_bytes = haystack.encode()
    print("original", timeit.timeit(lambda: gc_characters_original(haystack), number=100))
    print("unrolled", timeit.timeit(lambda: gc_characters_iters(haystack), number=100))
    print("cython", timeit.timeit(lambda: gc_characters_cython(haystack_bytes), number=100))
    print("c extension, bytes", timeit.timeit(lambda: gc_characters_cext_b(haystack_bytes), number=100))
    print("c extension, unicode", timeit.timeit(lambda: gc_characters_cext_u(haystack), number=100))
    print("manual loop", timeit.timeit(lambda: gc_characters_manual(haystack), number=100))
    print("counter", timeit.timeit(lambda: gc_characters_counter(haystack), number=100))

gives the following results (seconds for 100 calls):
original 0.34033612700000004
unrolled 0.33661798900000006
cython 0.6542106270000001
c extension, bytes 0.46668797900000003
c extension, unicode 0.4761082090000004
manual loop 11.625232557
counter 7.0389275090000005
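So, unless the strpbrk() implementation in my Mac's libc is horribly underpowered (EDIT: which it is), it's best to just use .count().
I added glibc's strcspn()/strpbrk(), and it is startlingly faster than the naïve version of strpbrk() shipped with macOS: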
original 0.329256
unrolled 0.333872
cython 0.433299
c extension, bytes 0.432552
c extension, unicode 0.437332
c extension glibc, bytes 0.169704 <-- new
c extension glibc, unicode 0.158153 <-- new
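glibc also has SSE2 and SSE4 versions of these functions, which would probably be even faster still.
I came back to this one more time because I had an epiphany about how the clever lookup table in glibc's strcspn() could be used for character counting: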
    #include <stddef.h>

    size_t fastcharcount(const char *str, const char *haystack) {
      size_t count = 0;

      // Prepare lookup table.
      // It will contain 1 for all characters in the haystack.
      unsigned char table[256] = {0};
      unsigned char *ts = (unsigned char *)haystack;
      while(*ts) table[*ts++] = 1;

      unsigned char *s = (unsigned char *)str;
      // Stop at the terminating NUL; otherwise add 0 or 1 from the table.
      #define CHECK_CHAR(i) { if(!s[i]) break; count += table[s[i]]; }
      for(;;) {
        CHECK_CHAR(0);
        CHECK_CHAR(1);
        CHECK_CHAR(2);
        CHECK_CHAR(3);
        s += 4;
      }
      #undef CHECK_CHAR
      return count;
    }
The results are very impressive: it beats the glibc implementation by about 4x and the original Python implementation by about 8.5x.
original | 6.463880 sec / 2000 iter | 309 iter/s
unrolled | 6.378582 sec / 2000 iter | 313 iter/s
cython libc | 8.443358 sec / 2000 iter | 236 iter/s
cython glibc | 2.936697 sec / 2000 iter | 681 iter/s
cython fast | 0.766082 sec / 2000 iter | 2610 iter/s
c extension, bytes | 8.373438 sec / 2000 iter | 238 iter/s
c extension, unicode | 8.394805 sec / 2000 iter | 238 iter/s
c extension glib, bytes | 2.988184 sec / 2000 iter | 669 iter/s
c extension glib, unicode | 2.992429 sec / 2000 iter | 668 iter/s
c extension fast, bytes | 0.754072 sec / 2000 iter | 2652 iter/s
c extension fast, unicode | 0.762074 sec / 2000 iter | 2624 iter/s
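The lookup-table idea behind fastcharcount() can also be sketched in Python, purely to show the logic. This is an illustration under assumptions, not code from the linked repository: table_count is a hypothetical name, and a Python-level loop like this will not come close to the C timings above.

```python
def table_count(data: bytes, key: bytes) -> int:
    """Count the bytes of data that occur in key using a 256-entry table."""
    # table[b] is 1 if byte value b is one of the characters we look for.
    table = bytearray(256)
    for b in key:
        table[b] = 1
    # Every input byte is one table lookup; no per-character searching.
    return sum(table[b] for b in data)


print(table_count(b"ACGTacgtNN", b"cCgG"))  # 4
```

Unlike the C function, which stops at the first NUL byte, this sketch counts over the whole bytes object.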
Answer 1 (score: 1)
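Every call to .count iterates over the whole haystack, whereas the alternatives I suggest here iterate over it only once. Which is faster depends on how many characters there are in your real case. You could try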
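
    from collections import Counter
    cnt = Counter(haystack)
    gc_characters = sum(cnt.get(e, 0) for e in ['c', 'C', 'g', 'G'])

since this iterates over the string once and stores a count for every character that occurs. It may be faster to look only for the characters you care about, keeping them in a set for a faster __contains__.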

    gc_chars = {'c', 'C', 'g', 'G'}
    counts = {e: 0 for e in gc_chars}
    for c in haystack:  # iterate over the string, not over gc_chars
        if c in gc_chars:
            counts[c] += 1
    gc_characters = sum(counts.values())

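If you give more details about the composition of haystack and how often this is called, we can try to help you further.
Another idea: if haystack is often the same, you could keep a cache of the answers in memory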
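Since the outcome depends on your data, one way to decide is to time the candidates on a representative input. Below is a minimal sketch, assuming a random string of 100,000 ASCII letters as a stand-in for the real haystack; the function names and GC_CHARS are hypothetical.

```python
import random
import string
import timeit
from collections import Counter

GC_CHARS = frozenset("cCgG")


def count_with_str_count(haystack):
    # One C-level scan of the string per character of interest.
    return sum(haystack.count(c) for c in GC_CHARS)


def count_with_counter(haystack):
    # A single pass that tallies every distinct character.
    counts = Counter(haystack)
    return sum(counts[c] for c in GC_CHARS)


def count_with_set(haystack):
    # A single Python-level pass with a set membership test.
    return sum(1 for c in haystack if c in GC_CHARS)


# Assumed test input; replace it with a sample of your real data.
random.seed(0)
sample = "".join(random.choice(string.ascii_letters) for _ in range(100_000))

for func in (count_with_str_count, count_with_counter, count_with_set):
    elapsed = timeit.timeit(lambda: func(sample), number=10)
    print(f"{func.__name__:22s} {elapsed:.4f} s / 10 calls")
```

All three functions return the same count, so the timings can be compared directly.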

    from functools import lru_cache

    @lru_cache
    def haystack_metric(haystack):
        return sum(haystack.count(c) for c in ['c', 'C', 'g', 'G'])

(whichever implementation you go with). You could also explore heavily optimized alternatives, but I have very little experience with that.
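To check whether such a cache actually helps, functools.lru_cache exposes cache_info(). A minimal sketch, assuming a made-up query string and an arbitrary maxsize of 128:

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # maxsize is an arbitrary choice for this sketch
def haystack_metric(haystack):
    return sum(haystack.count(c) for c in "cCgG")


query = "ACGTacgtNN" * 1000  # hypothetical repeated query string

first = haystack_metric(query)   # computed: a cache miss
second = haystack_metric(query)  # served from the cache: a hit

info = haystack_metric.cache_info()
print(first, second, info.hits, info.misses)  # 4000 4000 1 1
```

The cache keeps a reference to every string it has seen, up to maxsize entries, so with many large and distinct haystacks it costs memory without saving time.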