
Time: 2018-03-22 07:51:18

Tags: string python-3.x cython

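I am trying to write a mismatch kernel in Python. Given a numpy boolean array mask, I want to extract all substrings of a string, where the extraction pattern is not necessarily contiguous (e.g. mask = [False,True,False,True], so that from 'ABCD' I extract 'BD'). After extracting the substrings according to this pattern, I can count all the common substrings between my two sequences. As for the extraction step, string[theta] cannot extract such a substring. I currently have the following code block, which works: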
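
To make the extraction step concrete: a Python str only accepts integer or slice indices, so 'ABCD'[theta] raises a TypeError. A minimal sketch of the masking itself, using itertools.compress from the standard library (the helper name extract_masked is made up for illustration and is not part of the question's code):

```python
from itertools import compress


def extract_masked(s, mask):
    """Keep the characters of s whose mask entry is truthy (hypothetical helper)."""
    return ''.join(compress(s, mask))


print(extract_masked('ABCD', [False, True, False, True]))  # BD
```

This does the same job as the ''.join(... for i, b in enumerate(theta) if b) expression in the question, just without the explicit index bookkeeping.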

from collections import Counter

import numpy as np


def function(s1, s2, k, theta):
    # masked substrings of s1: slide a window of length k over s1 and
    # keep only the characters at positions where theta is True
    substrk_itr1 = (s1[i:i+k] for i in range(len(s1) - k + 1))
    l1 = [''.join(substr[i] for i, b in enumerate(theta) if b)
          for substr in substrk_itr1]

    # masked substrings of s2
    substrk_itr2 = (s2[i:i+k] for i in range(len(s2) - k + 1))
    l2 = [''.join(substr[i] for i, b in enumerate(theta) if b)
          for substr in substrk_itr2]

    # count the pairs (one window from each string) whose masked
    # substrings are equal; empty substrings are skipped
    c1 = Counter(l1)
    c2 = Counter(l2)
    x = sum(c1[w] * c2[w] for w in c1 if w)
    return x


k = 5
theta = np.array([False,True, True, True, False])
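
For readers unsure what the final sum computes, here is a small worked sketch of the counting step on hand-written lists of masked substrings. The function name shared_pair_count and the sample lists are assumptions made up for illustration:

```python
from collections import Counter


def shared_pair_count(l1, l2):
    """Count pairs (one entry from each list) that are equal and non-empty."""
    c1, c2 = Counter(l1), Counter(l2)
    # only substrings present in both lists contribute to the sum
    return sum(c1[w] * c2[w] for w in c1.keys() & c2.keys() if w)


l1 = ['BCD', 'BCD', 'CDE']
l2 = ['BCD', 'XYZ']
print(shared_pair_count(l1, l2))  # 2
```

'BCD' occurs twice in l1 and once in l2, so it contributes 2 * 1 = 2 pairs; 'CDE' and 'XYZ' have no counterpart and contribute nothing.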






2 answers:

Answer 0: (score: 1)

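Simply converting the string to a numpy array (dtype=np.int8, the same size as a character) and replacing ''.join(...) with boolean array indexing, substr[theta], gives an easy speedup of about 35%: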

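import numpy as np


def function(s1, k, theta):
    # np.fromstring is deprecated for binary data; np.frombuffer on the
    # encoded bytes gives the same int8 view (assumes ASCII input)
    s1 = np.frombuffer(s1.encode('ascii'), dtype=np.int8)

    substrk_itr1 = (s1[i:i+k] for i in range(len(s1) - k + 1))
    l1 = [substr[theta] for substr in substrk_itr1]

    # bytes objects are hashable, so they can be counted with Counter
    l1 = [x.tobytes() for x in l1]

    # etc for s2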
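
A possible way to finish this idea for both strings is sketched below. It assumes ASCII input and theta given as a numpy boolean array of length k; the names masked_substrings and count_matches are made up for illustration and do not come from the answer.

```python
from collections import Counter

import numpy as np


def masked_substrings(s, k, theta):
    """Return the masked length-k windows of s as bytes (hypothetical helper)."""
    arr = np.frombuffer(s.encode("ascii"), dtype=np.int8)
    # boolean indexing keeps only the positions where theta is True
    return [arr[i:i + k][theta].tobytes() for i in range(len(arr) - k + 1)]


def count_matches(s1, s2, k, theta):
    """Count window pairs whose masked substrings are equal and non-empty."""
    c1 = Counter(masked_substrings(s1, k, theta))
    c2 = Counter(masked_substrings(s2, k, theta))
    return sum(n * c2[w] for w, n in c1.items() if w)


theta = np.array([False, True, True, True, False])
print(count_matches("ABCDE", "XBCDY", 5, theta))  # 1
```

The keys of the counters are bytes rather than str here, which does not matter for counting but would need a .decode("ascii") if the substrings themselves are wanted.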


Answer 1: (score: 0)




from libc.stdint cimport int8_t
cimport cython

# I have included three @cython decorators here to save some checks.
# The first one, boundscheck, is actually the only useful one in this scenario.
# You can see their effect if you generate cython annotations!
# Read more about it here: 
def fast_count_substr_matches(str s1, str s2, int k, int8_t[:] theta):
    cdef int i, j, m  # you used k, unfortunately, so stuck with m...
    cdef bytes b1 = s1.encode("utf-8")
    #alternatively, could just pass in bytes strings instead
    #by prefixing your string literals like b'AAATCGGGT'
    cdef bytes b2 = s2.encode("utf-8")
    cdef char* c1 = b1
    cdef char* c2 = b2
    #python str objects have minor overhead when accessing them with str[index]
    #this is why I bother converting them to char* at the start
    cdef int count = 0
    cdef bint comp  # a C-type int that can be treated as a python bool nicely

    for i in range(len(s1) - k + 1):
        for j in range(len(s2) - k + 1):
            comp = True
            for m in range(k):
                if theta[m] == True and c1[i + m] != c2[j + m]:
                    comp = False
                    break  # one mismatch at a masked position settles this pair
            if comp:
                count += 1
    return count
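
To check the compiled function against something slower but easy to read, a pure-Python version of the same triple loop can be handy. This is only a reference sketch (the name count_substr_matches_reference is made up), and it takes theta as any sequence of truthy/falsy values:

```python
def count_substr_matches_reference(s1, s2, k, theta):
    """Count window pairs (i, j) that agree at every masked position."""
    # positions inside a window that the mask says must match
    positions = [m for m in range(k) if theta[m]]
    count = 0
    for i in range(len(s1) - k + 1):
        for j in range(len(s2) - k + 1):
            if all(s1[i + m] == s2[j + m] for m in positions):
                count += 1
    return count


theta = [False, True, True, True, False]
print(count_substr_matches_reference("ABCDE", "XBCDY", 5, theta))  # 1
```

One thing to watch when calling the Cython function: as far as I know, an int8_t[:] argument will not accept a numpy array with dtype bool directly, so the mask would probably have to be passed as theta.view(np.int8) or theta.astype(np.int8).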
