It seems that doing math on numpy dtypes (specifically uint32) takes longer than doing the same math on regular python ints. Here is my real-life example code:
import numpy
## Binary encoding of DNA as python int
bDic = {'A': 0 ,'C': 1 ,'G': 2 ,'T': 3 } # DNA to 32bit binary...
tDic = ['A', 'C', 'G', 'T' ] # ...and back again :)
range32 = range(0,32,2)
def string_up2bit(string):
    up2bit = 3
    for char in reversed(string): up2bit = (up2bit << 2) + bDic[char]
    return up2bit

def up2bit_string(value):
    up2bits = [((value >> x) & 3) for x in range32]
    return ''.join([tDic[up2bit] for up2bit in up2bits[:-up2bits[::-1].index(3)-1]])
## Binary encoding of DNA as numpy uint32 (what i will actually be saving to disk)
n0,n1,n2,n3 = numpy.uint32(0),numpy.uint32(1),numpy.uint32(2),numpy.uint32(3)
npbDic = { 'A': n0 ,'C': n1 ,'G': n2 ,'T': n3 } # DNA to 32bit binary...
nptDic = { n0 :'A', n1 :'C', n2 :'G', n3 :'T' } # ...and back again :)
nprange32 = list(numpy.arange(0,32,2,dtype='uint32'))
def np_string_up2bit(string):
    up2bit = n3
    for char in reversed(string): up2bit = (up2bit << n2) + npbDic[char]
    return up2bit

def np_up2bit_string(value):
    up2bits = [((value >> x) & n3) for x in nprange32] # The 32 here makes it 32bit only.
    return ''.join([nptDic[up2bit] for up2bit in up2bits[:-up2bits[::-1].index(n3)-1]])
## Begin test:
## Read 10000000 lines of DNA from a file, convert into binary and back again.
DNA = 'ATTCGACTTGACTG'
r = 0
while r != 10000000:
    r += 1
    #up2bit_string(string_up2bit(DNA))      # Takes 1min 12sec
    np_up2bit_string(np_string_up2bit(DNA)) # Takes 1min 45sec
As you can see at the bottom, the numpy uint32 version takes about 45% longer than the python int version. In the code above there should be no conversions from numpy uint32 back to python int that could explain the slowdown; simply doing the math on uint32s appears to be slower. On my real-world datasets this translates into days of extra compute time.
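To isolate the effect, here is a minimal timing sketch (not part of the benchmark above; the setup strings and variable names are purely illustrative) comparing one shift-and-add on a plain python int with the same operation on numpy uint32 scalars:

import timeit
py_setup = "x = 3"
np_setup = "import numpy; x = numpy.uint32(3); two = numpy.uint32(2); one = numpy.uint32(1)"
print(timeit.timeit("(x << 2) + 1", setup=py_setup, number=10000000))      # plain python int
print(timeit.timeit("(x << two) + one", setup=np_setup, number=10000000))  # numpy uint32 scalars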
Does anyone know how to speed this up? Maybe there is a way to make uint32 math the default in python? Or should I try ctypes instead of numpy dtypes?
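For reference, this is roughly what a ctypes variant would look like (just a sketch, not benchmarked): as far as I can tell, c_uint32 objects don't define the arithmetic operators, so the math would still have to go through .value, i.e. a plain python int:

import ctypes
x = ctypes.c_uint32(3)
# x << 2 raises TypeError: ctypes integers don't overload the arithmetic operators,
# so the shift-and-add has to be done on x.value, which is a plain python int
y = ctypes.c_uint32((x.value << 2) + 1)
print(y.value)  # 13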
Edited so that anyone can test the code, with the DNA data provided inline.