I want to resample sequences like the following:
fastadict = {"seq1" : "ATGCAGTCACGT", "seq2" : "ATGTGTGTACG"}
I wrote the following function:
import sys
import random

def resampling_f(fastadict, seq, num):
    # Strip ambiguous bases (N/n) before sampling
    fastadict[seq] = fastadict[seq].replace("N", "").replace("n", "")
    l = []
    # Draw num letters uniformly at random, with replacement
    new_seq = ''.join([random.choice(fastadict[seq]) for i in range(num)])
    l.append(new_seq)
    return l

# Run function for 20 replicates:
for i in range(20):
    print(resampling_f(fastadict, "seq1", 10))
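This works for the small sequences in the example. In my real work, I need to sample about 1 million letters (DNA bases, ACTG) 10,000 times, and this function is too slow for that. Is there a faster way to sample with replacement in Python?
Answer 0 (score: 2)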
Use the numpy module, which provides vectorized sampling and view casting:
>>> import numpy as np
>>> n, k = 20, 10
>>>
>>> np.random.choice((*fastadict['seq1'],), replace=True, size=(n * k,)).view(f'U{k}')
array(['GCAGAATGCT', 'GGAGGTGCAT', 'CACCATCATT', 'CGTGGTGTAC',
       'AGAATATCGG', 'GATTTTGGCC', 'GAACATAAGC', 'CGGGCCAAGC',
       'GTTGGTGTTT', 'GACCAATAAC', 'ACCCGTAGCC', 'GAATTCCCGG',
       'AACAGGTCAC', 'AGACAAGCAC', 'CACACTTGCC', 'CGTTTGTAAT',
       'CTAGCCCTCG', 'CTCGACATAT', 'GATGATTAGA', 'TCTATCCTCA'],
      dtype='<U10')
Python 2 version:
>>> np.random.choice(tuple(fastadict['seq1']), replace=True, size=(n * k,)).view('S{k}'.format(k=k))
Speed:
>>> from time import perf_counter
>>> n, k = 100, 1000000
>>> t0 = perf_counter(); x = np.random.choice((*fastadict['seq1'],), replace=True, size=(n * k,)).view(f'U{k}'); t1 = perf_counter()
>>> t1-t0
1.29188625497045
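The view call above reinterprets a flat array of single characters as fixed-width strings of length k without copying. The same vectorized idea can be combined with the N-stripping step from the question. What follows is only a minimal sketch, assuming NumPy 1.17 or later (for `default_rng`); the helper name `resample_numpy` and its `seed` parameter are hypothetical and not part of the original answer:

```python
import numpy as np

def resample_numpy(seq, num, replicates, seed=None):
    """Draw `replicates` resamples of length `num` from `seq`, with replacement.

    Hypothetical helper for illustration only.
    """
    # Drop ambiguous bases once, up front.
    cleaned = seq.replace("N", "").replace("n", "")
    # View the ASCII sequence as an array of bytes (one uint8 per base).
    bases = np.frombuffer(cleaned.encode("ascii"), dtype=np.uint8)
    rng = np.random.default_rng(seed)
    # One vectorized draw of all indices, shape (replicates, num).
    idx = rng.integers(0, len(bases), size=(replicates, num))
    sampled = bases[idx]
    # Decode each row of bytes back into a Python string.
    return [row.tobytes().decode("ascii") for row in sampled]

samples = resample_numpy("ATGCAGTCACGT", 10, 20)
```

For 10,000 replicates of 1 million letters the full index array would not fit comfortably in memory, so in practice the replicates would probably have to be drawn in smaller batches.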
Answer 1 (score: 1)
I compared your version, which uses random.choice, with another version that uses random.uniform. The latter is faster. It resamples by generating a random number based on the length of the string and using it as an index into the string.
Here are the two functions being compared: your resampling_f and my resample_new.
import random
import time

fastadict = {"seq1" : "ATGCAGTCACGT", "seq2" : "ATGTGTGTACG"}

def resampling_f(fastadict, seq, num):
    fastadict[seq] = fastadict[seq].replace("N","").replace("n","")
    l = []
    new_seq = ''.join([random.choice(fastadict[seq]) for i in range(num)])
    l.append(new_seq)
    return l

def resample_new(data, num):
    # Index data with a random position; the upper bound is len(data),
    # not num, so that every position of the string can be drawn.
    new = ''.join([data[int(random.uniform(0, len(data)))] for i in range(num)])
    return new
Rather than calling fastadict[seq].replace.... inside the loop, I do it outside the loop so that it runs only once.
Here is the code that compares them:
start_1 = time.time()
# Run function for 20000 replicates:
for i in range(20000):
    print(resampling_f(fastadict, "seq1", 10))
total_1 = time.time() - start_1

start_2 = time.time()
# Strip N/n once, outside the loop
data = fastadict["seq1"].replace("N","").replace("n","")
# Run function for 20000 replicates:
for i in range(20000):
    print(resample_new(data, 10))
total_2 = time.time() - start_2

print("First one: " + str(total_1))
print("Second one: " + str(total_2))
My total_1 was 3.6 seconds and total_2 was 2.966 seconds.
With 200,000 resamples, I got … seconds for the first and … seconds for the second.
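As a further point of comparison, the standard library can draw all the letters of one resample in a single call. This is a sketch of a swapped-in technique, not the approach timed above: it assumes Python 3.6 or later (where `random.choices` exists), and the name `resample_choices` is hypothetical:

```python
import random

def resample_choices(data, num):
    """Return one resample of length `num` drawn from `data` with replacement.

    Hypothetical helper for illustration only.
    """
    # random.choices draws k items with replacement in one call,
    # avoiding a Python-level loop over individual draws.
    return "".join(random.choices(data, k=num))

fastadict = {"seq1": "ATGCAGTCACGT", "seq2": "ATGTGTGTACG"}
# Strip N/n once, outside any replicate loop.
data = fastadict["seq1"].replace("N", "").replace("n", "")
replicates = [resample_choices(data, 10) for _ in range(20)]
```

Whether this beats the random.uniform version on a given machine would need to be measured with the same timing harness.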
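Answer 2 (score: 0)
I suspect the reason it is slow is that every call to random.choice ranges over the whole sequence. Instead, you could range over the sequence once, count how many times each item occurs, and then sample from that distribution, e.g. using numpy.random.choice.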
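The count-then-sample idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `resample_from_counts` is hypothetical, and it relies on the fact that drawing positions uniformly with replacement is equivalent to drawing letters independently according to their frequencies in the sequence:

```python
from collections import Counter

import numpy as np

def resample_from_counts(seq, num):
    """Resample `num` letters from the letter-frequency distribution of `seq`.

    Hypothetical helper for illustration only.
    """
    cleaned = seq.replace("N", "").replace("n", "")
    # Single pass over the sequence to count each letter.
    counts = Counter(cleaned)
    letters = sorted(counts)
    total = sum(counts.values())
    # Empirical probability of each letter.
    probs = [counts[b] / total for b in letters]
    # Sample from the small alphabet instead of the full sequence.
    draws = np.random.choice(letters, size=num, replace=True, p=probs)
    return "".join(draws)

sample = resample_from_counts("ATGCAGTCACGT", 10)
```

With this approach the cost of each draw no longer depends on the length of the original sequence, only on the size of the alphabet.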