Efficient resampling with Python

Time: 2018-02-17 15:51:34

Tags: python numpy random

I want to resample a sequence such as the following:

fastadict = {"seq1" : "ATGCAGTCACGT", "seq2" : "ATGTGTGTACG"}

I wrote the following function:

import sys
import random

def resampling_f(fastadict, seq, num):
    fastadict[seq] = fastadict[seq].replace("N","").replace("n","")
    l = []
    new_seq = ''.join([random.choice(fastadict[seq]) for i in range(num)]) 
    l.append(new_seq)
    return l

# Run function for 20 replicates:
for i in range(20):
    print resampling_f(fastadict, "seq1", 10)

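This works for the small sequences in the example. In my work, I need to sample about 1 million letters (DNA bases, ACTG) 10,000 times. This function is too slow for that purpose. Is there a faster way to sample with replacement using Python?

3 answers:

Answer 0: (score: 2)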

Use the numpy module, which provides vectorized sampling and view casting:

>>> import numpy as np
>>> n, k = 20, 10
>>>
>>> np.random.choice((*fastadict['seq1'],), replace=True, size=(n * k,)).view(f'U{k}')
array(['GCAGAATGCT', 'GGAGGTGCAT', 'CACCATCATT', 'CGTGGTGTAC',
       'AGAATATCGG', 'GATTTTGGCC', 'GAACATAAGC', 'CGGGCCAAGC',
       'GTTGGTGTTT', 'GACCAATAAC', 'ACCCGTAGCC', 'GAATTCCCGG',
       'AACAGGTCAC', 'AGACAAGCAC', 'CACACTTGCC', 'CGTTTGTAAT',
       'CTAGCCCTCG', 'CTCGACATAT', 'GATGATTAGA', 'TCTATCCTCA'],
      dtype='<U10')

Python 2 version:

>>> np.random.choice(tuple(fastadict['seq1']), replace=True, size=(n * k,)).view('S{k}'.format(k=k))

Speed:

>>> from time import perf_counter
>>> n, k = 100, 1000000
>>> t0 = perf_counter(); x = np.random.choice((*fastadict['seq1'],), replace=True, size=(n * k,)).view(f'U{k}'); t1 = perf_counter()
>>> t1-t0
1.29188625497045

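At the scale in the question (about 1 million letters, 10,000 times), holding every replicate in memory at once may not be practical. The following is a minimal sketch, not part of the original answer, that builds one replicate at a time from a one-byte-per-base array. The function name resample_sequence and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def resample_sequence(seq, num, rng):
    """Return one string of length num drawn with replacement from seq."""
    # Drop ambiguous bases once, then view the sequence as raw ASCII bytes.
    cleaned = seq.replace("N", "").replace("n", "")
    bases = np.frombuffer(cleaned.encode("ascii"), dtype=np.uint8)
    # Draw num random positions and gather the corresponding bytes.
    idx = rng.integers(0, bases.size, size=num)
    return bases[idx].tobytes().decode("ascii")

rng = np.random.default_rng(0)  # seed chosen only for reproducibility (assumption)
fastadict = {"seq1": "ATGCAGTCACGT", "seq2": "ATGTGTGTACG"}
replicates = [resample_sequence(fastadict["seq1"], 10, rng) for _ in range(20)]
```

Each replicate can be processed and discarded before the next one is drawn, so peak memory stays proportional to a single replicate.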

Answer 1: (score: 1)

I compared your version, which uses random.choice, with another version that uses random.uniform. The latter is faster. It resamples by generating random numbers based on the length of the string and using them as indices into the string.

Here are the two functions to compare, your resampling_f and my resample_new:

import random
import time

fastadict = {"seq1" : "ATGCAGTCACGT", "seq2" : "ATGTGTGTACG"}

def resampling_f(fastadict, seq, num):
    fastadict[seq] = fastadict[seq].replace("N","").replace("n","")
    l = []
    new_seq = ''.join([random.choice(fastadict[seq]) for i in range(num)])
    l.append(new_seq)
    return l

def resample_new(data, num):
    # Draw num random indices in [0, len(data)) and join the selected characters.
    # The upper bound is the string length, not num, so every position can be
    # drawn and no index runs past the end of the string.
    n = len(data)
    return ''.join([data[int(random.uniform(0, n))] for i in range(num)])

Instead of calling fastadict[seq].replace.... inside the loop, do it outside the loop so that it is done only once.

Here is the code that compares them:

start_1 = time.time()
# Run function for 20000 replicates:
for i in range(20000):
    print(resampling_f(fastadict, "seq1", 10))
total_1 = time.time() - start_1

start_2 = time.time()
data = fastadict["seq1"].replace("N","").replace("n","")
# Run function for 20000 replicates:
for i in range(20000):
    print(resample_new(data, 10))
total_2 = time.time() - start_2

print("First one: " + str(total_1))
print("Second one: " + str(total_2))

With 200,000 resamples, I get 3.6 seconds for the first one and 2.966 seconds for the second.
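As a possible standard-library alternative to building the string one index at a time, random.choices (available since Python 3.6) draws all k samples with replacement in a single call. This is a minimal sketch rather than part of the original answer, and the name resample_choices is hypothetical.

```python
import random

def resample_choices(data, num):
    """Return a string of num characters drawn with replacement from data."""
    # random.choices samples with replacement and returns a list of length k.
    return ''.join(random.choices(data, k=num))

fastadict = {"seq1": "ATGCAGTCACGT", "seq2": "ATGTGTGTACG"}
# Clean the sequence once, outside the loop.
data = fastadict["seq1"].replace("N", "").replace("n", "")
replicates = [resample_choices(data, 10) for _ in range(20)]
```

Whether this beats the index-based version would need to be timed on the real data; no measurement is claimed here.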

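Answer 2: (score: 0)

I suspect the reason it is slow is that every time you call random.choice, it goes over the whole sequence. You could instead go over the sequence once, count how many times each item occurs, and then sample from that distribution, for example using numpy.random.choice.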
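The counting idea can be sketched roughly as follows. This is one illustrative reading of the suggestion, not the answerer's code: the name resample_from_counts and the seed are assumptions. It relies on the fact that drawing positions uniformly with replacement is equivalent to drawing letters independently with probabilities equal to their observed frequencies.

```python
from collections import Counter

import numpy as np

def resample_from_counts(seq, num, rng):
    """Sample num letters from the empirical letter frequencies of seq."""
    cleaned = seq.replace("N", "").replace("n", "")
    # One pass over the sequence to count each letter.
    counts = Counter(cleaned)
    letters = sorted(counts)
    probs = np.array([counts[c] for c in letters], dtype=float)
    probs /= probs.sum()
    # Sample ASCII codes so the result converts back to a string cheaply.
    codes = np.array([ord(c) for c in letters], dtype=np.uint8)
    drawn = rng.choice(codes, size=num, p=probs)
    return drawn.tobytes().decode("ascii")

rng = np.random.default_rng(0)  # seed chosen only for reproducibility (assumption)
sample = resample_from_counts("ATGCAGTCACGT", 10, rng)
```

The counts only need to be computed once per sequence, so for repeated resampling they could be cached outside the function.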