I modified my code to use two files. to_search.txt contains the strings to search for. big_file.fastq contains the lines to be searched; when a string is found (allowing 2 mismatches, exact length in the 8-10 range, no insertions or deletions), the record should be written to the file named after that string. So every string has to be searched for in every record of big_file.fastq (i.e. in line 2 of each 4-line record).
# to_search.txt: (length can be from 8-20 characters)
1 TCCCTTGT
2 ACGAGACT
3 GCTGTACG
4 ATCACCAG
5 TGGTCAAC
6 ATCGCACA
7 GTCGTGTA
8 AGCGGAGG
9 ATCCTTTG
10 TACAGCGC
# 2000 searches needed
# big_file.fastq: 2 billion lines (every 4 lines form one record: the string search is done on the second line of each record).
# Second line can have 100-200 characters
@M04398:19:000000000-APDK3:1:1101:21860:1000 1:N:0:1
TCttTTGTGATCGATCGATCGATCGATCGGTCGTGTAGCCTCCAACCAAGCACCCCATCTGTTCCAAATCTTCTCCCACTGCTACTTGAAGACGCTGAAGTTGAAGGGCCACCTTCATCATTCTGG
+
#8ACCDGGGGGGGGGGGGGEGGGGGGGGDFFEGGG@FFGGGGGGGGGGGGGGGGGCF@<FFGGGGGFE9FGGGFEACE8EFFFGGGGGF9F?CECEFEG,CFGF,CF@CCC?BFFG<,9<9AEFG,,
@M04398:19:000000000-APDK3:1:1101:13382:1000 1:N:0:1
NTCGATCGATCGATCGATCGATCGATCGTTCTGAGAGGTACCAACCAAGCACACCACGGGCGACACAGACAGCTCCGTGTTGAACGGGTTGTTCTTCTTCTTGCCTTCATCATCCCCATCCTCAGTGGACGCAGCTTGCTCATCCTTCCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M04398:19:000000000-APDK3:1:1101:18888:1000 1:N:0:1
NCAGAATGAGGAAGGATGAGCCCCGTCGTGTCGAAGCTATTGACACAGCGCTATTCCGTCTTTATGTTCACTTTAAGCGGTACAAGGAGCTGCTTGTTCTGATTCAGGAACCGAACCCTGGTGGTGTGCTTGGTTGGCAAGTTTACGGCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGE
Below is my code for two mismatches. I tried exact matching first and the speed was acceptable: it takes about a day. I used the Time::Progress module to estimate the run time. With two mismatches allowed it shows 115 days to finish. How can I improve the speed?
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;

open( IN_P1, "big_file.fastq" ) or die "File not found";

my ( @sample_file_names, @barcode1 );

open( BC_FILE, "to_search.txt" ) or die "No barcode file";
my @barcode_file_content = <BC_FILE>;
foreach (@barcode_file_content) {
    chomp $_;
    $_ =~ s/\r//;
    $_ =~ s/\n//;

    #print $_;
    # split on tab, comma or space; because of the capturing group the
    # separator is also returned, so the barcode ends up in $elements[2]
    my @elements = split( /(\t|,| )/, $_ );
    push @sample_file_names, $elements[0];
    push @barcode1,          $elements[2];
}

# open one output file handle per barcode
my @fh_array_R1;
foreach (@sample_file_names) {
    chomp $_;
    local *OUT_R1;
    open( OUT_R1, ">", "$_\.fq" ) or die "cannot write file";
    push @fh_array_R1, *OUT_R1;
}

# file for records whose barcode is not recognised
open( UNKNOWN_R1, ">unknown-barcode_SE.fq" ) or die "cannot create unknown-r1 file";

# read the FASTQ file one record (four lines) at a time
while ( defined( my $firstp1 = <IN_P1> ) ) {
    my $p1_first_line  = $firstp1;
    my $p1_second_line = <IN_P1>;
    my $p1_third_line  = <IN_P1>;
    my $p1_fourth_line = <IN_P1>;
    chomp( $p1_first_line, $p1_second_line, $p1_third_line, $p1_fourth_line );
    my $matched_R1 = "$p1_first_line\n$p1_second_line\n$p1_third_line\n$p1_fourth_line\n";

    my $matched = 0;
    for ( my $j = 0 ; $j < scalar @barcode1 ; $j++ ) {
        chomp $barcode1[$j];
        my $barcode1_regex = make_barcode_fragments( $barcode1[$j] );
        if ( $p1_second_line =~ /$barcode1_regex/i ) {
            # barcode found (up to two mismatches): keep the record
            print { $fh_array_R1[$j] } $matched_R1;
            $matched = 1;
            last;
        }
    }

    # no barcode matched: write the record to the unknown file (once)
    print UNKNOWN_R1 $matched_R1 unless $matched;
}

# build a regex that matches the barcode with up to two mismatches:
# every pair of positions i < j is replaced by \w
sub make_barcode_fragments {
    my ($in1) = @_;
    my @subpats;
    for my $i ( 0 .. length($in1) - 1 ) {
        for my $j ( $i + 1 .. length($in1) - 1 ) {
            my $subpat = join( '',
                substr( $in1, 0, $i ),
                '\\w', substr( $in1, $i + 1, $j - $i - 1 ),
                '\\w', substr( $in1, $j + 1 ),
            );
            push @subpats, $subpat;
        }
    }
    my $pat = join( '|', @subpats );

    #print $pat;
    return "$pat";
}

exit;
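To illustrate what make_barcode_fragments() produces, here is a small example of the generated pattern (my own illustration, using a made-up 4-character barcode instead of one of the real 8-character ones; the sub is the one defined above):

# Hypothetical example: the 2-mismatch alternation built for "ACGT"
my $pat = make_barcode_fragments("ACGT");
print "$pat\n";
# prints: \w\wGT|\wC\wT|\wCG\w|A\w\wT|A\wG\w|AC\w\w
# An 8-character barcode yields 28 such alternatives, and this regex is
# rebuilt for every barcode on every record in the loop above.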
Answer 0 (score: 1)
If the algorithm itself cannot be changed or improved in Perl, you can still get a speedup by writing the time-consuming part in C. Here is an example using Inline C:
use strict;
use warnings;
use Benchmark qw(timethese);
use Inline C => './check_line_c.c';

my $find   = "MATCH1";
my $search = "saasdadadadadasd";

my %sub_info = (
    c    => sub { check_line_c( $find, $search ) },
    perl => sub { check_line_perl( $find, $search ) },
);
timethese( 4_000_000, \%sub_info );

# pure-Perl version: slide $find over $search and return the first window
# whose Hamming distance is at most $max_distance
sub check_line_perl {
    my ( $find, $search ) = @_;
    my $max_distance = 2;

    for my $offset ( 0 .. length($search) - length($find) ) {
        my $substr = substr( $search, $offset, length($find) );
        my $hd = hd( $find, $substr );
        if ( $hd <= $max_distance ) {
            return ( $hd, $substr );
        }
    }
    return ( undef, undef );
}

# Hamming distance of two equal-length strings: XOR gives "\0" wherever the
# bytes agree, so counting the non-NUL bytes counts the mismatches
sub hd {
    return ( $_[0] ^ $_[1] ) =~ tr/\001-\377//;
}
where check_line_c.c is:
void check_line_c( char* find, char* search ) {
    int max_distance = 2;
    int flen     = strlen(find);
    int last_ind = strlen(search) - flen;
    SV *dis   = &PL_sv_undef;
    SV *match = &PL_sv_undef;

    /* slide "find" over "search" and count mismatching characters */
    for ( int ind = 0; ind <= last_ind; ind++ )
    {
        int count = 0;
        for ( int j = 0; j < flen; j++ )
        {
            if ( find[j] ^ search[ind+j] ) count++;
        }
        if ( count <= max_distance )  /* allow up to max_distance mismatches */
        {
            match = newSV(flen);
            sv_catpvn(match, search+ind, flen );
            dis = newSViv(count);
            break;
        }
    }

    /* return ( distance, matched substring ) to Perl via the stack */
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    Inline_Stack_Push(sv_2mortal(dis));
    Inline_Stack_Push(sv_2mortal(match));
    Inline_Stack_Done;
}
The output is (on an Ubuntu laptop with an Intel Core i7-4702MQ CPU @ 2.20GHz):
Benchmark: timing 4000000 iterations of c, perl...
c: 2 wallclock secs ( 0.76 usr + 0.00 sys = 0.76 CPU) @ 5263157.89/s (n=4000000)
perl: 19 wallclock secs (18.30 usr + 0.00 sys = 18.30 CPU) @ 218579.23/s (n=4000000)
So this gives a speedup of roughly 24x for this case.
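For reference, here is a rough sketch (not part of the benchmark above, just an assumption of how it could be wired in) of using check_line_c() in place of the regex test inside the question's while-loop:

# Hypothetical integration: replace the regex alternation with the C helper.
# Assumes @barcode1, @fh_array_R1, UNKNOWN_R1 and $matched_R1 from the question.
my $matched = 0;
for my $j ( 0 .. $#barcode1 ) {
    my ( $dis, $hit ) = check_line_c( $barcode1[$j], $p1_second_line );
    if ( defined $dis ) {    # a window with at most 2 mismatches was found
        print { $fh_array_R1[$j] } $matched_R1;
        $matched = 1;
        last;
    }
}
print UNKNOWN_R1 $matched_R1 unless $matched;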
Answer 1 (score: 0)
I'd suggest creating a really bad hash algorithm: something simple, reversible and weak, like the sum of the characters, or perhaps the sum of the unique values (1-4) represented by the characters.
Compute the sum for the target string, and work out the maximum difference that is still allowed. That is, if the target is allowed to match with two substitutions, what is the biggest possible change in the sum? (4-1 + 4-1 = 6).
Then, for each window of the appropriate length in the text of the target data file, keep a running score. (Append one character at the end, drop one from the front, and update the hash score.) If a window's score is within the allowed range, you can investigate it further.
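For example, a minimal sketch of that running-score filter (my own illustration, with assumed base values A=1, C=2, G=3, T=4 and the first barcode plus a shortened read from the question):

#!/usr/bin/perl
use strict;
use warnings;

# Cheap rolling-sum prefilter: only windows whose score is close enough to
# the target's score are handed on to the expensive 2-mismatch check.
my %val = ( A => 1, C => 2, G => 3, T => 4, N => 0 );    # assumed base values

my $target = 'TCCCTTGT';                        # barcode 1 from to_search.txt
my $read   = 'TCttTTGTGATCGATCGGTCGTGTAGCC';    # shortened example read

my $k          = length $target;
my $target_sum = 0;
$target_sum += $val{$_} // 0 for split //, uc $target;
my $max_diff = 2 * ( 4 - 1 );                   # two substitutions, worst case

my @bases = split //, uc $read;
my $sum   = 0;
$sum += $val{ $bases[$_] } // 0 for 0 .. $k - 1;    # score of the first window

for my $i ( 0 .. @bases - $k ) {
    if ( abs( $sum - $target_sum ) <= $max_diff ) {
        # candidate window: run the exact 2-mismatch comparison here
        printf "candidate at offset %d: %s\n", $i, substr( uc($read), $i, $k );
    }
    last if $i + $k >= @bases;
    $sum += ( $val{ $bases[ $i + $k ] } // 0 ) - ( $val{ $bases[$i] } // 0 );
}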
You might want to implement this as separate passes, possibly even as separate stages of a shell pipeline or script. The idea is that you can parallelize parts of the search. (For instance, one process could search for all target strings of the same length, since the hash window is the same.)
Of course, it also helps that you can keep the work from the early stages if your program crashes in a later one, and you can even have the early parts of the pipeline running while you are still developing the final stage.