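Modify a sequence using SNP positions and write the output to the same file

Time: 2012-10-29 15:26:36

Tags: perl sed awk

I have two files: one with position information and one with sequence information. I need to read each position, take the SNP at that position, replace the base at that position in the sequence with the SNP information, and write the result into the SNP information file. For example:

The SNP file contains: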

10 A C A/C

The sequence file contains:

ATCGAACTCTACATTAC

Here the 10th base is T, so I replace T with [A/C], and the final output should be:

10 A C A/C ATCGAACTC[A/C]ACATTAC
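
The single replacement described above can be sketched in Perl as follows. `replace_base` is a hypothetical helper name that does not appear in the original post; the only real call is Perl's built-in four-argument `substr`, whose offsets are 0-based, so the 1-based SNP position is reduced by one. The sketch assumes the position lies inside the sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the base at a 1-based position with a
# marker such as "[A/C]" and return the modified copy of the sequence.
sub replace_base {
    my ($seq, $pos, $marker) = @_;
    substr($seq, $pos - 1, 1, $marker);   # substr() offsets are 0-based
    return $seq;
}

print replace_base('ATCGAACTCTACATTAC', 10, '[A/C]'), "\n";
```

With the example sequence this prints ATCGAACTC[A/C]ACATTAC, the sequence part of the line shown above.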

Sample files

SNP file:

SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G

Sequence:

Sequence1   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

Final output:

SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

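When substituting the SNP from the Ref and Alt columns, the two bases inside the brackets must follow the order {A, T, C, G}: whichever of Ref and Alt comes first in that order is written first. For example, Ref G and Alt C become [C/G].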
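
A hash makes this ordering rule short to express. The sketch below is only an illustration: `%rank` and `order_alleles` are hypothetical names, and returning "-/-" for anything other than A, T, C or G is an assumption modeled loosely on the fallback in the code further down.

```perl
use strict;
use warnings;

# Rank of each base in the required order A, T, C, G (lower rank prints first).
my %rank = (A => 0, T => 1, C => 2, G => 3);

# Return "[X/Y]" with the two bases sorted by %rank, or "-/-" when either
# value is not one of A, T, C, G (assumed fallback).
sub order_alleles {
    my ($ref, $alt) = @_;
    return '-/-' unless exists $rank{$ref} && exists $rank{$alt};
    my ($first, $second) = sort { $rank{$a} <=> $rank{$b} } ($ref, $alt);
    return "[$first/$second]";
}

print order_alleles('G', 'C'), "\n";
print order_alleles('C', 'T'), "\n";
```

For the sample rows this gives [C/G] for G/C and [T/C] for C/T, matching the expected output above.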

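The other requirement: for a given SNP position, if any other SNP lies within 10 bases of it, that other SNP position must be replaced with 'N'. In the example above, the first two positions (10 and 19) differ by 9, so in each of those two rows the other position is replaced with 'N'.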
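
One possible way to implement the masking is sketched below. `mark_snp` is a hypothetical name, and the window is treated as inclusive (a difference of 10 or less is masked), which is an assumption: the example only shows that a difference of 9 is masked. The sketch also assumes every position lies inside the sequence. Neighbours are masked first because replacing one base with 'N' keeps the length unchanged, so the offsets stay valid until the longer bracket marker is inserted.

```perl
use strict;
use warnings;

# Hypothetical helper: build the sequence for one output row. Every other
# SNP position within $window bases of $pos becomes 'N', then the base at
# the 1-based position $pos is replaced by $marker.
sub mark_snp {
    my ($seq, $pos, $marker, $positions, $window) = @_;
    for my $other (@$positions) {
        next if $other == $pos;                  # never mask the SNP itself
        next if abs($other - $pos) > $window;    # too far away, leave as-is
        substr($seq, $other - 1, 1, 'N');        # same length, offsets stay valid
    }
    substr($seq, $pos - 1, 1, $marker);          # length-changing edit goes last
    return $seq;
}

my $seq = 'CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT';
my @positions = (10, 19, 30, 42);
print mark_snp($seq, 10, '[A/C]', \@positions, 10), "\n";
print mark_snp($seq, 30, '[T/C]', \@positions, 10), "\n";
```

The first call masks position 19 and reproduces the sequence in the row for position 10 of the expected output; the second call has no neighbour within the window, so only position 30 changes.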

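I have written code that prints the bases in the required order and replaces the base at a SNP position in the sequence, but I cannot work out how to look at nearby positions and replace them with N.

My approach may be completely wrong, as I am a beginner at coding. I think this could be done easily with hashes, but I am not very familiar with them. Any suggestions would help. I do not have to stick to Perl only.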

my $input_file = $ARGV[0];
my $snp_file = $ARGV[1];
my $output_file = $ARGV[2];

%sequence_hash = ();

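# "or" binds more loosely than "||", so die() fires when open itself fails
open(SNP, '<', $snp_file) or die "Cannot open $snp_file: $!";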
$indel_count = 0;
$snp_count = 0;
$total_count = 0;

#### hashes and array
@id_array = ();

while ($line = <SNP>) {

    next if $line =~ /^No/;
    $line =~ s/\r|\n//g;


   # if ($line =~ /\tINDEL/) {

    #    $indel_count++;
     #   $snp_type = "INDEL";

    #} else {
     #   $snp_count++;
      #  $snp_type = "SNP";
    #}

    @parts = split(/\t/,$line);

    $id = $parts[0];
    $pos = $parts[1];
    #$ref_base = $parts[3];
    @temp_ref = split(",",$parts[2]);
    $ref_base = $temp_ref[0];
    @alt = split(",",$parts[3]);
    $alt_base = $alt[0];
    $snpformat = '';

    if ($ref_base eq "A" || $alt_base eq "A")
    {

        if ($ref_base eq "A"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "T" || $alt_base eq "T")
    {

        if ($ref_base eq "T"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "C" || $alt_base eq "C")
    {

        if ($ref_base eq "C"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }


    else 
    {$snpformat = "-/-" ;}
    print " $id \t $pos \t $ref_base \t $alt_base \t $snpformat \n  ";
}

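# "or" binds more loosely than "||", so die() fires when open itself fails
open(SEQ, '<', $input_file) or die "Cannot open $input_file: $!";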

$header = '';
$sequence = '';
$num_sequences = 0;

while ($line = <SEQ>) {

    $line =~ s/\r|\n//g;
    next if $line =~ /^\s*$/;   # skip blank lines

    if ($line =~ /^>(.+)/) {
        if ($header eq '') {

            $header = $1;
            $sequence = '';
            next;
        } else {

            $sequence_len = length($sequence);

            $sequence_hash{$header} = $sequence;
            push (@headers,$header);
            #print $header."\t".$sequence_len."\n";
            #print $sequence."\n";
            $num_sequences++;

            $header = $1;
            $sequence = '';

        }


    } else {

        $sequence .= $line;

    }

}
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";

$num_sequences++;

close (SEQ);

$pos = '4';
# substr() offsets are 0-based, so a 1-based SNP position needs $pos - 1
substr($sequence,$pos - 1,1) = "[A/G]";
print $sequence."\n";   
print "$pos \n";

3 Answers:

Answer 0 (score: 1)

This awk script may help you get the result you want.

awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR { 
    a[++i]=$0
    next 
} 
FNR>1 { 
    x=substr(a[i],1,($1-1))
    z=substr(a[i],($1+1))
    if ($2=="A") {
        y="["$2"/"$3"]"
    } 
    else if ($2=="T" && $3=="A") {
        y="["$3"/"$2"]"
    }
    else if ($2=="C" && ($3=="A" || $3=="T")) {
        y="["$3"/"$2"]"
    }
    else if ($2=="G" && ($3=="A" || $3=="T" || $3=="C")) {
        y="["$3"/"$2"]"
    }
    else 
        # Ref already precedes Alt in A, T, C, G order
        y="["$2"/"$3"]"
    print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp

Test:

[jaypal:~/temp] cat sequence
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

[jaypal:~/temp] cat snp
SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G

[jaypal:~/temp] awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR { 
    a[++i]=$0
    next 
} 
FNR>1 { 
    x=substr(a[i],1,($1-1))
    z=substr(a[i],($1+1))
    if ($2=="A") {
        y="["$2"/"$3"]"
    } 
    else if ($2=="T" && $3=="A") {
        y="["$3"/"$2"]"
    }
    else if ($2=="C" && ($3=="A" || $3=="T")) {
        y="["$3"/"$2"]"
    }
    else if ($2=="G" && ($3=="A" || $3=="T" || $3=="C")) {
        y="["$3"/"$2"]"
    }
    else 
        # Ref already precedes Alt in A, T, C, G order
        y="["$2"/"$3"]"
    print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp
SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

Answer 1 (score: 0)

I'm not a Perl expert, but I think this will do:

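#!/usr/bin/perl
use strict;
use warnings;

# Read the first line of the sequence file and strip the leading name
open(my $seq_fh, '<', "seq") or die "Cannot open seq: $!";
my $seq = <$seq_fh>;
close($seq_fh);
chomp($seq);
$seq =~ s/.* //;

print "SNP Ref Alt Output\n";
open(my $snp_fh, '<', "snp") or die "Cannot open snp: $!";
my $header = <$snp_fh>;   # skip the header line
while(<$snp_fh>)
{
    my($line) = $_;
    chomp($line);
    my ($loc, $ref, $alt) = split(/ +/, $line);
    my $outString = $seq;
    # substr() offsets are 0-based, so the 1-based SNP position needs -1
    substr($outString, $loc-1, 1, "[$ref/$alt]");
    print $loc."  ".$ref."   ".$alt."   ".$outString."\n";
}
close($snp_fh);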

Answer 2 (score: 0)

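A=1;T=2;C=3;G=4
echo "SNP Ref Alt Output"
# Skip the header line of the snp file, then process one SNP per line
tail -n +2 snp | while read l1 l2 l3; do
    lp=$(($l1 - 1))
    # Look up each base's rank (A=1, T=2, C=3, G=4) by indirect expansion
    ol2=${!l2}; ol3=${!l3}
    if (( ol2 > ol3 )); then 
        ol2=$l3 && ol3=$l2; 
    else 
        ol2=$l2 && ol3=$l3; 
    fi
    sed "s@[^ ]* \(.\{$lp\}\).\(.*\)@$l1  $l2   $l3   \1[$ol2\/$ol3]\2@" sequence
done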

Output

SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT