我有两个文件,一个有位置信息,另一个是序列信息。现在我需要读取位置并在位置上取snps并用序列中的snp信息替换该位置基数并将其写入snp信息文件中...例如
Snp文件包含
10 A C A/C
序列文件包含
ATCGAACTCTACATTAC
这里第10个元素是T所以我用[A / C]替换T所以最终输出应该是
10 A C A/C ATCGAACTC[A/C]ACATTAC
示例文件
Snp文件
SNP Ref Alt
10 A C
19 G C
30 C T
42 A G
序列:
序列1 CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
最终输出:
SNP Ref Alt Output
10 A C CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19 G C CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30 C T CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42 A G CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
在这里从Ref和Alt列替换snps时,我们需要考虑{A,T,C,G}的顺序,就像[Ref / Alt]一样,第一个基数应该是A或T或C然后是那个订单。
另一件事是如果我们采用snp位置,并且如果有10个碱基差异的任何snps,我们需要用“N”替换该snp位置。在上面的例子中,在前两个位置,因为差值是9,我们用'N'替换另一个元素。
我已编写代码,按顺序打印位置,并用snp位置替换序列但无法读取附近位置并替换为N.
我的方法可能是完全错误的,因为我是编码的初学者。我认为通过使用哈希,我们可能很容易实现这一点,但我不太熟悉哈希...帮助请一些建议...我不必坚持只有perl,
my $input_file = $ARGV[0];
my $snp_file = $ARGV[1];
my $output_file = $ARGV[2];
%sequence_hash = ();
open SNP, $snp_file || die $!;
$indel_count = 0;
$snp_count = 0;
$total_count = 0;
#### hashes and array
@id_array = ();
while ($line = <SNP>) {
next if $line =~ /^No/;
$line =~ s/\r|\n//g;
# if ($line =~ /\tINDEL/) {
# $indel_count++;
# $snp_type = "INDEL";
#} else {
# $snp_count++;
# $snp_type = "SNP";
#}
@parts = split(/\t/,$line);
$id = $parts[0];
$pos = $parts[1];
#$ref_base = $parts[3];
@temp_ref = split(",",$parts[2]);
$ref_base = $temp_ref[0];
@alt = split(",",$parts[3]);
$alt_base = $alt[0];
$snpformat = '';
if ($ref_base eq "A" || $alt_base eq "A")
{
if ($ref_base eq "A"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
elsif ($ref_base eq "T" || $alt_base eq "T")
{
if ($ref_base eq "T"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
elsif ($ref_base eq "C" || $alt_base eq "C")
{
if ($ref_base eq "C"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
else
{$snpformat = "-/-" ;}
print " $id \t $pos \t $ref_base \t $alt_base \t $snpformat \n ";
}
open SEQ, $input_file ||die $!;
$header = '';
$sequence = '';
$num_sequences = 0;
while ($line = <SEQ>) {
$line =~ s/\r|\n//g;
next if $line =~ //;
if ($line =~ /^>(.+)/) {
if ($header eq '') {
$header = $1;
$sequence = '';
next;
} else {
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";
#print $sequence."\n";
$num_sequences++;
$header = $1;
$sequence = '';
}
} else {
$sequence .= $line;
}
}
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";
$num_sequences++;
close (SEQ);
$pos = '4';
substr($sequence,$pos,1) = "[A/G]";
print $sequence."\n";
print "$pos \n";
答案 0 :(得分:1)
这个awk
脚本可能会帮助您获得所需的结果。
awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR {
a[++i]=$0
next
}
FNR>1 {
x=substr(a[i],1,($1-1))
z=substr(a[i],($1+1))
if ($2=="A") {
y="["$2"/"$3"]"
}
else if ($2=="T" && $3=="A") {
y="["$3"/"$2"]"
}
else if ($2=="C" && ($2=="A" || $2=="T")) {
y="["$3"/"$2"]"
}
else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
y="["$3"/"$2"]"
}
else
y="["$3"/"$2"]"
print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp
[jaypal:~/temp] cat sequence
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
[jaypal:~/temp] cat snp
SNP Ref Alt
10 A C
19 G C
30 C T
42 A G
[jaypal:~/temp] awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR {
a[++i]=$0
next
}
FNR>1 {
x=substr(a[i],1,($1-1))
z=substr(a[i],($1+1))
if ($2=="A") {
y="["$2"/"$3"]"
}
else if ($2=="T" && $3=="A") {
y="["$3"/"$2"]"
}
else if ($2=="C" && ($2=="A" || $2=="T")) {
y="["$3"/"$2"]"
}
else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
y="["$3"/"$2"]"
}
else
y="["$3"/"$2"]"
print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp
SNP Ref Alt Output
10 A C CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19 G C CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30 C T CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42 A G CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
答案 1 :(得分:0)
我不是perl专家,但我认为这样做会:
#!/usr/bin/perl
open(SEQ, "seq");
my $seq = <SEQ>;
$seq =~ s/.* //;
print "SNP Ref Alt Output\n";
open(SNP, "snp");
<SNP>;# header line
while(<SNP>)
{
my($line) = $_;
chomp($line);
my ($loc, $ref, $alt) = split(/ +/, $line);
my $outString = $seq;
substr($outString, $loc-1, 1, "[$ref/$alt]");
print $loc." ".$ref." ".$alt." ".$outString."\n";
}
答案 2 :(得分:0)
A=1;T=2;C=3;G=4
echo "SNP Ref Alt Output"
while read l1 l2 l3; do
lp=$(($l1 - 1))
eval ol2=\$$l2 && eval ol3=\$$l3
if [[ $ol2 > $ol3 ]]; then
ol2=$l3 && ol3=$l2;
else
ol2=$l2 && ol3=$l3;
fi
sed "s@[^ ]* \(.\{$lp\}\).\(.*\)@$l1 $l2 $l3 \1[$ol2\/$ol3]\2@" sequence
done < snp
<强>输出强>
SNP Ref Alt Output
10 A C CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19 G C CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30 C T CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42 A G CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT