I'm new to programming and trying to solve this problem. I have a file like this:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 77 T C T T T T T
tg93 79 C - C C C - -
tg93 79 C G C C C C G C
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 105 A G A A A A A G A
tg93 108 A G A A A A G A A
tg93 114 T C T T T T T C T
tg93 131 A C A A A A A A A
tg93 136 G C C G C C G G G
tg93 150 CTCTC - CTCTC - CTCTC CTCTC
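In this file, the header columns are:
CHROM - name, POS - position, REF - reference, ALT - alternate, 10_sample.bam to 16_sample.bam - samples.
Now I want to know how many times the letters in the REF and ALT columns occur among the samples. If either of them occurs fewer than two times, I need to remove the row.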
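For example, in the first row I have 'T' in REF and 'C' in ALT. Across the 7 samples there are 5 T's and 2 blanks, and no C, so I need to remove this row.
In the second row, REF is 'C' and ALT is '-'. Across the seven samples we have 3 C's, 2 '-' and 2 blanks, so we keep this row, because both C and '-' occur at least twice. Blanks are always ignored when counting.
The final file after filtering is: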
#CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
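I'm able to read the columns into an array and display them in my code, but I don't know how to start a loop that reads the bases, counts their occurrences, and keeps the matching rows. Can anyone tell me how I should approach this? Or, if you have any sample code I can modify, that would be very helpful.
Answer 0 (score: 2)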
#!/usr/bin/env perl
use strict;
use warnings;
print scalar(<>); # Read and output the header.
while (<>) { # Read a line.
chomp; # Remove the newline from the line.
my ($chrom, $pos, $ref, $alt, @samples) =
split /\t/; # Parse the remainder of the line.
my %counts; # Count the occurrences of sample values.
++$counts{$_} for @samples; # e.g. Might end up with $counts{"G"} = 3.
print "$_\n" # Print line if we want to keep it.
if ($counts{$ref} || 0) >= 2 # ("|| 0" avoids a spurious warning.)
&& ($counts{$alt} || 0) >= 2;
}
Output:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
You included 108 in your desired output, but there is only one instance of ALT among its seven samples.
Usage:
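To make the counting rule concrete, here is a minimal sketch that applies it to two rows from the question's data. The helper name `keep_row` and the explicit `$min` threshold are my own assumptions for illustration, not part of the script above; blank samples are skipped, as the question asks.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the answer above): true when both REF and ALT
# occur at least $min times among the non-blank sample values.
sub keep_row {
    my ($ref, $alt, $min, @samples) = @_;
    my %counts;
    ++$counts{$_} for grep { length } @samples;    # ignore blank samples
    return ($counts{$ref} || 0) >= $min
        && ($counts{$alt} || 0) >= $min;
}

# Position 108: REF A, ALT G. A occurs 6 times, G only once.
my @pos108 = qw(A A A A G A A);
# Position 81: REF A, ALT C. A occurs 4 times, C 3 times.
my @pos81 = qw(A A A A C C C);

print "108: ", (keep_row('A', 'G', 2, @pos108) ? "keep" : "drop"), "\n";    # 108: drop
print "81: ",  (keep_row('A', 'C', 2, @pos81)  ? "keep" : "drop"), "\n";    # 81: keep
```

With a threshold of 2, position 108 is dropped because G appears only once, which is why it is missing from the output.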
perl script.pl file.in >file.out
Or in-place:
perl -i script.pl file
Answer 1 (score: 0)
Here is an approach that does not assume the fields are separated by tabs:
use strict;
use warnings;
use IO::All;
my $chrom = "tg93";
my @lines = io('file.txt')->slurp;
foreach(@lines) {
my %letters = ();
# use regex with backreferences to extract data - this method does not depend on tab separated fields
if(/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {
# initialize hash counts
$letters{$1} = 0;
$letters{$2} = 0;
# loop through the samples and increment the counter when matches are found
foreach($3, $4, $5, $6, $7, $8, $9) {
if ($_ eq $1) {
++$letters{$1};
}
if ($_ eq $2) {
++$letters{$2};
}
}
# if the counts for both REF and ALT are greater than or equal to 2, print the line
if($letters{$1} >= 2 && $letters{$2} >= 2) {
print $_;
}
}
}
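The explicit counting loop above could also be written with `grep` in scalar context, which returns the number of matching elements. This is only a sketch: `count_of` is a hypothetical helper name, and the position of the two blank samples in the example row is an assumption, since the question only says that position 79 has 3 C's, 2 '-' and 2 blanks.

```perl
use strict;
use warnings;

# Hypothetical helper: number of samples exactly equal to $value.
sub count_of {
    my ($value, @samples) = @_;
    return scalar grep { $_ eq $value } @samples;
}

# Position 79 from the question: REF C, ALT -.
# Where the two blank samples sit is assumed here.
my @samples = ('C', 'C', 'C', '-', '-', '', '');
my ($ref, $alt) = ('C', '-');

if (count_of($ref, @samples) >= 2 && count_of($alt, @samples) >= 2) {
    print "keep\n";
}
```

Because the comparison is an exact string match against REF or ALT, blank samples never contribute to either count.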