Reading a tab-delimited file, counting occurrences, and deleting rows

Time: 2012-09-18 16:18:19

Tags: perl awk

I am new to programming and am trying to solve this problem. I have a file like this.

CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    77  T   C   T   T   T   T           T
tg93    79  C   -   C       C   C   -   -   
tg93    79  C   G   C   C   C   C   G       C
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    105 A   G   A   A   A   A   A   G   A
tg93    108 A   G   A   A   A   A   G   A   A
tg93    114 T   C   T   T   T   T   T   C   T
tg93    131 A   C   A   A   A   A   A   A   A
tg93    136 G   C   C   G   C   C   G   G   G
tg93    150 CTCTC   -       CTCTC       -   CTCTC       CTCTC

In this file, the headers are:

CHROM - name, POS - position, REF - reference, ALT - alternate, 10_sample.bam to 16_sample.bam - sample IDs

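Now I want to know how many times the letters in the REF and ALT columns occur among the sample columns. If either of them occurs fewer than two times, I need to delete that row.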

For example, in the first row REF is 'T' and ALT is 'C'. Among the 7 samples there are 5 T's and 2 blanks, but no C, so this row has to be deleted.

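In the second row, REF is 'C' and ALT is '-'. Among the seven samples there are 3 C's, 2 '-' and 2 blanks, so we keep this row, because C and '-' each occur at least 2 times.

Blanks are always ignored when counting.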

The final file after filtering is

#CHROM   POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

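I was able to read the columns into an array and display them in my code, but I don't know how to set up a loop that reads the bases, counts their occurrences, and keeps the lines that qualify. Can anyone tell me how I should approach this? Or, if you have any sample code that I could modify, that would be very helpful.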

2 answers:

Answer 0: (score: 2)

#!/usr/bin/env perl
use strict;
use warnings;

print scalar(<>);                   # Read and output the header.

while (<>) {                        # Read a line.
   chomp;                           # Remove the newline from the line.
   my ($chrom, $pos, $ref, $alt, @samples) =
      split /\t/;                   # Parse the remainder of the line.

   my %counts;                      # Count the occurrences of sample values.
   ++$counts{$_} for @samples;      # e.g. Might end up with $counts{"G"} = 3.

   print "$_\n"                     # Print line if we want to keep it.
      if ($counts{$ref} || 0) >= 2  # ("|| 0" avoids a spurious warning.)
      && ($counts{$alt} || 0) >= 2;
}

Output:

CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

You included 108 in your desired output, but that row has only one instance of ALT among its seven samples.
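To check a row like 108 by hand, it can help to print the two counts the filter compares. The following is only a minimal sketch, not part of the script above: `ref_alt_counts` is a hypothetical helper name, and the two rows are rebuilt from the question's data on the assumption that the real file is tab-separated and that blank cells are empty fields.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer's script): for one
# tab-separated data line, return how many samples equal REF and
# how many equal ALT.
sub ref_alt_counts {
    my ($line) = @_;
    chomp $line;

    # The -1 limit keeps trailing empty fields (blank samples).
    my ($chrom, $pos, $ref, $alt, @samples) = split /\t/, $line, -1;

    my %counts;
    for my $sample (@samples) {
        next if $sample eq '';          # Blank cells are ignored.
        ++$counts{$sample};
    }
    return ($counts{$ref} || 0, $counts{$alt} || 0);
}

# Rows 108 and 79 (the "C / -" row) from the question, with explicit tabs.
my $row_108 = join "\t", qw(tg93 108 A G A A A A G A A);
my $row_79  = join "\t", 'tg93', 79, 'C', '-', 'C', '', 'C', 'C', '-', '-', '';

printf "108: REF=%d ALT=%d\n", ref_alt_counts($row_108);   # 108: REF=6 ALT=1
printf " 79: REF=%d ALT=%d\n", ref_alt_counts($row_79);    #  79: REF=3 ALT=2
```

Row 108 is dropped because its ALT count is below 2, while row 79 is kept because both counts reach 2.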

Usage:

perl script.pl file.in >file.out

Or in place:

perl -i script.pl file

Answer 1: (score: 0)

Here is an approach that does not assume the fields are separated by tabs:

use strict;
use warnings;
use IO::All;
my $chrom = "tg93";
my @lines = io('file.txt')->slurp;
foreach(@lines) {
    my %letters = ();                # fresh counts for each line

    # use regex with backreferences to extract data - this method does not depend on tab separated fields
    if(/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {

        # initialize hash counts
        $letters{$1} = 0;
        $letters{$2} = 0;

        # loop through the samples and increment the counter when matches are found
        foreach($3, $4, $5, $6, $7, $8, $9) {
            if ($_ eq $1) {
                ++$letters{$1};
            }
            if ($_ eq $2) {
                ++$letters{$2};
            }
        } 

        # if the counts for both REF and ALT are greater than or equal to 2, print the line
        if($letters{$1} >= 2 && $letters{$2} >= 2) {
            print $_;
        }
    }
}