How to isolate blocks of data from a file

Time: 2013-12-19 14:06:54

Tags: perl

I have a file like this:

a score=-120.0
s Chicken.chr22      947 4 +   4081097 tgag
s Turkey.chrZ   31560312 4 -  81011772 ttct
s Mallard.apl2   2559751 4 - 153042893 TTCG

a score=61344.0
s Chicken.chr22                            951 15 +   4081097 c------tgggtgaagcactg
s Turkey.chrZ                         31560316 15 -  81011772 t------tgggtaaggaactg
s Mallard.apl2                         2559755 15 - 153042893 T------TGGGTTAGAAACTG
s Rock_pigeon.scaffold637               370291 15 +    418352 G------AGGGTCAGTTTCTG
s Common_cuckoo.scaffold569             739303 15 +   1009149 C------TGGGTTGAAAACTG
s Anna_s_hummingbird.scaffold44        3039342 15 -  10500161 C------TGGGTTAAACACTG
s Hoatzin.scaffold186                    66281 15 +    155126 C------TGGATAAAGAACTG
s Emperor_penguin.Scaffold155          7152296 15 -   9595628 C------TGGGTAAAAAATTG
s Adelie_penguin.scaffold207            570235 15 -   3061884 C------TGGGTCAAAAACTG
s Crested_ibis.scaffold108            24271571 15 -  27015053 C------TGAGTAAAAACCTG
s Little_egret.scaffold238              365328 14 +   1015180 -------TGGGTTAAAAACTG
s Peregrine_falcon.scaffold41_1        3239034 14 -   3351735 -------TGGGTTAAAAGCTG
s Budgerigar.megascaffold18            4987476 14 +  17573940 -------TGGATAAAGAACTG
s Golden_collared_manakin.scaffold312  1652783 16 +   1993610 A-----CAGGGTTAGGAACTG
s Downy_woodpecker.scaffold1064           9341 21 -    117330 AGTGAGGTGGATTGTGAACTG

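Each block begins with a line starting with a, followed by lines starting with s. A blank line then separates one block from the next.

Unfortunately, the blocks contain different numbers of s lines.

I want to collect (from different files with the same format) the blocks that have a first line starting with a and whose number of s lines equals a number I pass as an argument.

I wrote the following script, but it doesn't work. Can anyone help me?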

#!/usr/bin/perl
use strict;
#use warnings;

use POSIX;

my $maf     = $ARGV[0];
my $species = $ARGV[1];

#It filters the maf file. takes the blocks with all the species

open my $maf_file, $maf or die "Could not open $maf: $!";
my $count = 0;
my @array;

while (my $mline = <$maf_file>) {

  next if /^\s*#/;    #to avoid some lines with comments

  if ($mline =~ /^a/) {
    push(@array, $mline);
  }

  if ($mline =~ /^s/) {

    until ($mline != ~/\s/) {
      push(@array, $mline);
      $count += 1;
    }

    foreach (@array) {

      if ($count == $species) {
        print "$_\n";
      }
    }

    undef(@array);

  }

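3 Answers:

Answer 0 (score: 1)

If you have a file organized in blocks, you can often change Perl's input record separator ($/) so that each read returns a whole block instead of a single line. Here is a general sketch.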

# You should enable these.
use strict;
use warnings;

# Change the input record separator.
# You typically want to make this change within a subroutine or other narrowly
# scoped location within your program.
local $/ = "\n\n";

while (my $block = <>){
    my @lines = split /\n/, $block;

    # Do stuff with the lines in a block.
}
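
Applied to the question, the sketch above might be filled in roughly as follows. This is only a minimal sketch under stated assumptions: the subroutine name blocks_with_n_species and the sample data are made up for illustration, and it assumes blocks are separated by exactly one blank line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the blocks read from $fh that contain exactly
# $wanted lines starting with "s".
sub blocks_with_n_species {
    my ($fh, $wanted) = @_;
    local $/ = "\n\n";    # one blank-line-separated block per read
    my @kept;
    while (my $block = <$fh>) {
        my @lines   = split /\n/, $block;      # trailing empty fields are dropped
        my $s_count = grep { /^s\s/ } @lines;  # grep in scalar context counts matches
        push @kept, join("\n", @lines) if $s_count == $wanted;
    }
    return @kept;
}

# Made-up sample data in the same layout as the question's file.
my $sample = <<'END';
a score=-120.0
s Chicken.chr22 947 4 + 4081097 tgag
s Turkey.chrZ 31560312 4 - 81011772 ttct

a score=10.0
s Chicken.chr22 951 4 + 4081097 tggg
END

# Read from an in-memory filehandle so the example needs no external file.
open my $fh, '<', \$sample or die "Could not open in-memory file: $!";
my @blocks = blocks_with_n_species($fh, 2);
close $fh;

print "$_\n\n" for @blocks;
```

Counting only the lines that match /^s\s/ keeps the a line out of the tally, so the number passed in is the number of species rather than the number of lines in the block.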

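Answer 1 (score: 0)

You haven't really asked a question, so it is hard to offer much help. But if all you want is to put each block into a separate array element, that is very simple. You just set $/ to the empty string, which puts Perl into "paragraph mode".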

open my $maf_file, '<', $maf or die "Could not open $maf: $!";
my @blocks;

{
  local $/ = ''; # always localise changes to Perl's special variables
  @blocks = <$maf_file>;
}
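
Once the blocks are in an array, picking out the ones with a given number of s lines could look something like the sketch below. The sample data and the variable names $wanted and @selected are assumptions for illustration only. In paragraph mode a run of several blank lines counts as a single separator, so the stray extra blank line in the sample does not produce an empty block.

```perl
use strict;
use warnings;

# Made-up sample data; note the two blank lines between the blocks.
my $sample = "a score=1.0\ns A.chr1 1 4 + 100 acgt\ns B.chr1 5 4 + 100 acgt\n\n\n"
           . "a score=2.0\ns A.chr1 9 4 + 100 ttga\n";

# Read from an in-memory filehandle so the example needs no external file.
open my $fh, '<', \$sample or die "Could not open in-memory file: $!";
my @blocks;
{
    local $/ = '';    # paragraph mode: runs of blank lines act as one separator
    @blocks = <$fh>;
}
close $fh;

my $wanted = 2;       # hypothetical: the species count passed as an argument

# Count the lines that start with "s" in each block and keep exact matches.
my @selected = grep {
    my $s_count = () = /^s\s/mg;
    $s_count == $wanted;
} @blocks;

print @selected;
```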

Answer 2 (score: 0)

I believe I have solved it, based on FMc's help. Thank you very much!

#!/usr/bin/perl

use strict;
use warnings;

my ($maf, $species) = @ARGV;

# Fall back to 1 when the species count is missing, empty, or 0.
$species = 1 unless $species;

open my $fh, '<', $maf or die "Could not open $maf: $!";

# Read one blank-line-separated block per iteration.
local $/ = "\n\n";

while (my $block = <$fh>) {
    # Split the block that was just read, rather than reading another
    # record with <>.
    my @lines = split /\n/, $block;

    # A matching block has one "a" line plus $species "s" lines.
    next unless @lines == $species + 1;

    print "$_\n" for @lines;
    print "\n";
}

close $fh;