在perl

时间:2016-11-06 04:40:48

标签: perl sorting pointers

我正在尝试在perl中使用某种指针,以便我可以查看两个按字母顺序排序的两个文件,如果它们在第一列中具有相同的名称,则匹配两个文件中的内容。我正在搜索每个文件的方式是我正在查看第一列按字母顺序排在哪一行,然后将该文件上的指针移动到下一行。有点类似于合并排序中的指针。下面的代码是我想要的一个例子。

使用这两个文件。

SET1

 apple  17  20
 boombox  23  29
 carl  25  29
 cat  22  33
 dog  27  44

SET2

 ants  yes
 boombox  no
 carl  yes
 dentist  yes
 dice  no
 dog  no

我可以创建一个类似这样的脚本

($name, $affirmation) = first line in set2; #part I'm confused about I just kind of need some sort of command of something that will do this
while (<>){
        @set1 = split;
        while ($name < set1[0]){
             ($name, $affirmation) = next line in set2;  # part i'm confused about I just kind of need some sort of command of something that will do this
        }
        if ($name = $set[0]{
               print @set1, $affirmation;
        }

这是我如何运行它

./script.txt set1

我最终会以

结束
boombox  23  29  no
carl  25  29  yes
dog  27  44  no

编辑:

我在一些答案中尝试了一些代码,看看我是否可以用它来制作一些功能代码,但我似乎遇到了问题,而且答案中的一些语法我无法理解所以我有找出如何调试或解决这个问题很麻烦。

这是我使用以下两个文本文件的具体示例

的text.txt

Apples 0       -1      -1      0       0       0       0       -1 
Apricots 0       1      1      0       0       0       0       1
Fruit        0       -1      -1      0       0       0       0       -1
Grapes        0       -2      -1      0       0       0       0       -2
Oranges   0       1      1      0       0       0       0       -1
Peaches  0       -2      -1      0       0       0       0       -2

text2.txt

Apples      CHR1    +       1167628 1170420 1       1       N
Apricots      CHR1    -       2115898 2144159 1       1       N
Oranges       CHR1    -       19665266        19812066        1       1      N
Noidberry     CHR1    -       1337728 1329993 1       1       N
Peaches       CHR1    -       1337275 1342693 1       1       N

这个脚本 script.pl

#!/usr/bin/perl
use warnings;
my $file_1 = $ARGV[0];
my $file_2 = $ARGV[1];

open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

open(my $single, '>', 'text.txt');
open(my $deep, '>', 'text2.txt');
OUTER: while (my $outer = <$fh1>){
        chomp $outer;
        @CopyNumber = split(' ', $outer);
        ($title, $title2) = split('\|', $CopyNumber[0]);
        #print 'title: ',$title,' title2: ',$title2,"\n";
        my $numLoss = 0;
        my $deepLoss = 0;
        for ($i = 1; $i <= $#CopyNumber; $i++){
                #print "$CopyNumber[$i], $#CopyNumber, $i, \n";
                if ($CopyNumber[$i] < 0){
                        $numLoss = $numLoss + 1;
                        if ($CopyNumber[$i] <-1){
                                $deepLoss = $deepLoss + 1;
                        }
                }
        }
        if ($GeneSym and (($GeneSym cmp $title)==0)){ #or (($GeneSym cmp $title2)==0))){
                print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
                print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
                next OUTER;
        }

        INNER: while (my $inner = <$fh2>){
                ($GeneSym, $Chrom, $Strand, $Start, $Stop, $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split(' ', $inner);
                $Chrom =~ s/CHR/hs/ee;
                my $cmp = ($GeneSym cmp $title);
                next OUTER if $cmp < 0;
                if ($cmp==0){ #or (($GeneSym cmp $title2)==0)){
                        print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
                        print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
                        next OUTER;
                }
        }
}

如果我运行./script.pl text.txt text2.txt我应该打印到Number.txt

//对应于text2.txt的第2,4,5列,最后一列是数字小于0的列的百分比

hs1     1167628     1170420    0.375 //For Apples
hs1     2115898     2144159    0 //For Apricots
hs1     19665266    19812066   0.125 //For Oranges
hs1     1337275     1342693    0.375 //For Peaches

相反,我得到了这个

hs1     1167628 1170420 0.375
hs1     2115898 2144159 0
hs1     1337275 1342693 0.375

所以我只是得到一个错误

hs1     19665266    19812066   0.125 //For Oranges

未打印

2 个答案:

答案 0 :(得分:3)

非常像您所述:使用cmp进行比较,split分为两个术语。

对于FILE1文件的每一行,请查看FILE2文件的行,找到匹配项后退出。一旦FILE2按字母顺序移动到下一行FILE1

use warnings 'all';
use strict;

sub process {
   my ($name, $affirm_1, $affirm_2) = @_;
   print "$name $affirm_1 $affirm_2\n";
}

my $file_1 = 'set1.txt';
my $file_2 = 'set2.txt';

open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

my ($name_2, $affirm_2);
FILE1: while (my $line1 = <$fh1>) {
    chomp $line1;
    my ($name_1, $affirm_1) = split ' ', $line1, 2;

    if ($name_2) {
        my $cmp = $name_1 cmp $name_2;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process($name_1, $affirm_1, $affirm_2);
            next FILE1;
        }
    }

    FILE2: while (my $line2 = <$fh2>) {
        chomp $line2;
        ($name_2, $affirm_2) = split ' ', $line2, 2;
        my $cmp = $name_1 cmp $name_2;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process($name_1, $affirm_1, $affirm_2);
            next FILE1;
        }
    }
}

对一些剩余细节的评论。

一旦FILE2行&#34;过冲,&#34;在FILE1的下一次迭代中,我们需要首先检查该行,然后进入FILE2循环以迭代剩余的行。对于第一个FILE1行,$name_2仍未取消if ($name_2)

针对已修改的帖子进行了更新。

use warnings 'all';
use strict;

sub process_line {
    my ($single, $deep, $rline, $GeneSym, $Chrom, $Start, $Stop) = @_;
    my ($numLoss, $deepLoss) = calc_loss($rline);
    $Chrom =~ s/CHR/hs/;
    print $single (join "\t", $Chrom, $Start, $Stop, $numLoss/$#$rline), "\n";
    print $deep   (join "\t", $Chrom, $Start, $Stop, $deepLoss/$#$rline), "\n";
}

sub calc_loss {
    my ($rline) = @_; 
    my ($numLoss, $deepLoss) = (0, 0); 
    for my $i (1.. $#$rline) {
        $numLoss  += 1  if $rline->[$i] < 0;
        $deepLoss += 1  if $rline->[$i] < -1; 
    }   
    return $numLoss, $deepLoss;
}

my ($Number,  $NumberDeep) = ('Number.txt', 'NumberDeep.txt');
open my $single, '>', $Number      or die "Can't open $Number: $!";
open my $deep,   '>', $NumberDeep  or die "Can't open $NumberDeep: $!";

my ($file_1, $file_2) = ('set1_new.txt', 'set2_new.txt');    
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

my ($GeneSym, $Chrom, $Strand, $Start, $Stop, 
    $MapId, $TotalMap, $AbnormalMerge, $Overlap);

FILE1: while (my $line1 = <$fh1>) {
    next if $line1 =~ /^\s*$/;
    chomp $line1;

    my @line = split ' ', $line1;

    if ($GeneSym) {
        my $cmp = $line[0] cmp $GeneSym;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process_line($single, $deep, \@line,
                         $GeneSym, $Chrom, $Start, $Stop);
            next FILE1;
        }   
    }   

    FILE2: while (<$fh2>) {
        next if /^\s*$/;
        chomp;
        ($GeneSym, $Chrom, $Strand, $Start, $Stop, 
             $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split;
        my $cmp = $line[0] cmp $GeneSym;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process_line($single, $deep, \@line,
                         $GeneSym, $Chrom, $Start, $Stop);
            next FILE1;
        }
    }
}

这会使用给定的示例文件生成所需的输出。我们会采取一些快捷方式,如果评论有用,请告诉我。这里有几个

  • 应该添加很多错误检查。

  • 我假设FILE1的第一个字段按原样使用。否则需要进行更改。

  • 处理分为两个功能,计算是分开的。这不是必需的。

  • $#$rline$rline arrayref的最后一个元素的索引。如果使用@$rline - 1语法过多,例如(0..@$rline-1)

对问题中发布的代码的一些评论:

  • 始终始终,请use warnings;(和use strict;

  • 循环索引最好写成foreach my $i (0..$#array)

  • 正则表达式修饰符/ee非常复杂。这里绝对没有必要。

答案 1 :(得分:2)

你是对的。它与合并排序完全相同,只是输出匹配的行。

sub read_and_parse1 {
   my ($fh) = @_;
   defined( my $line = <$fh> )
      or return undef;

   my ($id, @copy) = split(' ', $line);   # Use split(/\t/, $line) if tab-separated data
   my ($gene_sym) = split(/\|/, $id);

   return [ $gene_sym, @copy ];
}

sub read_and_parse2 {
   my ($fh) = @_;
   defined( my $line = <$fh> )
      or return undef;

   return [ split(' ', $line) ];         # Use split(/\t/, $line) if tab-separated data
}

my $fields1 = read_and_parse1($fh1);
my $fields2 = read_and_parse2($fh2);
while ($fields1 && $fields2) {
   my $cmp = $fields1->[0] cmp $fields2->[0];
   if    ($cmp < 0) { $fields1 = read_and_parse1($fh1); }
   elsif ($cmp > 0) { $fields2 = read_and_parse2($fh2); }
   else {
      my ($gene_sym, @copy) = @$fields1;
      my (undef, $chrom, $strand, $start, $stop, $map_id, $total_map, $abnormal_merge, $overlap) = @$fields2;

      $chrom =~ s/^CHR/hs/;

      my $num_loss  = grep { $_ <  0 } @copy;
      my $deep_loss = grep { $_ < -1 } @copy;

      print($single_fh join("\t", $chrom, $start, $stop, $num_loss/@copy  ) . "\n");
      print($deep_fh   join("\t", $chrom, $start, $stop, $deep_loss/@copy ) . "\n");

      $fields1 = read_and_parse1($fh1);
      $fields2 = read_and_parse2($fh2);
   }
}

输出:

$ cat single.txt
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 19665266 19812066 0.125
hs1 1337275 1342693 0.375

$ cat deep.txt
hs1 1167628 1170420 0
hs1 2115898 2144159 0
hs1 19665266 19812066 0
hs1 1337275 1342693 0.25