Perl code fails to match recursively (nested subs)

Time: 2017-01-17 15:21:24

Tags: perl

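The code below loops through the folders in the "/data/results" directory and matches the name of each .vcf file, located in subfolders two levels down, against the contents of the matrix key file (matrixkey.txt).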

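This only seems to work for the first folder. I printed the contents of @matrix_key for each folder, and they are correct. The code always fails to match in the second folder. This is the line where the match fails:

if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {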

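I tried running it on one folder at a time, and that works fine. I don't understand why it fails when I place multiple folders in /data/results/. Can someone suggest how to correct this? Thanks.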

Here is an example of the directory structure:

/data/results/
    TestFolder1/
        subfolder1/Variants/MD-14-11856_RNA_v2.vcf
        subfolder2/Variants/SU-16-16117_RNA_v2.vcf
        matrix.txt
        matrixkey.txt

    TestFolder2/
        subfolder1/Variants/SU-15-2542_v2.vcf
        subfolder2/Variants/SU-16-16117_v2.vcf
        matrix.txt
        matrixkey.txt

Example of @matrix_key:

Barcode        SampleName
barcode_003    SU-15-2542
barcode-005    MD-14-11856
barcode-002    SU-16-16117
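
For reference, the code below turns rows like these into [barcode, sample name] pairs, ordered with the longest sample name first. A minimal sketch of just that parsing step, assuming an in-memory copy of the rows (the @lines array is a stand-in for reading matrixkey.txt):

```perl
use strict;
use warnings;

# Stand-in for the lines of matrixkey.txt, header included
my @lines = (
    "Barcode        SampleName\n",
    "barcode_003    SU-15-2542\n",
    "barcode-005    MD-14-11856\n",
    "barcode-002    SU-16-16117\n",
);

shift @lines;    # throw away the header line

# Each row becomes [ barcode, sample_name ]; longer sample names sort first
my @matrix_key = sort { length($b->[1]) <=> length($a->[1]) }
                 map  { [ split ] } @lines;

printf "%s => %s\n", $_->[1], $_->[0] for @matrix_key;
```

Sorting by descending length presumably keeps a short sample name from matching before a longer name that contains it.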

Code:

#!/usr/bin/perl
use warnings;
use strict;

use File::Copy qw(move);
use List::Util 'first';
use File::Find;
use File::Spec;
use Data::Dumper;

use File::Basename;
use File::Spec::Functions 'splitdir';

my $current_directory = "/data/results";
my @dirs = grep { -d } glob '/data/results/*';

if (grep -d, glob("$current_directory/*")) {
    print "$current_directory has subfolder(s)\n";
}
else {
    print "there are no folders\n";
    die;
}

my %files;

my @matrix_key = (); 

for my $dir ( @dirs ) { 

    print "the directory is $dir\n";
    my $run_folder = (split '/', $dir)[3];
    print "the folder is $run_folder\n";

    my $key2 = $run_folder;

    # checks if barcode matrix and barcode summary files exist  

    #shortens the folder names and unzips them.

    #check if each sample is present in the matrix file for each folder.
    my $location = "/data/results/".$run_folder;

    my $matrix_key_file = "/data/results/".$run_folder."/matrixkey.txt";

    open my $key, '<', $matrix_key_file or die $!; # key file

    <$key>; # throw away header line in key file (first line)

    @matrix_key = sort { length($b->[1]) <=> length($a->[1]) } 
                  map [ split ], <$key>;
    close $key or die $!;

    print Dumper(@matrix_key) . "===\n\n";

    find({ wanted => \&find_vcf, no_chdir=>1}, $location);
    #find({ wanted => find_vcf, no_chdir=>1}, $location);
}

my $find_vcf = sub {
    #sub find_vcf {
    my $F = $File::Find::name;

    if ($F =~ /vcf$/ ) {
        print "$F\n";

        $F =~ m|([^/]+).vcf$| or die "Can't extract Sample ID";
        my $sample_id = $1; print "the short vcf name is: $sample_id\n";

        if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {
            #the code fails to match sample_id to matrix_key
            #even though it's printed out correctly

            print "$sample_id \t MATCHES $aref->[1]\n";
            print "\t$aref->[1]_$aref->[0]\n\n";

        } else {
            # handle all other possible exceptions

            #print "folder name is $run_folder\n";

            die("The VCF file doesn't match the Summary Barcode file: $sample_id\n");
        }

    }
}

1 answer:

Answer 0: (score: 2)

The posted code seems a bit complicated.

Here is one approach, based on what I understand from the question. It uses File::Find::Rule.

use warnings;
use strict;
use File::Find::Rule;
use List::Util 'any';

my $base_dir = '/data/results';
# Folders directly under $base_dir (mindepth(1) leaves out $base_dir itself)
my @dirs = File::Find::Rule->mindepth(1)->maxdepth(1)->directory->in($base_dir);

foreach my $dir (@dirs)
{
    # Find all .vcf files anywhere in this dir or below
    my @vcf_files = File::Find::Rule->file->name('*.vcf')->in($dir);

    # Remove the path and .vcf extension
    my @names = map { m|.*/(.+)\.vcf$| } @vcf_files;

    # Find all text files to search, right in this folder
    my @files = File::Find::Rule->maxdepth(1)->file->name('*.txt')->in($dir);

    foreach my $file (@files)
    {
        open my $fh, '<', $file  or die "Can't open $file: $!";
        <$fh>;  # drop the header line
        # Get the second field on each line (with SampleName)
        my @samples = map { (split)[1] } <$fh>;
        close $fh;

        # ... search @samples for @names ...
    }
}

Using glob for the non-recursive search above is fine, but given how it treats spaces, it is better to use the core File::Glob replacement for it.
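
A minimal sketch of that replacement, assuming a Perl recent enough to provide the `:bsd_glob` export tag. The temporary directory and the folder names are hypothetical and only there to make the example self-contained:

```perl
use strict;
use warnings;
use File::Glob ':bsd_glob';    # glob() no longer splits patterns on spaces
use File::Temp 'tempdir';

# Hypothetical layout: a run folder whose name contains spaces
my $base = tempdir(CLEANUP => 1);
mkdir "$base/Test Folder 1"            or die "mkdir failed: $!";
mkdir "$base/Test Folder 1/subfolder1" or die "mkdir failed: $!";

# With :bsd_glob the whole string is a single pattern; the default glob
# would treat "Folder" and "1/*" as separate patterns.
my @dirs = grep { -d } glob "$base/Test Folder 1/*";

print "$_\n" for @dirs;
```

Without the import, the same call splits the pattern at each space, so paths with spaces in them have to be quoted inside the pattern string.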

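There are other ways to organize the directory traversal and the file search, and many ways to compare two lists. Please clarify the overall goal so that I can add suitable code for searching the file contents for the names of the VCF files.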
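
Until the goal is clarified, here is one possible sketch of the comparison step. It assumes that a file "matches" when its name contains a sample name, which is what the index() test in the question does. The data is hypothetical and stands in for @names and @samples from the loop above:

```perl
use strict;
use warnings;
use List::Util 'first';

# Hypothetical data standing in for @names and @samples
my @names   = ('MD-14-11856_RNA_v2', 'SU-16-16117_RNA_v2', 'XX-00-0000_v2');
my @samples = ('SU-15-2542', 'MD-14-11856', 'SU-16-16117');

# Try longer sample names first, so that a short name contained in a
# longer one cannot claim the match
my @by_length = sort { length $b <=> length $a } @samples;

my (%match, @unmatched);
for my $name (@names) {
    # First sample name that occurs as a substring of the file name
    my $sample = first { index($name, $_) != -1 } @by_length;
    if (defined $sample) { $match{$name} = $sample }
    else                 { push @unmatched, $name }
}

print "$_ matches $match{$_}\n" for sort keys %match;
print "no sample found for $_\n" for @unmatched;
```

Collecting the unmatched names rather than dying on the first one makes it easier to see every mismatch in a folder at once; whether that is the right failure policy depends on the overall goal.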

Please add checks, adjust the variable names, implement a policy for handling failures, and so on.