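The code below loops over the folders in the /data/results directory and matches the name of each .vcf file, located in a subfolder two levels down, against the contents of the matrix_key file.
This appears to work only for the first folder. I printed the contents of @matrix_key for each folder and they are correct. The code always fails to match for the second folder. This is where it fails to match: if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {
I have tried running one folder at a time and that works fine. I don't understand why it fails when I place multiple folders in /data/results/. Can someone suggest how to correct this? Thanks.
Here is an example of the directory structure: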
/data/results/
    TestFolder1/
        subfolder1/Variants/MD-14-11856_RNA_v2.vcf
        subfoder2/Variants/SU-16-16117_RNA_v2.vcf
        matrix.txt
        matrixkey.txt
    TestFolder2/
        subfolder1/Variants/SU-15-2542_v2.vcf
        subfolder2/Variants/SU-16-16117_v2.vcf
        matrix.txt
        matrixkey.txt
Example of @matrix_key:
Barcode      SampleName
barcode_003  SU-15-2542
barcode-005  MD-14-11856
barcode-002  SU-16-16117
Code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);
use List::Util 'first';
use File::Find;
use File::Spec;
use Data::Dumper;
use File::Basename;
use File::Spec::Functions 'splitdir';

my $current_directory = "/data/results";
my @dirs = grep { -d } glob '/data/results/*';

if (grep -d, glob("$current_directory/*")) {
    print "$current_directory has subfolder(s)\n";
}
else {
    print "there are no folders\n";
    die;
}

my %files;
my @matrix_key = ();

for my $dir ( @dirs ) {
    print "the directory is $dir\n";
    my $run_folder = (split '/', $dir)[3];
    print "the folder is $run_folder\n";
    my $key2 = $run_folder;

    # checks if barcode matrix and barcode summary files exist
    #shortens the folder names and unzips them.
    #check if each sample is present in the matrix file for each folder.
    my $location = "/data/results/".$run_folder;
    my $matrix_key_file = "/data/results/".$run_folder."/matrixkey.txt";
    open my $key, '<', $matrix_key_file or die $!; # key file
    <$key>; # throw away header line in key file (first line)
    @matrix_key = sort { length($b->[1]) <=> length($a->[1]) }
                  map [ split ], <$key>;
    close $key or die $!;
    print Dumper(@matrix_key) . "===\n\n";

    find({ wanted => \&find_vcf, no_chdir=>1}, $location);
    #find({ wanted => find_vcf, no_chdir=>1}, $location);
}

my $find_vcf = sub {
#sub find_vcf {
    my $F = $File::Find::name;
    if ($F =~ /vcf$/ ) {
        print "$F\n";
        $F =~ m|([^/]+).vcf$| or die "Can't extract Sample ID";
        my $sample_id = $1; print "the short vcf name is: $sample_id\n";
        if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {
            #the code fails to match sample_id to matrix_key
            #even though it's printed out correctly
            print "$sample_id \t MATCHES $aref->[1]\n";
            print "\t$aref->[1]_$aref->[0]\n\n";
        } else {
            # handle all other possible exceptions
            #print "folder name is $run_folder\n";
            die("The VCF file doesn't match the Summary Barcode file: $sample_id\n");
        }
    }
}
Answer 0 (score: 2)
The posted code seems a bit complicated.
Here is one approach, going by what I understand from the question. It uses File::Find::Rule:
use warnings;
use strict;
use File::Find::Rule;
use List::Util 'any';

my $base_dir = '/data/results';

# Immediate subdirectories only; mindepth(1) leaves out $base_dir itself
my @dirs = File::Find::Rule->mindepth(1)->maxdepth(1)->directory->in($base_dir);

foreach my $dir (@dirs)
{
    # Find all .vcf files anywhere in this dir or below
    my @vcf_files = File::Find::Rule->file->name('*.vcf')->in($dir);

    # Remove the path and .vcf extension
    my @names = map { m|.*/(.+)\.vcf$| } @vcf_files;

    # Find all text files to search, right in this folder
    my @files = File::Find::Rule ->
        maxdepth(1)->file->name('*.txt')->in($dir);

    foreach my $file (@files)
    {
        open my $fh, '<', $file or die "Can't open $file: $!";
        <$fh>;  # drop the header line

        # Get the second field on each line (with SampleName)
        my @samples = map { (split)[1] } <$fh>;

        # ... search @samples for @names ...
    }
}
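The search step left open in the loop can be filled in several ways. Below is a minimal sketch, assuming that a sample name from the key file is expected to appear as a substring of the file name, as in the question's `index` test. The helper `match_names` is a hypothetical name used for illustration, not part of any module.

```perl
use strict;
use warnings;
use List::Util 'first';

# Hypothetical helper: map each file name to the sample name it contains.
# Longer sample names are tried first, so a short name cannot shadow a
# longer one that starts with the same characters.
sub match_names {
    my ($names, $samples) = @_;
    my @by_length = sort { length($b) <=> length($a) } @$samples;
    my %match;
    for my $name (@$names) {
        # first() returns undef when no sample name is found in $name
        $match{$name} = first { index($name, $_) != -1 } @by_length;
    }
    return \%match;
}

# File names and sample names taken from the question's example,
# plus one made-up name that has no counterpart in the key file
my $match = match_names(
    [ 'MD-14-11856_RNA_v2', 'SU-16-16117_RNA_v2', 'XX-00-00000_v2' ],
    [ 'SU-15-2542', 'MD-14-11856', 'SU-16-16117' ],
);

for my $name (sort keys %$match) {
    my $sample = $match->{$name};
    print defined $sample
        ? "$name matches $sample\n"
        : "$name has no match\n";
}
```

Returning the unmatched names with an undefined value, rather than dying inside the helper, leaves the failure policy to the caller.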
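Using glob for the non-recursive search above is fine, but given how it treats spaces it is better to use its replacement from the core File::Glob module instead.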
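As an illustration of the point about spaces, here is a small sketch that lists immediate subdirectories with `bsd_glob` from File::Glob. The helper name `subdirs` is made up for this example. The built-in `glob` splits its argument on whitespace into several patterns, while `bsd_glob` treats the whole string as one pattern.

```perl
use strict;
use warnings;
use File::Glob 'bsd_glob';

# Hypothetical helper: return the immediate subdirectories of $base.
# bsd_glob takes the whole argument as a single pattern, so a space
# in $base is part of the path and not a separator between patterns.
sub subdirs {
    my ($base) = @_;
    my @dirs = grep { -d } bsd_glob("$base/*");
    return sort @dirs;
}

# Prints nothing if the directory does not exist
print "$_\n" for subdirs('/data/results');
```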
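There are other ways to organize the directory traversal and file search, and there are many ways to compare two lists. Please clarify the overall goal so that I can add suitable code for searching the file contents for the .vcf names.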
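One of those other ways, sketched here with the core File::Find module that the question already uses: the callback is an anonymous sub that closes over a lexical array, so the names are collected without a file-level variable. The helper name `vcf_names` is hypothetical.

```perl
use strict;
use warnings;
use File::Find;
use File::Basename 'basename';

# Hypothetical helper: collect the base names (without ".vcf") of all
# .vcf files under $dir. The callback is created on each call and sees
# this call's @names.
sub vcf_names {
    my ($dir) = @_;
    my @names;
    find({
        no_chdir => 1,
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path and $path =~ /\.vcf\z/;
            push @names, basename($path, '.vcf');
        },
    }, $dir);
    return sort @names;
}

my $base_dir = '/data/results';
if (-d $base_dir) {
    print "$_\n" for vcf_names($base_dir);
}
```

Calling the helper once per run folder returns that folder's names, which can then be compared with the sample names read from that folder's key file.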
Please add checks, adjust the variable names, implement a policy for failures, and so on.
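As an example of the kind of checks meant here, the following sketch reads a key file laid out like the one in the question (a header line, then a barcode and a sample name on each line) and dies with a specific message on an empty file or a malformed line. The helper name `read_matrix_key` and the choice to die on bad input are assumptions; a different failure policy may suit the task better.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [barcode, sample_name] pairs.
sub read_matrix_key {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";

    my $header = <$fh>;    # the first line is the header
    die "$file is empty\n" unless defined $header;

    my @rows;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;    # skip blank lines
        my ($barcode, $sample) = split ' ', $line;
        die "$file, line " . $. . ": expected a barcode and a sample name\n"
            unless defined $sample;
        push @rows, [ $barcode, $sample ];
    }
    close $fh or die "Can't close $file: $!";
    return @rows;
}

# Path built from the question's example layout
my $key_file = '/data/results/TestFolder1/matrixkey.txt';
if (-e $key_file) {
    print "$_->[1]_$_->[0]\n" for read_matrix_key($key_file);
}
```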