Question

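I should explain the background to this question: I don't know any Perl, and I have an excessive allergy to regular expressions (we all have our weaknesses). I am trying to figure out why a Perl program won't accept the data I am feeding it. I don't need to understand the program in any depth; I am just doing a timing comparison.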

Consider this assignment statement:

($sample_ls_id) = $sample_ls_id =~ /:\w\w(\d+):/;

If I understand correctly, it checks whether sample_ls_id matches some regular expression and, if so, assigns the whole string, or something like that.

However, I don't understand how this works. According to the documentation (perldoc perlretut, which I only skimmed),

$sample_ls_id =~ /:\w\w(\d+):/

returns true or false depending on whether it matches.

The line I am trying to match looks like

1000    10      0       0       1        urn:lsid:dcc.hapmap.org:Individual:CEPH1000.10:1        urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1

This fails with the error

Use of uninitialized value $sample_ls_id in concatenation (.) or string
at database/populate/family.pl line 38, <INPUT> line 1.

where line 38 is

print OUTPUT "$sample_ls_id\t$family_ped_id\t$individual_ped_id\t$father_ped_id\t$mother_ped_id\t$sex\t$created_by\t$population_code\n";

See the complete script below. However, the apparently very similar line

1420    9       0       0       1       urn:lsid:dcc.hapmap.org:Individual:CEPH1420.09:1  urn:lsid:dcc.hapmap.org:Sample:NA12003:1

seems to pass.

For context, the whole script is:

use strict;
use warnings;
use Getopt::Long;

my $input_file = "data/family_ceu.txt";
my $output_file = "sql/family_ceu.sql";
my $population_code = "CEU";

GetOptions ('i=s' => \$input_file,
            'o=s' => \$output_file,
            'p=s' => \$population_code
            );

usagecheck();

my $created_by = 'gwas_analyzer';

print "Creating SQL file for inserting family data from $input_file\n";

open (INPUT, "< $input_file");
open (OUTPUT, "> $output_file");

print OUTPUT "INSERT INTO population (population_code, private) VALUES ('$population_code', 'f');\n";
print OUTPUT "COPY family (ls_id, family_ped_id, individual_ped_id, father_ped_id, mother_ped_id, sex, created_by, population_code) FROM stdin;                      
";

while (my $line = <INPUT>)
{
    chomp $line;

    #Skip any comment lines 
    next if($line =~ /^#/);

    my ($family_ped_id, $individual_ped_id, $father_ped_id, $mother_ped_id, $sex, $individual_ls_id, $sample_ls_id) = split (/\t/, $line);

    ($sample_ls_id) = $sample_ls_id =~ /:\w\w(\d+):/;

    print OUTPUT "$sample_ls_id\t$family_ped_id\t$individual_ped_id\t$father_ped_id\t$mother_ped_id\t$sex\t$created_by\t$population_code\n";
}

print OUTPUT "\\.\n";
close OUTPUT;

sub usagecheck
{
    if (!$input_file || !$output_file || !$population_code)
    {
        print "Missing argument (see required arguments below):\n";
        usage();
        exit;
    }
}

sub usage
{
    print "perl family.pl -i <input file> -o <output file> -p <population code>\n";
}

If you know regular expressions and Perl, I'm sure this is a very simple question.

Answer 1

$sample_ls_id = 'urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1';

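With the value above, the regex '/:\w\w(\d+):/' fails. This regex matches when the string contains a colon ':' followed by a "word" character '\w', another "word" character '\w', one or more digits '\d+', and a colon ':'. In ':SAMPLE1:' the two word characters 'SA' are followed by 'M', which is not a digit, so nothing matches.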

$sample_ls_id = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';

With this value, the regex '/:\w\w(\d+):/' finds its match ':NA12003:' (a colon, two word characters, digits, and a colon).

my $sample_ls_id = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';
($sample_ls_id) = $sample_ls_id =~ /:\w\w(\d+):/;

'($sample_ls_id)' captures the part matched by '(\d+)' (also stored in $1), which in this case is 12003.

You get the error in the earlier example because the regex fails, so the match returns an empty list and '($sample_ls_id)' is left undefined.
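To see this for yourself, here is a minimal sketch that runs the same regex against the two sample values from the question. The variable names @samples, @results and $digits are only for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# The two sample_ls_id values from the question.
my @samples = (
    'urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1',
    'urn:lsid:dcc.hapmap.org:Sample:NA12003:1',
);

my @results;
for my $ls_id (@samples) {
    # List-context match: yields the captures, or an empty list on failure,
    # in which case $digits ends up undefined.
    my ($digits) = $ls_id =~ /:\w\w(\d+):/;
    push @results, $digits;
    printf "%-45s => %s\n", $ls_id,
        defined $digits ? $digits : '(undef: no match)';
}
```

The first value leaves $digits undefined, which is what later triggers the "uninitialized value" warning when it is interpolated into the print statement.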

Answer 2

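In list context, such as an assignment to ($sample_ls_id), =~ returns the list of captures. That saves you from extracting $1 and so on in a separate statement.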
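As a small illustration of the difference, the sketch below applies the same match once in scalar context and once in list context. The names $matched and $captured are made up for this example.

```perl
use strict;
use warnings;

my $ls_id = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';

# Scalar context: the match yields a true/false value.
my $matched = $ls_id =~ /:\w\w(\d+):/;

# List context (note the parentheses on the left-hand side):
# the match yields the list of captured groups.
my ($captured) = $ls_id =~ /:\w\w(\d+):/;

print "scalar context: ", ($matched ? "matched" : "no match"), "\n";
print "list context:   $captured\n";
```

The only difference between the two statements is the parentheses around the variable being assigned, which is why the original line is easy to misread.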

Answer 3

Rather than storing the string back into itself, just use the capture. The \d+ part is held in $1, so just change the code to the following:

$sample_ls_id =~ /:\w\w(\d+):/; # no letters before implies "match"
$sample_ls_id = $1; # I assume that $1 will be empty if no match, I'm not 100% on this.

I'm not sure why you'd get the error you are getting, but it seems your code would make more sense written as above.
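One caveat on the two-step version: as far as I know, a failed match does not clear $1; it keeps whatever the last successful match in the enclosing scope left there. A hedged sketch of a safer variant checks the result of the match before reading $1. The helper name extract_sample_number is hypothetical and not part of the original script.

```perl
use strict;
use warnings;

my $good = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';
my $bad  = 'urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1';

# A successful match sets $1 ...
$good =~ /:\w\w(\d+):/;
my $after_good = $1;

# ... and a failed match in the same scope leaves it untouched.
$bad =~ /:\w\w(\d+):/;
my $after_bad = $1;

# Hypothetical helper: only trust $1 when the match succeeded.
sub extract_sample_number {
    my ($ls_id) = @_;
    return $ls_id =~ /:\w\w(\d+):/ ? $1 : undef;
}

print "after failed match, \$1 is still: $after_bad\n";
print "guarded lookup on bad input: ",
    defined extract_sample_number($bad) ? "defined" : "undef", "\n";
```

Returning undef explicitly lets the caller decide whether to skip the line, report it, or stop.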

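It might have something to do with the input lacking its last element (i.e. you have A:B:C but you need A:B:C:D in order to store D in sample ls id; if D is missing, the variable is never initialized and the regex has nothing to work on).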

Also, we don't have all the code (line 38 looks like it corresponds to the first line in your while loop); it might help if you posted more of it.

Debugging a Perl assignment
