Question

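I'm looking for advice on extracting the part of a string that always appears as the first instance of data between parentheses, using Perl and a regex, and assigning that value to a variable.

Here is the exact situation: I'm using Perl and a regex to extract a courseID from a university catalog and assign it to a variable. Consider the following:

BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)

The typical format is [course-section-name] [(courseID)] [courseName]

My goal is to create a script that takes each entry one at a time, assigns it to a variable, and then uses a regex to extract only the courseID and assign only the courseID to a variable.

My approach has been to use search and replace to replace everything that doesn't match with '' and then save what is left (the courseID) to the variable. Here are a couple of examples of what I have tried: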

$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ s/[^\d\d\d\d\d]//g;
print $courseID;

Result: 21090112345 --- prints the course-section-name and the courseID

$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ s/[^\b\(\d{5}\)]\b//g;
print $courseID;

Result: 210901(12345) --- prints the course-section-name, the parens, and the courseID

So I haven't had much luck with search and replace, but I did find this nugget:

\(([^\)]+)\)

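on http://regexr.com/, which matches the parenthesized part. However, it also matches every other set of parens, including, for example, (abc).

At this point I'm not sure how to do something like this: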

$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ [magicRegex_goes_here];
print $courseID;

Result: 12345

Or, even better:

$string = "IDS-3250-01 (98765) History of US (1860-2000)";
($courseID = $string) =~ [magicRegex_goes_here];
print $courseID;

Result: 98765

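Any advice or pointers would be greatly appreciated. I have tried everything I know of and can research about regexes to solve this. If I can provide more information, please feel free to ask.

Update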

use warnings 'all';
use strict;
use feature 'say';

my $file = './data/enrollment.csv';      #File this script generates
my $course = "";                         #Complete course string [name-of-course] [(courseID)] [course_name]
my @arrayCourses = "";                   #Array of courseIDs
my $i = "";                              #i in for loop
my $courseID = "";                       #Extracted course ID
my $userName = "";                       #Username of person we are enrolling
my $action = "add,";                     #What we are doing to user
my $permission = "teacher,";             #What permissions to assign to user
my $stringToPrint = "";                  #Concatenated string to write to file
my $n = "\n";                            #\n
my $c = ",";                             #,

#BEGIN PROGRAM

print "Enter the username \n";

chomp($userName = <STDIN>);               #Get the enrollee username from user

print "\n";

print "Enter course name and press enter.  Enter 'x' to end. \n";  #prompt for course names

while ($course ne 'x') {
        chomp($course = <STDIN>);
        if ($course ne "x") {
                if (($courseID) = ($course =~ /[^(]+\(([^)]+)\)/) ) {     #nasty regex to extract courseID - thnx PerlDuck and zdim
                        push @arrayCourses, $courseID;                    #put the courseID into array
                }
                else {
                        print "Cannot process last entry check it";
                }
        }
        else {
                last;
        }
}

shift @arrayCourses;                      #Remove first entry from array - add,teacher,,username

open(my $fh, '>', $file) or die "Can't open $file: $!";   #open file

for $i (@arrayCourses)                    #write array to file
{
        $stringToPrint= join "", $action, $permission, $i, $c, $userName, $n ;
        print $fh $stringToPrint;
}

close $fh;

That did it! Suggestions or improvements are welcome! Thanks @PerlDuck and @zdim

Answer 1

Since you have a definite format

my ($section, $id, $name) = 
    $string =~ /^\s* ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x;

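The key here is the negated character class, [^...], which matches any one character other than those listed after the ^ (which makes it "negated"). Unescaped parentheses capture what they match, except inside a character class [], where they are treated as literals.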

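It first matches all consecutive characters other than (, so up to the first (, and the ( ) pair around that captures it. Then it matches everything other than ), up to the first closing paren, which is also captured by its own ( ) pair. This capture sits between the literal parentheses, \( ... \), which stay outside the capturing ( ) because we don't want them captured. Then all the rest is captured with (.+), which requires at least one character since + means one or more. Note that these can be spaces. We exclude possible leading whitespace from the first capture by matching it explicitly before the capturing parentheses, and we also strip (some of) the possible whitespace around the ID parentheses.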
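To make the walk-through concrete, here is a small sketch that applies the same pattern to one of the sample lines from the question. The sub name parse_course is a hypothetical helper introduced only for this illustration. Printing each capture in brackets shows one detail that is easy to miss: the first capture keeps the space that precedes the opening parenthesis.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): returns the three captures,
# or an empty list when the line does not fit the expected format.
sub parse_course {
    my ($string) = @_;
    return $string =~ /^\s* ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x;
}

my ($section, $id, $name) =
    parse_course('IDS-3250-01 (98765) History of US (1860-2000)');

# Brackets make any whitespace left inside a capture visible.
print "[$section] [$id] [$name]\n";
# prints: [IDS-3250-01 ] [98765] [History of US (1860-2000)]
```

If the trailing space in the section matters, it can be removed afterwards with s/\s+$//.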

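The /x modifier allows whitespace (as well as comments and newlines) inside the pattern, which helps readability. In list context the match operator returns a list of all captures, which we assign to the variables. Note that it still returns a list even if there is only one capture. See the Regular Expressions Tutorial (perlretut).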
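The point about list context can be seen in a minimal sketch like the following, where the variable names are illustrative only. The parentheses around the variable on the left-hand side are what put the match in list context.

```perl
use strict;
use warnings;

my $line = 'BIO-2109-01 (12345) Introduction to Biology';

# List context (note the parentheses around $id): the match returns its captures.
my ($id) = $line =~ /\(([^)]+)\)/;

# Scalar context: the match returns true or false, not the captured text.
my $matched = $line =~ /\(([^)]+)\)/;

print "id = $id\n";                                  # id = 12345
print "matched = ", ($matched ? 'yes' : 'no'), "\n"; # matched = yes
```

Forgetting those parentheses is a common reason for ending up with 1 in the variable instead of the captured text.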

Then, assuming you have the catalog in a file

use warnings 'all';
use strict;
use feature 'say';

my $file = 'catalog.txt';

open my $fh, '<', $file or die "Can't open $file: $!";

while (my $line = <$fh>) 
{
    next if $line =~ /^\s*$/;  # skip empty lines

    # Strip leading and trailing white space
    $line =~ s{^\s*|\s*$}{}g;

    my ($section, $id, $name) = 
        $line =~ /^ ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x
            or do {
                warn "Error with expected format -- ";
                next;
            };

    say "$section, $id, $name";
}
close $fh;

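I use s{}{} as delimiters because s/// with this pattern confuses the markup syntax highlighter. It also makes a good demonstration, since alternative delimiters sometimes help readability.

You can store the retrieved variables in a suitable data structure. Any combination of arrays and hashes (and references to them) comes to mind, depending on what you need to do with them later. See the Cookbook of Data Structures (perldsc).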
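As one possible layout, the sketch below stores each course in a hash keyed by course ID, with a hash reference holding the other two fields. The name %course_by_id and the choice of keys are assumptions made for this illustration, and the lines are hard-coded here instead of being read from the file.

```perl
use strict;
use warnings;

# Sample lines from the question; in practice these would come from the file.
my @lines = (
    'BIO-2109-01 (12345) Introduction to Biology',
    'IDS-3250-01 (98765) History of US (1860-2000)',
);

my %course_by_id;   # hypothetical layout: course ID => { section, name }
for my $line (@lines) {
    my ($section, $id, $name) =
        $line =~ /^ ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x
            or next;
    $section =~ s/\s+$//;   # the first capture keeps the space before '('
    $course_by_id{$id} = { section => $section, name => $name };
}

for my $id (sort keys %course_by_id) {
    print "$id: $course_by_id{$id}{section} - $course_by_id{$id}{name}\n";
}
```

A hash keyed by ID suits lookups by course ID; an array of hash references would be a better fit if the original order of the catalog matters.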

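A note on error handling. Since none of the capture groups use * (which allows zero matches, that is, nothing at all), the pattern won't match at all if any component of your format is not as expected, and we get an error. The .+ is very permissive but still requires something. This is why the trailing whitespace is stripped first, so that whitespace alone cannot satisfy the last pattern, (.+).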
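The effect of stripping trailing whitespace can be checked with a short sketch. The helper matches_format is hypothetical and exists only for this illustration; the test line has a course ID but no course name, only trailing spaces.

```perl
use strict;
use warnings;

# Hypothetical helper for this illustration: true if the line fits the format.
sub matches_format {
    my ($line) = @_;
    return scalar( $line =~ /^ ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x );
}

my $line = 'BIO-2109-01 (12345)   ';   # course name missing, only trailing spaces

# Unstripped, the final (.+) is satisfied by a single trailing space.
print "before stripping: ", (matches_format($line) ? 'match' : 'no match'), "\n";

# Stripped, nothing is left for (.+), so the bad line is rejected.
(my $stripped = $line) =~ s{^\s*|\s*$}{}g;
print "after stripping:  ", (matches_format($stripped) ? 'match' : 'no match'), "\n";
```

The first line reports a match and the second reports no match, which is the behaviour the note above relies on.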

If the only goal is the course ID, and we are certain that the first set of parentheses is the one around it

my ($id) = $line =~ / \(\s* ([^)]+) \) /x  or do { ... };

We now only need to match and capture the middle part, the content inside the parentheses.
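A quick way to convince yourself that only the first parenthesized group is picked up is to run the pattern on the sample lines whose course names also contain parentheses. The sub first_paren_content is a hypothetical wrapper used only in this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text inside the first pair of parentheses,
# or undef if the line has none.
sub first_paren_content {
    my ($line) = @_;
    my ($id) = $line =~ / \(\s* ([^)]+) \) /x;
    return $id;
}

for my $line (
    'IDS-3250-01 (98765) History of US (1860-2000)',
    'SPN-1234-02-F1 (45678) Spanish History (1900-2010)',
) {
    print first_paren_content($line), "\n";   # prints 98765, then 45678
}
```

Without the /g modifier the regex engine stops at the leftmost match, so the year ranges later in the line are never considered.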

Answer 2

#!/usr/bin/env perl

use strict;
use warnings;

while( my $line = <DATA> ) {
    if (my ($courseID) = ($line =~ /[^(]+\(([^)]+)\)/) ) {
        print "course-ID = $courseID; -- line was $line";
    }
}

__DATA__
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)

Output:

course-ID = 12345; -- line was BIO-2109-01 (12345) Introduction to Biology
course-ID = 54321; -- line was CHM-3501-F2-01 (54321) Introduction to Chemistry
course-ID = 98765; -- line was IDS-3250-01 (98765) History of US (1860-2000)
course-ID = 45678; -- line was SPN-1234-02-F1 (45678) Spanish History (1900-2010)

The pattern I used, /[^(]+\(([^)]+)\)/, can also be written as

/ [^(]+     # 1 or more characters that are not a '('
  \(        # a literal '('. You must escape it because you don't want
            # it to start a capture group.
  ([^)]+)   # 1 or more chars that are not a ')'.
            # The surrounding '(' and ')' capture this match
  \)        # a literal ')'
/x

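The /x modifier allows you to insert whitespace, comments, and even newlines into the pattern.

In case you are unsure about /x: you can indeed write: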

while( my $line = <DATA> ) {
    if (my ($courseID) = ($line =~ / [^(]+   # …
                                     \(      # …
                                     ([^)]+) # …
                                     \)      # …
                                    /x ) ) {
        print "course-ID = $courseID; -- line was $line";
    }
}

That may not be very nice to read, but you can also store the regex in a separate variable:

my $pattern = 
    qr/ [^(]+     # 1 or more characters that are not a '('
        \(        # a literal '(' (you must escape it)
        ([^)]+)   # 1 or more chars that are not a ')'.
                  # The surrounding '(' and ')' capture this match
        \)        # a literal ')'
      /x;

Then:

if (my ($courseID) = ($line =~ $pattern)) {
    …
}
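A stored pattern is also easy to tighten. The question mentions that a generic parenthesis pattern would match something like (abc) as well. If course IDs are always exactly five digits, as in the sample data (an assumption, since the question does not state it), a stricter variant could look like this sketch. The name $id_pattern and the second test line are made up for the illustration.

```perl
use strict;
use warnings;

# Assumption: a course ID is exactly five digits, as in the sample data.
my $id_pattern = qr/ \( (\d{5}) \) /x;

for my $line (
    'BIO-2109-01 (12345) Introduction to Biology',
    'XYZ-0000-01 (abc) Not a numeric ID',
) {
    if (my ($courseID) = ($line =~ $id_pattern)) {
        print "course-ID = $courseID\n";
    }
    else {
        print "no five-digit ID in: $line\n";
    }
}
```

The first line yields 12345, while the second falls through to the else branch because (abc) contains no digits.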

Extract a matching pattern from a string using a regex in Perl
