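Splitting a line into two parts

Date: 2010-11-29 00:20:46

Tags: regex perl

I cut and pasted the track listing of a George Michael DVD from Amazon into $str, and the code below processes it by splitting on the leading two-digit numbers: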

$str = "20 Fastlove 21 Jesus To A Child 22 Spinning the Wheel 23 Older 24 Outside 25 As (with Mary J. Blige) 26 Freeek! 27 Amazing 28 John and Elvis are Dead 29 Flawless (Go To The City) 30 Shoot The Dog 31 Roxanne 32 An Easier Affair 33 If I Told You That (with Whitney Houston) 34 Waltz Away Dreaming 35 Somebody To Love 36 I Can’t Make You Love Me 37 Star People '97 38 You Have Been Loved 39 Killer/ Papa Was A RollIn Stone 40 Round Here";

while ($str =~ /(\d{2}) (\S+)/g) {
        print "$1 $2\n";
}

Result:

20 Fastlove
21 Jesus
22 Spinning
23 Older
24 Outside
25 As
26 Freeek!
27 Amazing
28 John
29 Flawless
30 Shoot
31 Roxanne
32 An
33 If
34 Waltz
35 Somebody
36 I
37 Star
97 38
39 Killer/
40 Round

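The above sort of works, but it does not include the full track names. Any suggestions on how to get the result I want? The result I expected, or wanted, is: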

20 Fastlove
21 Jesus To A Child
22 Spinning the Wheel
[etc.]

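7 answers:

Answer 0: (score: 6)

As Ignacio said, this cannot be done with 100% accuracy, because track names can contain numbers. But since you can probably assume that the track numbers are consecutive, you can get close to 100%: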

my $str = "20 Fastlove 21 Jesus To A Child 22 Spinning the Wheel 23 Older 24 Outside 25 As (with Mary J. Blige) 26 Freeek! 27 Amazing 28 John and Elvis are Dead 29 Flawless (Go To The City) 30 Shoot The Dog 31 Roxanne 32 An Easier Affair 33 If I Told You That (with Whitney Houston) 34 Waltz Away Dreaming 35 Somebody To Love 36 I Cant Make You Love Me 37 Star People '97 38 You Have Been Loved 39 Killer/ Papa Was A RollIn Stone 40 Round Here";

my ($track) = ($str =~ /^(\d+)/) or die "No initial track number";

my $next;
while ($next = $track + 1 and
       $str =~ s/^\s*             # optional initial whitespace
                 $track \s+       # track number followed by whitespace
                 (\S.*?)          # title begins with non-whitespace
                 (?= \s+ $next \s # title stops at next track #
                     | $ )        # or end-of-string
                //x) {
  print "$track $1\n";
  $track = $next;
}

die "$str left over" if $str =~ /\S/; # sanity check

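This modifies $str, so make a copy first if you need the original.

It will fail if a track's title contains the next track number as a separate word, but that should be fairly uncommon. It will also fail if a track is missing or the track numbers are not consecutive.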
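If modifying $str is a concern, the same consecutive-number idea can be wrapped in a function that works on its own copy. The following is only a sketch: the name split_tracks and the shortened sample string are assumptions made for illustration, and \z is used in place of $ for the end-of-string test.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): split a flattened
# track listing into [number, title] pairs, assuming consecutive numbers.
sub split_tracks {
    my ($str) = @_;    # a copy, so the caller's string is left untouched
    my ($track) = $str =~ /^\s*(\d+)/
        or die "No initial track number";

    my @tracks;
    while (1) {
        my $next = $track + 1;
        # Remove "<track> <title>" from the front; the title ends just
        # before the next track number or at the end of the string.
        $str =~ s/^\s* $track \s+ (\S.*?) (?= \s+ $next \s | \s* \z )//x
            or last;
        push @tracks, [$track, $1];
        $track = $next;
    }

    die "'$str' left over" if $str =~ /\S/;    # sanity check
    return @tracks;
}

my @tracks = split_tracks(
    "36 I Can't Make You Love Me 37 Star People '97 38 You Have Been Loved");
print "$_->[0] $_->[1]\n" for @tracks;
```

Returning pairs rather than printing keeps the parsing separate from the output, so the caller decides how to format each track.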

Answer 1: (score: 2)

A variation on cjm's answer that scans the input string non-destructively:

if ($str =~ /^(\d+)/) {
    my ($current, $next) = ($1, $1 + 1);
    while ($str =~ /\G *$current ((?:(?! *$next).)+)/g) {
        print "$current $1\n";
        ($current, $next) = ($next, $next + 1);
    }
}

Answer 2: (score: 2)

Here is another approach (also on ideone.com):

while ($str =~ /(?<!\S)(\d+)\s+((?!\d+\s)\S+(?:\s+(?!\d+\s)\S+)*)/g) {
    print "$1 $2\n";
}

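This assumes that any sequence of one or more digits that is followed by whitespace and not preceded by a non-whitespace character is a track number. That rules out the '97 in the title of track 37, but nothing prevents a song title from containing a bare number.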
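For readers who find the one-line pattern hard to follow, here is the same regex spelled out with the /x modifier and wrapped in a function. This is a sketch only: the name split_with_lookarounds and the shortened sample string are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the same pattern as above, written with x so that
# each part can carry a comment.
sub split_with_lookarounds {
    my ($str) = @_;
    my @tracks;
    while ($str =~ /
            (?<!\S)                      # start of string or right after whitespace
            (\d+) \s+                    # track number, then whitespace
            (                            # the title:
                (?!\d+\s) \S+            #   a first word that is not itself a number
                (?: \s+ (?!\d+\s) \S+ )* #   more words under the same restriction
            )
        /gx) {
        push @tracks, "$1 $2";
    }
    return @tracks;
}

print "$_\n"
    for split_with_lookarounds("37 Star People '97 38 You Have Been Loved");
```

The negative lookahead (?!\d+\s) is what ends a title: as soon as the next word is a bare number followed by whitespace, the title stops and that number is tried as the next track number.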

Overall, I think @cjm's consecutive-numbers idea is probably your best bet.

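Answer 3: (score: 2)

I've upvoted one of the answers here because I think it answers your specific question well, apart from the "track name contains the track number of the next track" problem. Albums with that property will be few and far between.

But I have to say it: your problem really stems from having the data in the $str format in the first place. For example, if you look at the source of this page, you could easily extract the track names from the HTML itself, regardless of what the names contain.

That's because the HTML clearly delineates the tracks. Now, I don't know whether that information is available to you, but you may want to rethink how you obtain the data. It might make your life easier. Or, if not easier, at least more accurate :-)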

Answer 4: (score: 1)

As Ignacio Vazquez-Abrams said, song names containing digits will be a problem, but this should work for everything except "Star People '97".

/(\d{2}) (\D+)/g

Note: I'm not a Perl coder, but the regex works correctly on rubular.com (except for the "'97" case mentioned above).
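To see where this pattern stops working, here is a minimal Perl sketch that runs it in a loop. The helper name split_on_nondigits and the shortened sample string are assumptions for illustration. Because \D+ also captures the space before the next track number, the sketch trims trailing whitespace from each title.

```perl
use strict;
use warnings;

# Hypothetical helper: every run of non-digits after a two-digit number
# is treated as the title.
sub split_on_nondigits {
    my ($str) = @_;
    my @tracks;
    while ($str =~ /(\d{2}) (\D+)/g) {
        my ($number, $title) = ($1, $2);
        $title =~ s/\s+\z//;    # drop the space captured before the next number
        push @tracks, "$number $title";
    }
    return @tracks;
}

print "$_\n"
    for split_on_nondigits(
        "36 I Can't Make You Love Me 37 Star People '97 38 You Have Been Loved");
```

On this sample the title of track 37 is cut off at the apostrophe ("Star People '"), because \D+ cannot continue past the 9 in '97.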

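Answer 5: (score: 1)

Your best bet is something like the following. It still has a problem, though, if one of the track titles contains the number of the next track.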

#!/usr/bin/perl

use strict;
use warnings;

my $str = "20 Fastlove 21 Jesus To A Child 22 Spinning the Wheel 23 Older 24 Outside 25 As (with Mary J. Blige) 26 Freeek! 27 Amazing 28 John and Elvis are Dead 29 Flawless (Go To The City) 30 Shoot The Dog 31 Roxanne 32 An Easier Affair 33 If I Told You That (with Whitney Houston) 34 Waltz Away Dreaming 35 Somebody To Love 36 I Can’t Make You Love Me 37 Star People '97 38 You Have Been Loved 39 Killer/ Papa Was A RollIn Stone 40 Round Here";

my @parts = split " ", $str;

my %songs;
my $track     = shift @parts;
my $new_track = $track + 1;
my $song      = "";
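while (@parts) {
    my $part = shift @parts;
    unless ($part eq $new_track) {
        # Still inside the current title: append the word.
        $song .= length($song) ? " $part" : $part;
        next;
    }
    # Reached the next track number: store the finished title.
    $songs{$track} = $song;
    $song          = "";
    $track         = $new_track;
    $new_track     = $track + 1;
}
# Store the last track, which has no following number to trigger the save.
$songs{$track} = $song;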

for my $track (sort { $a <=> $b } keys %songs) {
    print "$track\t$songs{$track}\n";
}

Answer 6: (score: 1)

You were really close:

$str = "20 Fastlove 21 Jesus To A Child 22 Spinning the Wheel 23 Older 24 Outside 25 As (with Mary J. Blige) 26 Freeek! 27 Amazing 28 John and Elvis are Dead 29 Flawless (Go To The City) 30 Shoot The Dog 31 Roxanne 32 An Easier Affair 33 If I Told You That (with Whitney Houston) 34 Waltz Away Dreaming 35 Somebody To Love 36 I Can’t Make You Love Me 37 Star People '97 38 You Have Been Loved 39 Killer/ Papa Was A RollIn Stone 40 Round Here";

while ($str =~ /(\d{2}[^\d]*)/g) {
    print "$1\n";
}

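Note the regex: I used the [^ ] syntax to mean "not these characters". [^\d] means "not a digit", and the trailing asterisk means zero or more.

By specifying that the match should continue until a digit is found, I can capture the rest of the name (that is, until it reaches Star People '97, where it breaks). So close...

If you need the number and the title in two separate variables, you can use parentheses.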

$str = "20 Fastlove 21 Jesus To A Child 22 Spinning the Wheel 23 Older 24 Outside 25 As (with Mary J. Blige) 26 Freeek! 27 Amazing 28 John and Elvis are Dead 29 Flawless (Go To The City) 30 Shoot The Dog 31 Roxanne 32 An Easier Affair 33 If I Told You That (with Whitney Houston) 34 Waltz Away Dreaming 35 Somebody To Love 36 I Can’t Make You Love Me 37 Star People '97 38 You Have Been Loved 39 Killer/ Papa Was A RollIn Stone 40 Round Here";

while ($str =~ /(\d{2})([^\d]*)/g) {
    my $number = $1;
    my $title = $2;

    print "$number: $title\n";
}

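I'm still trying to figure out how to get Star People '97 to work. I think it has something to do with the leading single quote. All of the track numbers are preceded by a space or sit at the start of the line. I wonder whether that could be used?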
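One way to use that observation, offered here only as a sketch: require each track number to sit at the start of the string or directly after whitespace, and let the title run until the next such number. The helper name split_on_standalone_numbers and the shortened sample string are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: a track number is two digits that are not preceded
# by a non-space character and are followed by whitespace.
sub split_on_standalone_numbers {
    my ($str) = @_;
    my @tracks;
    while ($str =~ /(?<!\S)(\d{2})\s+(.*?)(?=\s+\d{2}\s|\s*\z)/g) {
        push @tracks, [$1, $2];    # [number, title]
    }
    return @tracks;
}

for my $t (split_on_standalone_numbers(
        "36 I Can't Make You Love Me 37 Star People '97 38 You Have Been Loved")) {
    my ($number, $title) = @$t;
    print "$number: $title\n";
}
```

The '97 is skipped because it is preceded by an apostrophe rather than whitespace. A title containing a standalone two-digit number would still be split in the wrong place, so this shares the limitation noted in the other answers.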