Movie scraper: regular expression doesn't catch every movie

Asked: 2012-02-20 19:05:48

Tags: regex perl

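This is what my program outputs from this link (http://www.rottentomatoes.com/movie/box_office.php). As you can see, I'm missing some of the movies on the page; for example, number 18 (One for the Money) isn't there. Can anyone check my regular expression and help me figure out why it isn't catching every movie, or what is wrong in my code that I can't find?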

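I should add that I'm using the lynx command to fetch the data. Yes, I have to use it =(. I've updated the code to show how I get the information from the web page.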

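Also, I only want to print 35 characters of the movie title, so if a title is longer than that, I want to truncate everything after the 35th character.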

Output:

##  ##  Movie Title                           Weekend      Cume   T-Meter
1   2   Safe House                             $78.2M     $7.7k       52%
2   1   The Vow                                $85.5M     $8.0k       30%
3   --  Ghost Rider: Spirit of Vengeance       $22.0M     $6.9k       15%
4   3   Journey 2: The Mysterious Island       $53.2M     $5.7k       43%
5   --  This Means War                         $19.2M     $5.5k       25%
6   4   Star Wars: Episode I - The Phantom Menace (in 3D) $33.7M     $3.0k       57%
7   5   Chronicle                              $51.0M     $2.9k       84%
8   6   The Woman in Black                     $45.3M     $2.6k       63%
9   --  The Secret World of Arrietty            $6.4M     $4.2k       93%
10  7   The Grey                               $47.9M     $1.4k       78%
11  9   The Descendants                        $75.0M     $2.4k       89%
12  13  The Artist                             $27.4M     $2.9k       97%
13  8   Big Miracle                            $16.6M     $1.3k       73%
14  14  Hugo                                   $66.7M     $2.9k       93%
15  11  Red Tails                              $47.5M     $1.4k       36%
16  10  Underworld Awakening                   $61.3M     $1.3k       28%
17  18  The Iron Lady                          $24.4M     $1.7k       53%
19  15  Extremely Loud & Incredibly Close      $30.6M     $1.1k       45%
20  17  Contraband                             $65.7M     $1.2k       49%
21  23  Alvin and the Chipmunks: Chipwrecked  $129.7M     $1.2k       13%
22  20  Mission: Impossible Ghost Protocol    $207.3M     $1.8k       93%
23  22  Tinker Tailor Soldier Spy              $22.7M     $2.6k       84%
24  29  The Adventures of Tintin               $76.4M     $1.3k       75%
25  33  A Separation                            $2.1M     $6.2k       99%
27  31  Albert Nobbs                            $2.4M     $1.6k       53%
28  --  Thin Ice                                $0.2M     $3.6k       72%
29  36  My Week with Marilyn                   $13.6M     $1.5k       84%
30  37  A Dangerous Method                      $5.2M     $1.7k       77%
31  35  Puss in Boots                         $149.0M     $1.0k       83%
33  53  In Darkness                             $0.1M     $5.5k       86%
34  44  We Need to Talk About Kevin             $0.6M     $4.0k       80%
36  48  W.E.                                    $0.2M     $2.5k       13%
37  47  Rampart                                 $0.1M     $1.8k       73%
38  52  Coriolanus                              $0.3M     $2.9k       94%
39  --  Bullhead                               $33.6k     $4.8k       86%
40  --  Undefeated                             $30.9k     $6.2k       92%
42  55  Chico & Rita                           $56.2k     $5.3k       93%
43  54  Pariah                                  $0.7M     $1.5k       96%


Biggest Debut: Ghost Rider: Spirit of Vengeance (3)
Weakest Debut: Undefeated (40)
Biggest Gain: In Darkness (20 places)
Biggest Loss: Underworld Awakening (6 places)

CODE:

my $pageToGrab = "http://www.rottentomatoes.com/movie/box_office.php";
my $command = "/usr/bin/lynx -dump -width=150 $pageToGrab";
my $tempPageFile = `$command`;


print "##  "."##  "."Movie Title                           "."Weekend      "."Cume   "."T-Meter  \n";
do
{
        if ($tempPageFile =~ /(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+\[\d+\](.*)\s+(\d+)\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\d+)/g)
        {
            $curweek[$i] = $1;
            $lastweek[$i] = $2;
            $tmeter[$i] = $3;
            $title[$i] = $4;
            $weekend[$i] = $7;
            $cume[$i] = $8;
            printf("%-4s%-4s%-38s%7s%10s%10s\n",$curweek[$i], $lastweek[$i], $title[$i], $weekend[$i], $cume[$i], $tmeter[$i]);

            if ($lastweek[$i] ne '--')
            {
                    $gain = $lastweek[$i] - $curweek[$i];
            }

            if( $gain > $largest)
            {
                    $largest = $gain;
                    $biggestgaintitle = $title[$i];
            }

            if( $gain < $smallest)
            {
                    $smallest = $gain;
                    $biggestlosstitle = $title[$i];
            }

            if( $lastweek[$i] eq '--')
            {
                    $moviedebut[$j] = $curweek[$i];
                    $lastmovie = $title[$i];
                    $j++;
            }
            $i++;
    }
}
while($i < 38);

1 Answer:

Answer 0 (score: 2)

Here is number 18:

18 12 2% [82]One for the Money 4 $0.8M $25.5M $830 933

Note that the third dollar amount ($830) has no M or k suffix. Use [Mk]? there, and possibly for all three dollar amounts:

if ($tempPageFile =~ /(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+\[\d+\](.*)\s+(\d+)\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk]?)\s+(\d+)/g) {
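To see the effect on the row quoted above, here is a minimal sketch that runs both variants of the pattern against that single line. The helper `row_pattern` and the variable names are made up for illustration. It also escapes the dot in the decimal part and makes the title group non-greedy; both are assumptions about the intent, not part of the original pattern.

```perl
use strict;
use warnings;

# Row 18 exactly as quoted from the lynx dump above.
my $line = '18 12 2% [82]One for the Money 4 $0.8M $25.5M $830 933';

# Hypothetical helper: builds the row pattern around a given
# "dollar amount" sub-pattern so the two variants can be compared.
sub row_pattern {
    my ($money) = @_;
    return qr/(\d+)\s+(\d+|--)\s+(\d+%)\s+\[\d+\](.*?)\s+(\d+)\s+($money)\s+($money)\s+($money)\s+(\d+)/;
}

# Suffix required on every amount, as in the question.
my $required = row_pattern(qr/\$\d+(?:\.\d+)?[Mk]/);
# Suffix optional on every amount, as suggested above.
my $optional = row_pattern(qr/\$\d+(?:\.\d+)?[Mk]?/);

my $old_matches = ($line =~ $required) ? 1 : 0;

# In list context a successful match returns the captured groups.
my @fields      = $line =~ $optional;
my $new_matches = @fields ? 1 : 0;

print "suffix required: ", ($old_matches ? "match" : "no match"), "\n";
print "suffix optional: ", ($new_matches ? "match" : "no match"), "\n";

# Same capture groups the question's code stores: $1, $2, $3, $4, $7, $8.
my ($cur, $last, $tmeter, $title, $weekend, $cume) = @fields[0, 1, 2, 3, 6, 7];
printf "%-4s%-4s%-38s%7s%10s%10s\n", $cur, $last, $title, $weekend, $cume, $tmeter;
```

Making the suffix optional on all three amounts would also cover rows where another column happens to be a plain dollar figure.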

To truncate:

$title[$i] = substr $4, 0, 35;

perldoc -f substr
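As a small sketch of the truncation, the snippet below applies `substr` to one long and one short title taken from the output in the question. It also shows an alternative not mentioned above: a precision in the `printf` format (`%-38.35s`) cuts the string at 35 characters while still padding the column to 38. The array names and the trailing `|` marker are illustrative only.

```perl
use strict;
use warnings;

# Two titles from the output in the question: one too long, one short.
my @titles = (
    'Star Wars: Episode I - The Phantom Menace (in 3D)',
    'Chronicle',
);

# substr returns at most 35 characters; shorter strings are returned whole.
my @short = map { substr($_, 0, 35) } @titles;

# Alternative: let the format do the work. ".35" truncates to 35
# characters and "-38" left-justifies within a 38-character column.
# The '|' only marks where the column ends.
my @formatted = map { sprintf('%-38.35s|', $_) } @titles;

print "$_\n" for @formatted;
```

Either approach keeps the columns aligned. `substr` changes the stored title, while the format precision only affects what is printed.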