Regular expression matches the wrong text

Date: 2011-08-11 20:05:56

Tags: php regex

Why does this happen? The regular expression skips the <a tag I want and starts the match at an earlier <a tag.

$url = 'urband.net';
$p = '%(.{0,5})<a\s+href=".*?';
$p .= $url;
$p .= '.*?"\s*>(.*?)</a>(.{0,5})%imm';

$s = file_get_contents("http://boringmachines.blogspot.com/2006/12/bitbin-herb-recordings.html");
$out = preg_match_all($p, $s, $matches, PREG_SET_ORDER);
print_r($matches);

I get this array:

Array
(
    [0] => Array
        (
            [0] => /div><a href="http://photos1.blogger.com/x/blogger/1112/3281/1600/484028/aliasEPlined.jpg"><img style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 162px; CURSOR: hand; HEIGHT: 149px" height="124" alt="" src="http://photos1.blogger.com/x/blogger/1112/3281/320/925013/aliasEPlined.jpg" width="199" border="0" /></a><span style="font-size:85%;">Due to last weeks bad weather here in Glasgow, I was unable to connect to the web and keep up those regular <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=57230462">Herb Recordings </a>mp3's. Instead, I posted a <a href="http://boringmachines.blogspot.com/2006/11/bitbin-herb-recordings.html#links">video</a> of one of their earlier releases, BitBin. Thankfully, some good has came from thsoe storms, as Herb have kindly donated another mp3, in the form of "<em>May</em>" by BitBin.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;"><a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&amp;friendID=26396670">BitBin</a> is a London based artist and had his "Alias" ep released by Herb earlier this year. He influences are both broad, and for and electronic producer, quite unusual. The likes of Brian Eno, Bola and Warp Records, sit side by side with Brian Wilson, Captain Beefheart and dEUS. His bio may explain a few things, as BitBin claims he is all about "<em>glitching his way through any field of music and reality</em>"</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">"<em>May</em>" itself is an expansive and dark slice of electronica reminiscent of Bola and Gescom. For me, however, this is akin to the music Thom Yorke has been pushing Radiohead towards over the last few years. 
The beats echo those of "<em>Idioteque</em>", and believe, me that is no bad thing.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">The "Alias" ep can be ordered<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=57230462"> here</a>, however, the cd release will feature 3 extra tracks, "<em>making it, one longer trip</em>". An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with
            [1] => /div>
            [2] => interview and podcast
            [3] =>  with
        )

)

But I expected to get:

Array
(
    [0] => Array
        (
            [0] => . An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with
            [1] => . An 
            [2] => interview and podcast
            [3] =>  with
        )

)

2 answers:

Answer 0 (score: 3)

Welcome to the joys and wonders of using regular expressions on HTML. Try using the DOM instead to find what you are looking for in the HTML.

An XPath query such as //a[contains(@href,'urband.net')] is more accurate than a regular expression.
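
As a rough illustration of that suggestion, here is a minimal sketch using PHP's DOMDocument and DOMXPath. The $html string is a made-up stand-in for the downloaded page, and the $links array is just a name chosen for this example. Note that this returns only the link itself; the few characters of text before and after it, which the question also captures, would need separate handling.

```php
<?php
// Hypothetical sample standing in for the page fetched with file_get_contents().
$html = '<div><a href="http://photos1.blogger.com/a.jpg">image</a>'
      . '<span>An <a href="http://www.urband.net/interview/bitbin/index.html">'
      . 'interview and podcast</a> with BitBin</span></div>';

$doc = new DOMDocument();
// Real-world HTML is rarely well formed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Select every <a> element whose href attribute contains the wanted host.
foreach ($xpath->query("//a[contains(@href, 'urband.net')]") as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}
print_r($links);
```

Because the query inspects each href attribute on its own, a link to another site that happens to appear earlier in the page cannot be pulled into the result.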

Answer 1 (score: 1)

Try:

$url = 'urband\.net';
$p = '%(.{0,5})<a\s+href="[^"]*';
$p .= $url;
$p .= '[^"]*"\s*>(.*?)</a>(.{0,5})%imm';
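
The reason this helps, as far as I can tell: a lazy .*? is still allowed to run past the closing quote of an href, and the engine reports the leftmost position where a match can be completed, so the match starts at the first <a on the page and stretches forward until it reaches urband.net. [^"]* cannot cross the closing quote, so any link without urband.net in its own href fails and the engine moves on. A small sketch with a made-up two-link string (the sample text and the variable names $lazy and $strict are assumptions for this demo; preg_quote is used here in place of escaping the dot by hand):

```php
<?php
// Hypothetical two-link sample; only the second link points at urband.net.
$s = '<p><a href="http://example.com/one">first</a> text. An '
   . '<a href="http://www.urband.net/interview">interview</a> with</p>';

// preg_quote escapes the dot so it matches a literal "." only.
$url = preg_quote('urband.net', '%');

// Lazy dot: may cross quotes and tags while searching for the URL.
$lazy   = '%<a\s+href=".*?' . $url . '.*?"\s*>(.*?)</a>%i';
// Negated class: must stay inside a single href value.
$strict = '%<a\s+href="[^"]*' . $url . '[^"]*"\s*>(.*?)</a>%i';

preg_match($lazy, $s, $m1);
preg_match($strict, $s, $m2);

echo $m1[0], "\n"; // starts at the example.com link
echo $m2[0], "\n"; // only the urband.net link
```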

Edit: tested with Perl:

$/ = undef;

my $str = <DATA>;
my $count = 0;

while ($str =~ /(.{0,5})<a\s+href="[^"]*urband\.net[^"]*"\s*>(.*?)<\/a>(.{0,5})/sg)
{
   print "Array\n";
   print "(\n";
   print "    [$count] => Array\n";
   print "        (\n";
   print "            [0] => $&\n";
   print "            [1] => $1\n";
   print "            [2] => $2\n";
   print "            [3] => $3\n";
   print "        )\n";
   print "\n";
   print ")\n";
   ++$count;
}

Output:

Array
(
    [0] => Array
        (
            [0] => . An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with
            [1] => . An
            [2] => interview and podcast
            [3] =>  with
        )

)