PHP preg_match_all matches only 74 of 100 results

Asked: 2011-04-18 13:43:08

Tags: php regex web-scraping preg-match-all

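I am working on a project that involves scraping the major search engines (more specifically, checking page rank and finding similar pages). I call the search engine with curl and then use a single preg_match_all to put all the results into an array. I had no problems with Google and Bing, but the script I wrote for Yahoo works with one bug.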

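I am searching for "autobedrijf" in the Netherlands as a test. The page shows 100 results, but the final results array contains only 74. I copied the regular expression into this tool, pasted in Yahoo's page source, and all 100 results matched there.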

First results from Yahoo

But the array I get contains:

-ledensite

-Home Page

-Welkom bij autobedrijf Tolsma Exmorra

-etc.

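That means it skips 3 results after "Home Page". I tried to find any difference between results 5 and 6 (the last skipped result and the next matched one), but I found no reason why the 5th should not match.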

Here is the preg_match_all script:

$pattern = '@<div>\s*<h3>\s*<a[^<>]*\shref="([^*"]*\*\*)*([^<>]*)">
([^<>]*(<b>[^<>]*</b>)*[^<>]*(<wbr>)*[^<>]*)</a>\s*</h3>\s*</div>@siU';
preg_match_all($pattern, $result['EXE'], $matches);

The Yahoo page source for these two results is as follows:

<li><div class="res"><div><h3><a dirtyhref="http://nl.wrs.yahoo.com
/_ylt=A7x9Qb3rKqxN1XYA1PhzKAx.;_ylu=X3oDMTBydXF0bjc3BHNlYwNzcgR
wb3MDNQRjb2xvA2lyZAR2dGlkAw--/SIG=11n0lgcgp/EXP=1303157611/**http%3a
//www.autobedrijfgreijmans.nl/" class="yschttl spt" 
href="http://nl.wrs.yahoo.com/_ylt=A7x9Qb3rKqxN1XYA1PhzKAx.;
_ylu=X3oDMTBydXF0bjc3BHNlYwNzcgRwb3MDNQRjb2xvA2lyZAR2dGlkAw--/SIG
=11n0lgcgp/EXP=1303157611/**http%3a//www.autobedrijfgreijmans.nl/">
<b>Autobedrijf</b> Greijmans Weert - Toyota <wbr>Specialist Occasions 
APK</a></h3></div><div class="abstr">In het databestand van 
<b>Autobedrijf</b> Greijmans vindt u zeer eenvoudig tweedehands auto's, 
bedrijfswagens, klassiekers. Kijk snel op onze website</div><span 
class="url">www.<b>autobedrijfgreijmans.nl</b></span> - <a 
href="http://nl.wrs.yahoo.com/_ylt=A7x9Qb3rKqxN1XYA1fhzKAx./SIG=
186dh8afd/EXP=1303157611/**http%3a//74.6.239.67/search/cache%3fei=
UTF-8%26p=autobedrijf%26n=100%26va_vt=any%26vo_vt=any%26vp_vt
=any%26vst=0%26vf=all%26vm=p%26u=www.autobedrijfgreijmans.nl/
%26w=autobedrijf%26d=IxshPvbJWijU%26icp=1%26.intl=nl%26sig=yI6R7vJN31J
T92YKlVnT1g--">In de cache</a></div></li>

<li><div class="res"><div><h3><a dirtyhref="http://nl.wrs.yahoo.com
/_ylt=A7x9Qb3rKqxN1XYA1vhzKAx.;_ylu=X3oDMTBybWh0ZnN2BHNlYwNzcgRwb3MDNg
Rjb2xvA2lyZAR2dGlkAw--/SIG=11l17c21h/EXP=1303157611/**http%3a//www.
autobedrijf-tolsma.nl/" class="yschttl spt" href="http://www.autobedrijf
-tolsma.nl/">Welkom bij <b>autobedrijf</b> Tolsma Exmorra</a></h3></div>
<div class="abstr">Welkom op de vernieuwde website van <b>autobedrijf</b> 
Tolsma Exmorra incl.digitale showroom <b>...</b> Welkom bij <b>autobedrijf
</b> tolsma. Op deze site nodigen wij U uit een <b>...</b></div><span 
class="url">www.<b>autobedrijf-tolsma.nl</b></span> - <a href="http://nl.
wrs.yahoo.com/_ylt=A7x9Qb3rKqxN1XYA1_hzKAx./SIG=1840la9c5/EXP=1303157611/
**http%3a//74.6.239.67/search/cache%3fei=UTF-8%26p=autobedrijf%26n=100
%26va_vt=any%26vo_vt=any%26vp_vt=any%26vst=0%26vf=all%26vm=p%26u=
www.autobedrijf-tolsma.nl/%26w=autobedrijf%26d=YBZQIvbJWlDf%26icp=1%26.
intl=nl%26sig=zRU95PdBOTfII93dZ411ZA--">In de cache</a></div></li>

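I have been stuck on this for more than 5 hours, and I cannot figure out why the regex tool matches both results while preg_match_all in PHP matches only the second.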

In case it matters: I am testing on Windows with Apache and PHP 5.3.5.

Any suggestions are welcome. If you are interested, I can provide additional examples and test code.
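When preg_match_all returns fewer matches than another regex engine on the same input, one thing that may be worth ruling out is a PCRE runtime error, which PHP reports through preg_last_error() rather than through a warning. The sketch below is only a diagnostic aid and not a confirmed cause of the behavior described here. The helper name match_all_checked, the simplified pattern, and the sample HTML are all hypothetical and are not taken from the original script.

```php
<?php
// Hypothetical helper: run preg_match_all and fail loudly on a PCRE error
// instead of silently continuing with a possibly incomplete result.
function match_all_checked($pattern, $subject)
{
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    $error = preg_last_error();
    if ($count === false || $error !== PREG_NO_ERROR) {
        // The code can be compared against constants such as
        // PREG_BACKTRACK_LIMIT_ERROR or PREG_RECURSION_LIMIT_ERROR.
        throw new RuntimeException('PCRE error code: ' . $error);
    }
    return $matches;
}

// Illustrative input only; not real Yahoo output.
$html = '<h3><a href="http://example.com/a">A</a></h3>'
      . '<h3><a href="http://example.com/b">B</a></h3>';

// Simplified stand-in pattern: capture the href and the link text.
$pattern = '@<h3>\s*<a[^<>]*\shref="([^"]*)"[^<>]*>(.*?)</a>\s*</h3>@si';

$rows = match_all_checked($pattern, $html);
foreach ($rows as $row) {
    echo $row[1], ' => ', $row[2], PHP_EOL;
}
```

If the helper throws on the real page source, the pcre.backtrack_limit and pcre.recursion_limit settings in php.ini would be the next places to look.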

2 Answers:

Answer 0 (score: 2)

An unpolished starting point:

<?php

$url = 'http://nl.search.yahoo.com/search?p=autobedrijf&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-732';

$html = file_get_contents($url);

$doc = new DOMDocument;
libxml_use_internal_errors(TRUE);
$doc->loadHTML($html);
libxml_use_internal_errors(FALSE);

$titles = $doc->getElementsByTagName('h3');
foreach($titles as $t){
    echo $t->nodeValue . PHP_EOL;
}

?>
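The same DOM approach can be extended to collect both the link target and the title of each result, which is what the regular expression in the question was trying to capture. This is a minimal sketch that runs against an inline sample modeled on the markup quoted in the question (shortened, and an assumption about the page structure rather than live Yahoo output). The XPath expression and the $results array layout are illustrative choices.

```php
<?php
// Shortened sample modeled on the result markup quoted in the question.
// It is an assumption for illustration, not a live Yahoo response.
$html = <<<HTML
<ol>
<li><div class="res"><div><h3><a class="yschttl spt" href="http://www.autobedrijfgreijmans.nl/"><b>Autobedrijf</b> Greijmans Weert</a></h3></div></div></li>
<li><div class="res"><div><h3><a class="yschttl spt" href="http://www.autobedrijf-tolsma.nl/">Welkom bij <b>autobedrijf</b> Tolsma Exmorra</a></h3></div></div></li>
</ol>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect errors quietly
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Select the title link inside each result block.
$xpath = new DOMXPath($doc);
$results = array();
foreach ($xpath->query('//div[@class="res"]//h3/a[@href]') as $a) {
    $results[] = array(
        'url'   => $a->getAttribute('href'),
        // textContent flattens nested <b> tags into plain text
        'title' => trim($a->textContent),
    );
}

print_r($results);
```

Because the parser builds a tree, nested tags such as <b> inside the title do not need special handling the way they do in a regular expression.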

Answer 1 (score: 1)

I suggest using the Yahoo API. Yahoo offers a very good service for retrieving its search results through the Yahoo Query Language (YQL): http://developer.yahoo.com/yql/

Example:

http://developer.yahoo.com/yql/console/#h=select%20*%20from%20search.web%20where%20query%3D%22autobedrijf%22