I am using this code:
<?php
/* Get all links from http://www.codacons.it/rassegna_quest.asp?idSez=14 */
$page = file_get_contents('http://www.codacons.it/rassegna_quest.asp?idSez=14');
preg_match_all("/<a.*>(.*?)<\/a>/", $page, $matches, PREG_SET_ORDER);
echo "All links : <br/>";
foreach($matches as $match){
echo $match[1]."<br/>";
}
?>
But it does not parse these links from the page http://www.codacons.it/rassegna_quest.asp?idSez=14:
'Questionario':OFFICINE PER L'ASSISTENZA E MANUTENZIONI VEICOLI
'Questionario':RIVENDITORE AUTO USATE
'Questionario':RACCOLTA RICICLATA DEI RIFIUTI DI IMBALLAGGI IN PLASTICA
'Questionario':DONNE E POLITICA
Why?
Answer 0 (score: 1)
I suppose I should start with the customary "Don't parse HTML with regex". With XPath (via DOMXPath) this is easy:
$dom = new DOMDocument();
@$dom->loadHTML($page);
$dom_xpath = new DOMXPath($dom);
$entries = $dom_xpath->evaluate("//a");
foreach ($entries as $entry) {
print $entry->nodeValue;
}
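As a slightly fuller sketch of the same DOMXPath idea, the snippet below collects both the href and the link text. The inline $page markup is a hypothetical stand-in for the downloaded page, not the real site content, and the whitespace collapsing assumes the anchor text may be spread over several lines.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$page = <<<HTML
<html><body>
<a href="articolo.asp?idInfo=1" title="Dettaglio notizia">
    'Questionario':DONNE
    E POLITICA
</a>
<a href="other.asp">Home</a>
</body></html>
HTML;

$dom = new DOMDocument();
// Collect libxml warnings about sloppy HTML instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];
// Select only anchors that actually carry an href attribute.
foreach ($xpath->query('//a[@href]') as $a) {
    // Collapse runs of whitespace (including newlines) into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $a->textContent));
    $links[] = ['href' => $a->getAttribute('href'), 'text' => $text];
}

print_r($links);
```

Because the parser works on the document tree, line breaks inside a link do not affect whether the link is found.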
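But if you must go the regex route, I think the greedy star .* is the root of your problem: in <a.*> it can run past the end of the opening tag and swallow later links on the same line. Try this: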
preg_match_all("@<a[^>]+>(.+?)</a>@", $page, $matches, PREG_SET_ORDER);
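To see the difference, here is a small sketch that runs the original greedy pattern and a non-greedy one against a hypothetical three-link sample. Note one assumption: the s modifier (dot matches newline) is added here and is not part of the pattern above; it only matters if the link text on the real page spans more than one line.

```php
<?php
// Hypothetical sample: two links on one line, then one whose text spans lines.
$page = "<a href=\"a.asp\">First</a> <a href=\"b.asp\">Second</a>\n"
      . "<a href=\"c.asp\">\n  Third\n</a>";

// Original pattern: "<a.*>" is greedy, so it runs to the last ">" it can use
// on the line, and "." stops at newlines, so the multi-line link is missed.
preg_match_all('/<a.*>(.*?)<\/a>/', $page, $greedy, PREG_SET_ORDER);

// "[^>]+" cannot leave the opening tag, "(.+?)" is lazy, and the "s"
// modifier (an assumption added for this sketch) lets "." cross newlines.
preg_match_all('@<a[^>]+>(.+?)</a>@s', $page, $lazy, PREG_SET_ORDER);

echo "Greedy pattern:\n";
foreach ($greedy as $match) {
    echo '  ' . trim($match[1]) . "\n";
}

echo "Non-greedy pattern:\n";
foreach ($lazy as $match) {
    echo '  ' . trim($match[1]) . "\n";
}
```

On this sample the greedy pattern reports only "Second", while the non-greedy one reports all three links.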
Answer 1 (score: 0)
Ah, never mind...
$page = file_get_contents('http://www.codacons.it/rassegna_quest.asp?idSez=14');
preg_match_all('#<a href="articolo(.*?)" title="Dettaglio notizia">(.*?)</a>#is', $page, $matches);
$count = count($matches[1]);
for($i = 0; $i < $count; $i++){
echo '<a href="articolo'.$matches[1][$i].'">'.trim(strip_tags(preg_replace('#(\s){2,}#is', '', $matches[2][$i]))).'</a>';
}
Result:
<a href="articolo.asp?idInfo=138400&id=">'Questionario':OFFICINE PER L'ASSISTENZA E MANUTENZIONI VEICOLI</a>
<a href="articolo.asp?idInfo=138437&id=">'Questionario':RIVENDITORE AUTO USATE</a>
<a href="articolo.asp?idInfo=127900&id=">'Questionario':RACCOLTA RICICLATA DEI RIFIUTI DI IMBALLAGGI IN PLASTICA</a>
<a href="articolo.asp?idInfo=138861&id=">'Questionario':DONNE E POLITICA</a>