I have page content like this:
<table width="100%" >
<!--Başla--><tr>
<td><a href="http://www.example.com/duyurular/2014/ekim/kutlama.html" class="duyuru1" target="_blank">• Kutlama
<br /><span class="hmk"> Authority 28.10.2014</span></td></tr><tr><td><hr /></td></tr><!--Son-->
<!--Başla--><tr>
<td><a href="http://www.example.com/duyurular/2014/ekim/genel-kurul.html" class="duyuru1" target="_blank">• Genel Kurul
<br /><span class="hmk"> Authority 28.10.2014</span></td></tr><tr><td><hr /></td></tr><!--Son-->
<!--Başla--><tr>
<td><a href="http://www.example.com/duyurular/2014/ekim/katilimci.pdf" class="duyuru1" target="_blank">• Katılımcı
<br /><span class="hmk"> Authority 22.10.2014</span></td></tr><tr><td><hr /></td></tr><!--Son-->
<!----duyuru başlangıc--->
<tr >
<td ><div align="right"><a href="http://www.example.com/arsiv/duyuru/index.html" target="_blank" class="hmk"><span class="style1">Duyuru Arşivi</span> </a></div>
<!-- Güncel Duyurular Bitişi-->
</td>
</tr>
</table>
I want to extract the links
http://www.example.com/duyurular/2014/ekim/kutlama.html,
http://www.example.com/duyurular/2014/ekim/genel-kurul.html and
http://www.example.com/duyurular/2014/ekim/katilimci.pdf,
the link texts Kutlama, Genel Kurul and Katılımcı, as well as the dates and Authority. As you can see, the HTML does not follow any standard.
I tried it myself, but of course I did not manage it. Can you help me?
Answer 0 (score: 1)
Some people don't like it, but a regular expression can sometimes be used to extract content from HTML:
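// Capture 1: URL, capture 2: link text, capture 3: date after "Authority".
// The bullet is accepted either as the entity &bull; or as a literal • character.
$pattern = '#"(https?:[^"]+)"[^>]*>\s*(?:&bull;|•)\s*([^<]+).+?Authority\s+([\d.]+)#';
if (preg_match_all($pattern, $html, $matches)) {
    $urls = $matches[1];
    $labels = $matches[2];
    $dates = $matches[3];
}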
$matches then contains:
[1] => Array
(
[0] => http://www.example.com/duyurular/2014/ekim/kutlama.html
[1] => http://www.example.com/duyurular/2014/ekim/genel-kurul.html
[2] => http://www.example.com/duyurular/2014/ekim/katilimci.pdf
)
[2] => Array
(
[0] => Kutlama
[1] => Genel Kurul
[2] => Katılımcı
)
[3] => Array
(
[0] => 28.10.2014
[1] => 28.10.2014
[2] => 22.10.2014
)
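You will probably want to trim() all of the results, since the captured link text includes the line break that follows it.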
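As a rough sketch of that trimming step, the captures can be collected into one record per announcement. The sample `$html` below is just two rows copied from the question. The array keys `url`, `label` and `date` are names chosen here for illustration. The pattern is a variant that assumes the bullet appears either as `&bull;` or as a literal `•`.

```php
<?php
// Sample input: two rows copied from the question's table.
$html = <<<'HTML'
<tr>
<td><a href="http://www.example.com/duyurular/2014/ekim/kutlama.html" class="duyuru1" target="_blank">• Kutlama
<br /><span class="hmk"> Authority 28.10.2014</span></td></tr>
<tr>
<td><a href="http://www.example.com/duyurular/2014/ekim/katilimci.pdf" class="duyuru1" target="_blank">• Katılımcı
<br /><span class="hmk"> Authority 22.10.2014</span></td></tr>
HTML;

// Assumption: the bullet is either the entity &bull; or a literal • character.
$pattern = '#"(https?:[^"]+)"[^>]*>\s*(?:&bull;|•)\s*([^<]+).+?Authority\s+([\d.]+)#';

$announcements = [];
// PREG_SET_ORDER groups the captures per match instead of per capture group.
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $announcements[] = [
            'url'   => trim($m[1]),
            'label' => trim($m[2]), // removes the trailing line break
            'date'  => trim($m[3]),
        ];
    }
}

print_r($announcements);
```

With PREG_SET_ORDER each element of `$matches` holds one complete match, so the URL, label and date of a row stay together and no index bookkeeping across three parallel arrays is needed.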