Question

I'm working on a PHP script that parses an HTML web page. I use file_get_contents to open the URL and fetch the list of contents.

Here is the code:

$links = $row['links'];
$result = file_get_contents($links);
$html_content = str_replace("<a id='rowTitle1' class", "<a id='rowTitle1' class",$result);
print $html_content;

Here is the HTML output:

<li class="zc-ssl-pg" id="row1-1" style="">
<span id="row1Time" class="zc-ssl-pg-time">6:00 PM</span>
<a id="rowTitle1" class="zc-ssl-pg-title" href='http://www.mysite.com'>The Middle</a>
<a class="zc-ssl-pg-ep" href='http://www.mysite.com'>"Thanksgiving IV"</a>

Can you please tell me how to get the values of the row1Time, rowTitle1 and zc-ssl-pg-ep tags inside row1-1 from the content fetched with file_get_contents?

Answer 1

Regular expressions are not the right tool for parsing HTML. DOM is the right tool for the job:

$dom = new DOMDocument();
$dom->loadHTML($result);
echo $dom->getElementById('row1Time')->nodeValue . "<br>";
echo $dom->getElementById('rowTitle1')->nodeValue . "<br>";
echo $dom->getElementsByTagName('a')->item(1)->nodeValue;


Because of the way the HTML is structured, this is still a bit fragile, but it will work as long as the markup doesn't change.
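
One way to make the lookup less dependent on element order is to scope the queries to the row and select by id or class with DOMXPath. The following is only a sketch under assumptions: the sample markup is copied from the question with a closing </li> added, and firstText() is a hypothetical helper, not part of PHP. In the real script, $result would be the string returned by file_get_contents($links).

```php
<?php
// Sample markup from the question (closing </li> added as an assumption).
$result = <<<'HTML'
<li class="zc-ssl-pg" id="row1-1" style="">
<span id="row1Time" class="zc-ssl-pg-time">6:00 PM</span>
<a id="rowTitle1" class="zc-ssl-pg-title" href='http://www.mysite.com'>The Middle</a>
<a class="zc-ssl-pg-ep" href='http://www.mysite.com'>"Thanksgiving IV"</a>
</li>
HTML;

// Collect parser warnings instead of printing them; real pages are rarely valid HTML.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($result);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Hypothetical helper: trimmed text of the first node matching $query, or null if none.
function firstText(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Restrict every query to the <li id="row1-1"> element.
$row = '//li[@id="row1-1"]';

$time  = firstText($xpath, $row . '//span[@id="row1Time"]');
$title = firstText($xpath, $row . '//a[@id="rowTitle1"]');
// Match the class as a whole token, so "zc-ssl-pg-ep-extra" would not match.
$episode = firstText(
    $xpath,
    $row . '//a[contains(concat(" ", normalize-space(@class), " "), " zc-ssl-pg-ep ")]'
);

echo $time, "\n", $title, "\n", $episode, "\n";
```

Selecting the episode link by its class rather than by position means an extra <a> earlier in the page would not change the result, and a null return makes a missing element easy to detect.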

Answer 2

$links = $row['links'];
$result = file_get_contents($links);
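// $html_content = str_replace("<a id='rowTitle1' class", "<a id='rowTitle1' class",$result); // that's useless!

// Match against $result: $html_content is never assigned once the line above is commented out.
preg_match('/<span id="row1Time" class="zc-ssl-pg-time">([^<]+)<\/span>/', $result, $matches);
$row1Time = $matches[1] ?? null; // null when the pattern does not match

// The single quotes around the href value must be escaped inside a single-quoted PHP string.
preg_match('/<a id="rowTitle1" class="zc-ssl-pg-title" href=\'http:\/\/www\.mysite\.com\'>([^<]+)<\/a>/', $result, $matches);
$rowTitle1 = $matches[1] ?? null;

print $result;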

Get class tags using file_get_contents
