Question

Possible duplicate:
How to parse and process HTML with PHP?

I am using the following code to fetch the HTML content of a website from a given URL.

**Code:**

=================================================================

example URL: http://www.qatarsale.com/EnMain.aspx

$regexp = '/<div id="UpdatePanel4">(.*?)<\/div>/i';

@preg_match_all($regexp, @file_get_contents('http://www.qatarsale.com/EnMain.aspx'), $matches, PREG_SET_ORDER);

But $matches comes back as an empty array. I want to get all the HTML content found inside the div with id="UpdatePanel4".

If anyone has a solution, please suggest it.

Thanks

Answer 1

First, make sure the server allows you to fetch the data.

Second, use an HTML parser to parse the data.

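// Fetch the page; file_get_contents() returns false on failure
$html = @file_get_contents('http://www.qatarsale.com/EnMain.aspx');
if ($html === false || $html === '') {
  die('can not get the content!');
}
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
// DOMElement for the target div, or null if no element has that id
$content = $doc->getElementById('UpdatePanel4');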
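
To turn the element returned by getElementById() into an HTML string, one option is to serialize its child nodes one by one. The following is a minimal sketch, not part of the original answer: the helper name innerHtmlById and the sample markup are made up for illustration, and the inline string stands in for the fetched page so that the snippet runs without network access.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><div id="UpdatePanel4"><p>Hello</p><div>nested</div></div></body></html>';

// Hypothetical helper: returns the inner HTML of the element with the given id,
// or null when no such element exists.
function innerHtmlById(string $html, string $id): ?string
{
    // Collect parser warnings instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $element = $doc->getElementById($id);
    if ($element === null) {
        return null;
    }

    // Serialize each child node; nested divs are kept intact.
    $inner = '';
    foreach ($element->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

echo innerHtmlById($html, 'UpdatePanel4');
```

The exact whitespace of the serialized output can vary between libxml versions, so it is safer to compare the result loosely than byte for byte.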

Answer 2

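// Gets the webpage
$html = @file_get_contents('http://www.qatarsale.com/EnMain.aspx');

$startingTag = '<div id="UpdatePanel4">';
// Finds the position of the '<div id="UpdatePanel4">' tag
$startPos = strpos($html, $startingTag);
if ($startPos === false) {
  die('starting tag not found');
}
// Position of the first character after the opening tag
$contentStart = $startPos + strlen($startingTag);
// Get the position of the closing div
$endPos = strpos($html, '</div>', $contentStart);
if ($endPos === false) {
  die('closing tag not found');
}
// Get the content between the two positions
// (the third argument of substr() is a length, not an end position)
$contents = substr($html, $contentStart, $endPos - $contentStart);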

If the UpdatePanel4 div contains more divs, you will need to do a bit more work.
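
One way that extra work might look is to count how deeply the divs are nested and stop at the closing tag that brings the depth back to zero. This is only a sketch under stated assumptions: the function name extractDivContents and the sample markup are hypothetical, and the scan is still plain string matching, so a "<div" inside a comment or a script would confuse it.

```php
<?php
// Hypothetical helper: returns everything between $startingTag and its matching </div>,
// or null if the tag is missing or the divs are unbalanced.
function extractDivContents(string $html, string $startingTag): ?string
{
    $startPos = strpos($html, $startingTag);
    if ($startPos === false) {
        return null;
    }

    $contentStart = $startPos + strlen($startingTag);
    $depth = 1;               // we are inside the starting div
    $offset = $contentStart;

    // Visit every opening or closing div tag after the starting tag.
    while (preg_match('/<(\/?)div\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE, $offset)) {
        $tagPos = $m[0][1];
        $isClosing = ($m[1][0] === '/');
        $depth += $isClosing ? -1 : 1;

        if ($depth === 0) {
            // This closing tag matches the starting tag.
            return substr($html, $contentStart, $tagPos - $contentStart);
        }
        $offset = $tagPos + strlen($m[0][0]);
    }

    return null; // ran out of tags before the starting div was closed
}

// Hypothetical sample markup with a nested div.
$sample = '<div id="UpdatePanel4"><div>a</div><p>b</p></div><div>after</div>';
$contents = extractDivContents($sample, '<div id="UpdatePanel4">');
// $contents is '<div>a</div><p>b</p>'
```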

Answer 3

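That just won't help. Even if you manage to get the regexp working, there are two problems with the way you are using it:

What if the server changes a small part of the HTML, like this: <div data-blah="blah" id="UpdatePanel4">? In that case you would have to change your regexp as well.
Second problem: I assume you want the innerHTML of the div, right? In that case, the regexp approach does not care about nesting or tree structure. The string you get runs from the string you specified up to the first </div> it encounters.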
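
Both problems can be reproduced with the regexp from the question. The markup strings below are made-up examples, not taken from the real page:

```php
<?php
$regexp = '/<div id="UpdatePanel4">(.*?)<\/div>/i';

// Hypothetical markup: the target div contains a nested div.
$nested = '<div id="UpdatePanel4"><div>inner</div> tail</div>';
preg_match($regexp, $nested, $m);
$captured = $m[1];
// $captured is '<div>inner': the match stops at the first </div> and loses " tail"

// Hypothetical markup: the same div with one extra attribute in front of the id.
$changed = '<div data-blah="blah" id="UpdatePanel4">content</div>';
$found = preg_match($regexp, $changed);
// $found is 0: the literal '<div id=' no longer appears, so nothing matches
```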

Solution:

Parsing HTML with regexps is always a bad idea. Use DOMDocument instead.
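
As a rough illustration of why the DOM approach copes with both problems, here is a sketch that uses DOMXPath on made-up markup containing an extra attribute and a nested div. The XPath expression and the sample string are assumptions chosen for the example; the real page may need a different query.

```php
<?php
// Hypothetical markup: extra attribute plus a nested div.
$html = '<html><body><div data-blah="blah" id="UpdatePanel4"><div>inner</div> tail</div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the div by its id attribute, regardless of any other attributes it carries.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="UpdatePanel4"]');

$panel = $nodes->item(0); // null when nothing matched
$text = ($panel !== null) ? $panel->textContent : null;
// $text is 'inner tail': the nested div is part of the result
```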
