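I'm trying to parse HTML using cURL together with DOMDocument or XPath, but with CURLOPT_RETURNTRANSFER set the URL's HTML always comes back as a string, which looks like invalid HTML to parse.
Returned output: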
string(102736) "<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">
<head>
<title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"
PHP snippet used to view the output:
$cc = $http->get($url);
var_dump($cc);
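cURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php
When I remove CURLOPT_RETURNTRANSFER, I see the HTML without the string(102736) prefix, but the page is echoed even though I never asked for output (see: curl_exec printing results when I don't want to).
This is the PHP code I use to parse the HTML: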
$cc = $http->get($url);
$doc = new \DOMDocument();
$doc->loadHTML($cc);
// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}
Any ideas?
Answer 0 (score: 0)
Check the return value:
print_r($cc);
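You will probably find that the output is an array (if the code ran successfully). Looking at the library source, the return value of get() is...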
return [
    'header' => $headers,
    'body' => substr($response, $size),
];
So you need to change the line that loads the HTML to...
$doc->loadHTML($cc['body']);
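If it is unclear which shape get() returns in the version of the library you have, a small guard can normalise the value before parsing. The following is only a sketch: extractBody() is a hypothetical helper (not part of the library), and the $cc array is a stand-in that assumes the 'header'/'body' shape shown above.

```php
<?php
// Hypothetical helper (not part of the HTTP library): accepts either a raw
// HTML string or the ['header' => ..., 'body' => ...] array shown above.
function extractBody($response): string
{
    if (is_array($response) && isset($response['body'])) {
        return (string) $response['body'];
    }
    if (is_string($response)) {
        return $response;
    }
    throw new InvalidArgumentException('Unexpected response type: ' . gettype($response));
}

// Stand-in for the value returned by $http->get($url).
$cc = ['header' => [], 'body' => '<html><body><a href="/x">X</a></body></html>'];

$doc = new DOMDocument();
$doc->loadHTML(extractBody($cc));

// Number of <a> elements found in the parsed body.
echo $doc->getElementsByTagName('a')->length, "\n";
```

With a guard like this, passing the whole array to loadHTML() by mistake fails early with a clear message instead of a confusing parser error.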
Update:
As an example of the above, using this question as the page to work with...
$cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($cc['body']);
// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}
print_r($links);
Output...
Array
(
[0] => Array
(
[href] => #
[text] =>
)
[1] => Array
(
[href] => https://stackoverflow.com
[text] => Stack Overflow
)
[2] => Array
(
[href] => #
[text] =>
)
[3] => Array
(
[href] => https://stackexchange.com/users/?tab=inbox
...
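Since the question also mentions XPath, the same extraction can be sketched with DOMXPath, which makes it easy to skip the placeholder "#" links seen in the output above. This is an illustration under assumptions: $html is a small stand-in for $cc['body'], and the filter expression is just one possible choice.

```php
<?php
// Stand-in for $cc['body']; HTML5 tags such as <nav> may trigger libxml warnings.
$html = '<!DOCTYPE html><html><body><nav>'
      . '<a href="https://stackoverflow.com">Stack Overflow</a>'
      . '<a href="#"> </a>'
      . '</nav></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of emitting them
$doc = new DOMDocument();
$doc->loadHTML($html);
$warnings = libxml_get_errors();    // array of LibXMLError objects, may be empty
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Select only anchors that have an href attribute other than "#".
foreach ($xpath->query('//a[@href and @href != "#"]') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim(preg_replace('/\s+/', ' ', $a->textContent)),
    ];
}
print_r($links);
```

Inspecting $warnings is optional; it is kept here only to show where the messages suppressed by libxml_use_internal_errors(true) end up.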