CURLOPT_RETURNTRANSFER returns HTML in a string

Date: 2018-07-13 07:06:07

Tags: php html curl

I am trying to parse HTML fetched with cURL using DOMDocument or XPath, but CURLOPT_RETURNTRANSFER always returns the URL's HTML as a string, which makes it invalid HTML to parse.

Returned output:

string(102736) "<!DOCTYPE html>


    <html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">

    <head>

        <title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
        <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
        <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"

PHP snippet used to view the output:

$cc = $http->get($url);
var_dump($cc);

cURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php

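When I remove CURLOPT_RETURNTRANSFER, I see the HTML without the string(102736) prefix, but the page is echoed even though I never asked for it to be printed (reference: curl_exec printing results when I don't want to).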

This is the PHP code I use to parse the HTML:

  $cc = $http->get($url);
  $doc = new \DOMDocument();
  $doc->loadHTML($cc);

  // all links in document
  $links = [];
  $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
  foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }

Any ideas?

1 answer:

Answer 0: (score: 0)

Check the return value:

print_r($cc);

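You will probably find that the output is an array (if the code ran successfully). Looking at the library source, the return value of get() is...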

return [
    'header' => $headers,
    'body'   => substr($response, $size),
];
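To illustrate what that `substr($response, $size)` call is doing, here is a minimal sketch of splitting a raw response into header and body by header size. The helper name `splitResponse` and the simulated response string are assumptions made for illustration, not part of the library; with real cURL and CURLOPT_HEADER enabled, the size would typically come from `curl_getinfo($ch, CURLINFO_HEADER_SIZE)`.

```php
<?php
// Hypothetical helper (not part of the library): split a raw HTTP response
// into header and body, given the header size in bytes.
function splitResponse(string $response, int $headerSize): array
{
    return [
        'header' => substr($response, 0, $headerSize),
        'body'   => substr($response, $headerSize),
    ];
}

// Simulated raw response, so that no network request is needed here.
$headerText = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$raw = $headerText . '<html><body><a href="/x">X</a></body></html>';

$parts = splitResponse($raw, strlen($headerText));
echo $parts['body'], PHP_EOL;
```

Only the `body` element is HTML; the `header` element has to be kept away from the parser.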

So you need to change the line that loads the HTML to...

$doc->loadHTML($cc['body']);
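If it is unclear whether get() hands back a plain string or the array shape shown above, a small guard can normalize the value before parsing. This is only a sketch: `extractHtml` is a hypothetical helper, and the `$cc` array below is a hand-written stand-in for the result of `$http->get($url)`.

```php
<?php
// Hypothetical helper: accept either a plain HTML string or an array with a
// 'body' key and return the HTML as a string.
function extractHtml($result): string
{
    if (is_array($result) && isset($result['body'])) {
        return (string) $result['body'];
    }
    if (is_string($result)) {
        return $result;
    }
    throw new InvalidArgumentException('Unexpected return value from get()');
}

// Stand-in for $http->get($url); no network request is made here.
$cc = [
    'header' => ['Content-Type' => 'text/html'],
    'body'   => '<html><body><a href="https://example.com">Example</a></body></html>',
];

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML(extractHtml($cc));
libxml_clear_errors();

$firstLink = $doc->getElementsByTagName('a')->item(0);
echo $firstLink->getAttribute('href'), PHP_EOL;
```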

Update

As an example of the above, using this question as the page to work with...

$cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($cc['body']);

// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}

print_r($links);

Output...

Array
(
    [0] => Array
        (
            [href] => #
            [text] => 
        )

    [1] => Array
        (
            [href] => https://stackoverflow.com
            [text] => Stack Overflow
        )

    [2] => Array
        (
            [href] => #
            [text] => 
        )

    [3] => Array
        (
            [href] => https://stackexchange.com/users/?tab=inbox
...
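
Since the question also mentions XPath, the same kind of link list could be gathered with DOMXPath. The sketch below is an alternative to the getElementsByTagName loop, not the answer's original method, and it runs on a small hand-written sample string (an assumption standing in for `$cc['body']`). The query `//a[@href]` skips anchors that have no href attribute.

```php
<?php
// Sample markup standing in for $cc['body']; no request is made.
$html = '<html><body>'
      . '<a href="#"></a>'
      . '<a href="https://stackoverflow.com">Stack  Overflow</a>'
      . '<a name="anchor-only">no href</a>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Select only <a> elements that actually carry an href attribute.
foreach ($xpath->query('//a[@href]') as $item) {
    $links[] = [
        'href' => $item->getAttribute('href'),
        // Collapse any run of whitespace into a single space.
        'text' => trim(preg_replace('/\s+/', ' ', $item->nodeValue)),
    ];
}

print_r($links);
```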