Question

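I am trying to parse HTML using cURL together with DOMDocument or XPath, but with CURLOPT_RETURNTRANSFER set, the HTML of the URL always comes back as a string, which seems to make it invalid HTML for parsing.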

Returned output:

string(102736) "<!DOCTYPE html>


    <html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">

    <head>

        <title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
        <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
        <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"

PHP snippet used to view the output:

$cc = $http->get($url);
var_dump($cc);

cURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php

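When I remove CURLOPT_RETURNTRANSFER, I see the HTML without the string(102736) wrapper, but then the page content is echoed even though I never asked for it to be printed (reference: curl_exec printing results when I don't want to).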

This is the PHP code I use to parse the HTML:

  $cc = $http->get($url);
  $doc = new \DOMDocument();
  $doc->loadHTML($cc);

  // all links in document
  $links = [];
  $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
  foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }

Any ideas?

Answer 1

Check the return value:

print_r($cc);

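You will probably find that the output is an array (if the code ran successfully). Looking at the library source, the return value of get() is...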

return [
    'header' => $headers,
    'body'   => substr($response, $size),
];

So you need to change the line that loads the HTML to...

$doc->loadHTML($cc['body']);
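
As a minimal sketch of the point above: the string(102736) "..." wrapper in the question is only how var_dump() displays a value (type and length), so the HTML itself is not altered. What matters is passing the body, not the whole response, to loadHTML(). The helper name extract_body below is hypothetical and not part of the library; it assumes get() returns either a raw string or an array with a 'body' key, as in the source quoted above.

```php
<?php
// Hypothetical helper (not part of the HTTP library): accept either a raw
// HTML string or the ['header' => ..., 'body' => ...] array that get()
// appears to return, and hand back only the markup.
function extract_body($response): string
{
    if (is_array($response) && isset($response['body'])) {
        return (string) $response['body'];
    }
    if (is_string($response)) {
        return $response;
    }
    throw new InvalidArgumentException('Unexpected response type: ' . gettype($response));
}

// Stand-in for the value returned by $http->get($url).
$html = extract_body(['header' => [], 'body' => '<p>Hello</p>']);
echo $html, PHP_EOL; // echo prints only the markup, without any string(N) "..." wrapper
```

With something like this in place, $doc->loadHTML(extract_body($cc)) would work whichever of the two shapes the response has.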

Update:

As an example of the above, using this very question as the page to work with...

$cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($cc['body']);

// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}

print_r($links);

Output...

Array
(
    [0] => Array
        (
            [href] => #
            [text] => 
        )

    [1] => Array
        (
            [href] => https://stackoverflow.com
            [text] => Stack Overflow
        )

    [2] => Array
        (
            [href] => #
            [text] => 
        )

    [3] => Array
        (
            [href] => https://stackexchange.com/users/?tab=inbox
...
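
Since the question also mentions XPath, here is a rough sketch of the same link extraction done with DOMXPath instead of getElementsByTagName(). The inline $html string is a made-up stand-in for $cc['body'], and the query '//a[@href]' is just one possible choice (it skips anchors that have no href attribute).

```php
<?php
// Stand-in markup; in the code above this would be $cc['body'].
$html = '<html><body><a href="https://stackoverflow.com">Stack Overflow</a>'
      . '<a href="#"></a><a>no href</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Select only the anchors that actually carry an href attribute.
foreach ($xpath->query('//a[@href]') as $item) {
    $links[] = [
        'href' => $item->getAttribute('href'),
        'text' => trim(preg_replace('/\s+/', ' ', $item->textContent)),
    ];
}

print_r($links);
```

The resulting $links array has the same href/text shape as in the loop above, so the rest of the code would not need to change.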
