Question

simple_html_dom does not fetch data from some websites. For www.google.pl it downloads the page source, but for others, such as gearbest.com or stooq.pl, it does not download any data.

require('simple_html_dom.php');

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.google.com/"); // works

/*
curl_setopt($ch, CURLOPT_URL, "https://www.gearbest.com/"); // doesn't work
curl_setopt($ch, CURLOPT_URL, "https://stooq.pl/"); // doesn't work
*/

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
curl_close($ch);

$html = new simple_html_dom();
$html->load($response);

echo $html;

What should I change in the code so that it receives data from these websites?

Answer 1

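The fundamental problem here (at least on my computer; it may depend on your version...) is that the site returns gzip-compressed data, and neither PHP nor curl decompresses it properly before it is passed to the DOM parser. If you are using PHP 5.4, you can use gzdecode and file_get_contents to decompress it yourself.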
<?php
    // download the site
    $data = file_get_contents("http://www.tsetmc.com/loader.aspx?ParTree=151311&i=49776615757150035");
    // decompress it (a bit hacky: strips the 10-byte gzip header and 8-byte trailer;
    // on PHP 5.4+, gzdecode($data) handles the gzip wrapper for you)
    $data = gzinflate(substr($data, 10, -8));
    include("simple_html_dom.php");
    // parse and use
    $html = str_get_html($data);
    echo $html->root->innertext();
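Note that this hack will not work on most websites. The main reason, it seems to me, is that curl does not announce that it accepts gzip data... but the web server on that domain pays no attention to that header and gzips the response anyway. Then neither curl nor PHP actually checks the Content-Encoding header on the response; both assume the body is not compressed, so it passes through without an error and without a call to gunzip. There are bugs in both the server and the client here!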
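If the missing Accept-Encoding announcement is indeed the cause, one option worth trying is to let curl negotiate and decode compression itself. The sketch below assumes the curl extension was built with zlib support; the function name `fetch_with_auto_decoding` is made up for illustration, while `CURLOPT_ENCODING` is a standard curl option. It may not help if a site rejects the request for other reasons.

```php
<?php
// Hypothetical helper name; the curl options used here are standard.
function fetch_with_auto_decoding($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // An empty string asks curl to send an Accept-Encoding header listing
    // every encoding it supports and to decode the response body itself.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    $response = curl_exec($ch);
    curl_close($ch);
    return $response; // decoded body as a string, or false on failure
}
```

The returned string can then be handed to `str_get_html()` or `$html->load()` in the same way as the raw response in the question.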

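For a more robust solution, perhaps you could use curl to fetch the headers and inspect them yourself to determine whether the body needs to be decompressed. Or you could just use this hack for this one site and the usual method for the others, to keep things simple.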
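A minimal sketch of the header-inspection idea, assuming the zlib extension is available. The helper names `decode_body` and `fetch_decoded` are hypothetical; the curl and zlib functions they call are standard. The body is decompressed only when the response's Content-Encoding header says it is compressed.

```php
<?php
// Hypothetical helper: decode a body according to its Content-Encoding value.
function decode_body($body, $contentEncoding)
{
    $encoding = strtolower(trim((string) $contentEncoding));
    if ($encoding === 'gzip' || $encoding === 'x-gzip') {
        $decoded = @gzdecode($body);
    } elseif ($encoding === 'deflate') {
        // "deflate" is normally zlib-wrapped, but some servers send raw deflate.
        $decoded = @gzuncompress($body);
        if ($decoded === false) {
            $decoded = @gzinflate($body);
        }
    } else {
        return $body; // no (or unknown) encoding: leave the body untouched
    }
    return $decoded === false ? $body : $decoded;
}

// Hypothetical helper: fetch a URL and decode it based on the response headers.
function fetch_decoded($url)
{
    $contentEncoding = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $line) use (&$contentEncoding) {
        if (strncmp($line, 'HTTP/', 5) === 0) {
            $contentEncoding = ''; // new response (e.g. after a redirect)
        } elseif (stripos($line, 'Content-Encoding:') === 0) {
            $contentEncoding = trim(substr($line, strlen('Content-Encoding:')));
        }
        return strlen($line); // curl expects the number of bytes handled
    });
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? false : decode_body($body, $contentEncoding);
}
```

Usage would look like `$html = str_get_html(fetch_decoded("https://stooq.pl/"));`, with a check for a `false` return before parsing.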

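It might still help to set the character encoding on your output. Add this before you echo anything, to make sure the data you read is not interpreted with the wrong character set in the user's browser and displayed as garbage: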
header('Content-Type: text/html; charset=utf-8');
