Unable to use loadHTMLFile or file_get_contents for external URLs

Date: 2018-07-04 07:07:04

Tags: php dom xpath file-get-contents

I want to find Groupon's active deals, so I wrote a scraper like this:

libxml_use_internal_errors(true);

$dom = new DOMDocument();
@$dom->loadHTMLFile('https://www.groupon.com/browse/new-york?category=food-and-drink&minPrice=1&maxPrice=999');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='slot']//a/@href");
foreach($entries as $e) {
  echo $e->textContent . '<br />';
}

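But when I run this, the browser just keeps loading. It seems to be loading something, but no error is displayed.

How can I fix this? It is not just Groupon - I also tried other websites, and they don't work either. Why?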

1 answer:

Answer 0: (score: 0)

How about using cURL to load the page data?

You wrote: "Not just case with Groupon - I also try other websites but also don't work".

I think this code will help, but you should expect to handle site-specific surprises for each website you want to scrape.

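<?php

libxml_use_internal_errors(true); // collect HTML parse warnings instead of printing them

$dom = new DOMDocument();
$data = get_url_content('https://www.groupon.com', true);
if ($data === false || $data === '') {
    die('Could not fetch the page');
}
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//label");

foreach ($entries as $e) {
    echo $e->textContent . '<br />';
}

/**
 * Fetch a URL with cURL. Returns only the body when $justBody is true,
 * otherwise headers plus body. Returns false if the request fails.
 */
function get_url_content($url = null, $justBody = true)
{
    /* Init CURL */
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // fail instead of hanging forever
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    // Forward the visitor's user agent; fall back to a generic one (e.g. on the CLI)
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT'] ?? 'Mozilla/5.0');
    $data = curl_exec($ch);
    if ($data === false) {
        curl_close($ch);
        return false;
    }
    if ($justBody) {
        // The header size covers every header block, including those of redirects
        $data = substr($data, curl_getinfo($ch, CURLINFO_HEADER_SIZE));
    }
    curl_close($ch);
    return $data;
}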
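
Once get_url_content() returns HTML, the XPath from the question can be applied to it to pull out the deal links. Below is a minimal sketch of that step, not a definitive implementation. The helper name extract_slot_links() and the sample markup are hypothetical, and it assumes the deals really sit inside li elements with a "slot" class, as the question's query implies; the actual Groupon markup may differ. It matches the class as a token, so an element with several classes still qualifies, and it assumes the class name is a plain token without quotes.

```php
<?php

/**
 * Hypothetical helper (not part of the original answer): returns the href of
 * every link nested inside an <li> that carries the given class.
 */
function extract_slot_links(string $html, string $class = 'slot'): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    // Silence warnings from messy real-world HTML, then restore the old setting
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Token match, so class="slot featured" also qualifies
    $query = sprintf(
        "//li[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]//a/@href",
        $class
    );

    $links = [];
    foreach ($xpath->query($query) as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}

// Offline sample so the sketch can be tried without a network request
$sample = '<ul><li class="slot featured"><a href="/deals/a">A</a></li>'
        . '<li class="other"><a href="/deals/b">B</a></li></ul>';
print_r(extract_slot_links($sample));
```

With the sample markup, only /deals/a should be listed. In real use you would pass in the string returned by get_url_content() instead of $sample. If the result is empty for a live page, the listing may be rendered by JavaScript after the page loads, in which case it is not present in the fetched HTML at all.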