Parsing a Wikipedia URL - failed to open stream: HTTP request failed

Time: 2018-06-25 18:31:55

Tags: php

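I am doing the following on a simple PHP page:

  1. Get a search string from the URL query string (e.g. police officer)
  2. Append the search string to the Wikipedia search URL (`https://en.wikipedia.org/w/index.php?search=police+officer`)
  3. Use curl to get the final redirected URL for that search string
  4. Check whether the redirected URL contains index.php?search; if it does, do nothing
  5. Otherwise, explode the redirected URL and take the last value from it (Police_officer)
  6. Append that value to the Wikipedia URL that returns the JSON data for that wiki entry (https://en.wikipedia.org/api/rest_v1/page/summary/Police_officer)
  7. Use file_get_contents() to read the JSON data and pull values out of it, e.g. title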
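
Steps 4 to 6 can be sketched as one small function with no network access. This is a minimal sketch rather than the code from the post: the name `summary_url_from_redirect` is hypothetical, and it assumes the article title is the last path segment of the redirected URL.

```php
<?php
// Hypothetical helper: maps a redirected Wikipedia URL to the REST summary URL,
// or returns null when the redirect is still a search-results page.
function summary_url_from_redirect(string $redirectedUrl): ?string
{
    // strpos() returns false when the needle is absent, so compare strictly.
    if (strpos($redirectedUrl, 'index.php?search') !== false) {
        return null;
    }

    // Work on the path only, so a query string or fragment cannot leak into the title.
    $path = parse_url($redirectedUrl, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return null;
    }

    // Take the last path segment, e.g. "Police_officer".
    $segments = explode('/', rtrim($path, '/'));
    $title = end($segments);
    if ($title === false || $title === '') {
        return null;
    }

    // Decode, then re-encode, so the title is percent-encoded exactly once.
    $title = rawurlencode(rawurldecode($title));

    return 'https://en.wikipedia.org/api/rest_v1/page/summary/' . $title;
}
```

Like the approach described in the steps, this takes only the last segment, so a title that itself contains a slash would be truncated; handling that case would need a different split.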

For some reason, on this line of code:

$json = file_get_contents($url_json);

where $url_json is:

https://en.wikipedia.org/api/rest_v1/page/summary/Santa_claus

I get this error:

Warning: file_get_contents(https://en.wikipedia.org/api/rest_v1/page/summary/Santa_claus): failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in C:\xampp\public_html\test.php on line 49

But I can open that URL in a browser and see the same kind of data as for this URL:

https://en.wikipedia.org/api/rest_v1/page/summary/Police_officer

For that one, file_get_contents returns the data just fine.

I used the following code:

function get_http_response_code($url) {
    $headers = get_headers($url);
    return substr($headers[0], 9, 3);
}

to confirm that the response code for both pages is 200.
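
One caveat with this kind of check: get_headers() follows redirects by default and returns the header lines of every response in the chain, so `$headers[0]` is the status line of the first response, which is not necessarily the final one. Below is a minimal sketch of reading the last status line instead. The name `last_status_code` is hypothetical and the sample header array is made up for illustration; nothing here is confirmed to be the cause of the 404 in the post.

```php
<?php
// Hypothetical helper: returns the status code of the LAST response found in a
// header list such as the one returned by get_headers(), or null if there is none.
function last_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 200 OK" or "HTTP/2 200".
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1]; // later status lines overwrite earlier ones
        }
    }
    return $code;
}

// Made-up sample: a redirect followed by the final response.
$sample = [
    'HTTP/1.1 302 Found',
    'Location: https://example.org/final',
    'HTTP/1.1 200 OK',
    'Content-Type: application/json',
];
echo last_status_code($sample), "\n"; // prints 200
```

With a real URL, get_headers() can return false on failure, so its result should be checked before being passed to a function that expects an array.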

Here is my basic test code:

$var = $_GET['var'];
$var = str_replace(" ", "+", $var);

$url1 = "https://en.wikipedia.org/w/index.php?search=$var";

echo "<hr /> url1: $url1 <hr />";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch);
$redirected_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

echo "<hr /> url2: $redirected_url <hr />";

$url_search = strpos($redirected_url, "index.php?search");

echo "<hr /> url_search: $url_search <hr />";

function get_http_response_code($url) {
    $headers = get_headers($url);
    return substr($headers[0], 9, 3);
}

$url_response = get_http_response_code($redirected_url);

echo "<hr /> url_response: $url_response <hr />";

if ($url_search > 0) {

    // do nothing

} else {

    $tmp = explode('/', $redirected_url);
    $end = end($tmp);

    $url_json = "https://en.wikipedia.org/api/rest_v1/page/summary/$end";

    echo "<hr /> url_json: $url_json <hr />";

    $json = file_get_contents($url_json);

    if ($json) {

        $data = json_decode($json, TRUE);

        if ($data) {
            $wiki_page = $data['content_urls']['desktop']['page'];
            echo "<hr /> wiki_page: $wiki_page <hr />";
        }

    }

}

What am I missing?

1 answer:

Answer 0: (score: 0)

I fixed it by using curl instead of file_get_contents:

$var = $_GET['var'];
$var = str_replace(" ", "+", $var);

$url1 = "https://en.wikipedia.org/w/index.php?search=$var";

echo "<hr /> url1: $url1 <hr />";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch);
$redirected_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

echo "<hr /> url2: $redirected_url <hr />";

$url_search = strpos($redirected_url, "index.php?search");

echo "<hr /> url_search: $url_search <hr />";

function get_http_response_code($url) {
    $headers = get_headers($url);
    return substr($headers[0], 9, 3);
}

function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
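    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    // 2 is the documented value for full host-name verification (0 disables it).
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    // Note: FALSE turns off certificate verification; acceptable only for local testing.
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);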
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$url_response = get_http_response_code($redirected_url);

echo "<hr /> url_response: $url_response <hr />";

if ($url_search > 0) {

    // do nothing

} else {

    $tmp = explode('/', $redirected_url);
    $end = end($tmp);

    $url_json = "https://en.wikipedia.org/api/rest_v1/page/summary/$end";

    echo "<hr /> url_json: $url_json <hr />";

    //$json = file_get_contents($url_json);

    $json = file_get_contents_curl($url_json);

    echo "<hr /> json: $json <hr />";

    if ($json) {

        $data = json_decode($json, TRUE);

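        // $data is an array, so dump it with print_r() instead of interpolating it.
        echo "<hr /> data: " . print_r($data, TRUE) . " <hr />";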

        if ($data) {
            $wiki_page = $data['content_urls']['desktop']['page'];
            echo "<hr /> wiki_page: $wiki_page <hr />";
        }

    }

}
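
If file_get_contents() is preferred over curl, one way to investigate a failure like this is to pass a stream context that sets a User-Agent and enables `ignore_errors`, so that the response body is returned even for a 4xx status and the final status line can be inspected. This is a sketch under assumptions, not a confirmed explanation of the 404: the function names and the User-Agent string are hypothetical, and the post does not establish whether request headers were the cause.

```php
<?php
// Hypothetical helper: builds HTTP stream-context options for file_get_contents().
function wiki_context_options(string $userAgent): array
{
    return [
        'http' => [
            'method'          => 'GET',
            'header'          => "Accept: application/json\r\n",
            'user_agent'      => $userAgent,
            'follow_location' => 1,    // follow redirects (this is the default)
            'ignore_errors'   => true, // return the body even for 4xx/5xx responses
            'timeout'         => 10,   // seconds
        ],
    ];
}

// Hypothetical wrapper: returns [final status line or null, body or false].
function fetch_with_status(string $url, string $userAgent): array
{
    $context = stream_context_create(wiki_context_options($userAgent));
    $body = @file_get_contents($url, false, $context);

    // The HTTP wrapper fills $http_response_header in the local scope.
    $statusLine = null;
    foreach ($http_response_header ?? [] as $line) {
        if (strncmp($line, 'HTTP/', 5) === 0) {
            $statusLine = $line; // keep the last status line seen
        }
    }
    return [$statusLine, $body];
}
```

A call such as `fetch_with_status($url_json, 'MyTestScript/0.1')` (an illustrative User-Agent value) would then return both the status line and whatever body the server sent, which makes it easier to see why a request failed than the warning alone.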