Question

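I have retrieved the contents of this web page, http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369, and saved them to $webpage.

Please note:

This web page contains many <meta> tags. One of them is the culprit and causes the problem: <meta property="og:description" content="" />. Note that the value of content is an empty string.

I am reading the page content as follows: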

<?php

$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';

$webpage = file_get_contents($url);

$og_entry_title = "";
$og_entry_content = "";

$doc = new DOMDocument;
$doc->loadHTML($webpage);

$meta_tags = $doc->getElementsByTagName('meta');

foreach ($meta_tags as $meta_tag) {

    if ($meta_tag->getAttribute('property') == 'og:title') {
        $og_entry_title = $meta_tag->getAttribute('content');
    }

    if ($meta_tag->getAttribute('property') == 'og:description') {
        $og_entry_content = $meta_tag->getAttribute('content');
    }

}

// print the results
echo
'$og_entry_title: ' . $og_entry_title
.PHP_EOL.
'$og_entry_content: ' . $og_entry_content;

When this finishes, the values of $og_entry_title and $og_entry_content are:

$og_entry_title: TOP STORIES | DW.COM
$og_entry_content: News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment.

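Please note the following about the results:

$og_entry_title is correct and contains the page title, so no problem here

$og_entry_content gives a different value than I expect. I expected an empty string to be saved in $og_entry_content; however, the string "News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment." is saved instead. This string appears to be a fallback (or default) value that is returned when the meta tag contains an empty string.

After further investigation, I found that og:description is getting its meta tag value from the http://www.dw.com page. This seems to happen because my page contains an empty string, so the returned value is retrieved from the site's root page.

I have the following questions about $og_entry_content:

How can I make sure that an empty string (not the fallback value) is saved to $og_entry_content?
Why is this fallback value being returned from the root page?

Thanks.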

Answer 1

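Answer

Your URL contains special characters and needs to be URL encoded.

Explanation

First, the assumption...

$og_entry_title is correct and contains the page title, so no problem here

...is wrong.

This title: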

<meta property="og:title" content="تقرير استخباري اميركي: القاعدة تسيطر على غرب العراق | أخبار | DW.COM | 28.11.2006" />

is not the same as this title:

<meta property="og:title" content="TOP STORIES | DW.COM" />

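Second, most modern browsers are nice enough to URL-encode on the fly while still displaying the special characters in the address bar.

You can see the response headers from the web server to get more information.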

<?php
$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "$url");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch);

// Then, after your curl_exec call:
$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
echo '
header
------
'.substr($response, 0, $header_size);

The result shows that the server does not recognize the URL as belonging to that page:

header
------
HTTP/1.1 301 Moved Permanently
Server: Apache-Coyote/1.1
Location: /
Content-Length: 0
Accept-Ranges: bytes
X-Varnish: 99639238
Date: Thu, 16 Jun 2016 15:42:51 GMT
Connection: keep-alive

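HTTP response code 301 is a notice of a (permanent) redirect to another page. Location: / means you should go to the home page. This is a common sloppy practice: when a site does not know what to do with you, it just sends you to the home page.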
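To make the reading of that header block concrete, here is a minimal sketch that pulls the status code and the Location value out of a raw header string like the one curl returned. The helper name summarize_redirect is hypothetical, and the sample header is a shortened copy of the response shown above.

```php
<?php
// Hypothetical helper: extract the status code and Location from a raw header block.
function summarize_redirect($rawHeader) {
    // Split on any line-ending style; trim() drops the blank line that ends the headers.
    $lines = preg_split('/\r\n|\n|\r/', trim($rawHeader));

    // The first line is the status line, e.g. "HTTP/1.1 301 Moved Permanently".
    $status = null;
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $lines[0], $m)) {
        $status = (int) $m[1];
    }

    // Header names are case-insensitive, so compare with stripos().
    $location = null;
    foreach (array_slice($lines, 1) as $line) {
        if (stripos($line, 'Location:') === 0) {
            $location = trim(substr($line, strlen('Location:')));
        }
    }

    return [
        'status'      => $status,
        'location'    => $location,
        'is_redirect' => $status !== null && $status >= 300 && $status < 400 && $location !== null,
    ];
}

$header = "HTTP/1.1 301 Moved Permanently\r\nServer: Apache-Coyote/1.1\r\nLocation: /\r\nContent-Length: 0\r\n\r\n";
$info = summarize_redirect($header);
echo $info['status'] . ' -> ' . $info['location'] . PHP_EOL; // prints "301 -> /"
```

A redirect whose target is just / is a strong hint that the server did not recognize the requested path.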

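By default, curl does not follow redirects, which is how we were able to inspect the 301 response header. file_get_contents, however, does follow redirects, which is why you get different content than you expect. (There may be exceptions: there is a bug report in which some people noted that it does not always follow redirects.)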
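If you would rather have file_get_contents stop at the redirect so you can detect it, one option is an HTTP stream context with follow_location disabled. This is a sketch, not part of the original fix: the helper names no_redirect_context and fetch_without_redirects are hypothetical, and it assumes the http:// stream wrapper is enabled (allow_url_fopen).

```php
<?php
// Hypothetical helper: an HTTP stream context that does not follow redirects.
function no_redirect_context() {
    return stream_context_create([
        'http' => [
            'follow_location' => 0,    // do not follow Location headers
            'ignore_errors'   => true, // still return the body on error status codes
        ],
    ]);
}

// Hypothetical helper: fetch a URL and return the status line together with the body.
function fetch_without_redirects($url) {
    $body = @file_get_contents($url, false, no_redirect_context());
    // The HTTP wrapper fills $http_response_header in the calling scope.
    // (Newer PHP versions also offer http_get_last_response_headers().)
    $headers = isset($http_response_header) ? $http_response_header : [];
    return [
        'status_line' => isset($headers[0]) ? $headers[0] : null,
        'body'        => $body,
    ];
}
```

Calling fetch_without_redirects($url) and checking whether status_line contains 301 or 302 lets the script notice the redirect itself, rather than silently parsing the home page.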

Note that the home page does have an og:description with a non-empty content:

<?php
echo file_get_contents('http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369');

This outputs:

...

<meta property="og:description" content="News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment. " />

...

<meta property="og:title" content="TOP STORIES | DW.COM" />

...

Solution

The first thing you need to do is rawurlencode the URL:

$url = rawurlencode($url);

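Then realize that rawurlencode is poorly named, because a valid URL will contain the scheme http:// or https://, and may also contain slashes to separate its parts. This is a problem because rawurlencode converts the colon : to %3A and the slash / to %2F, which produces an invalid URL such as http%3A%2F%2Fwww.dw.com%2Far%2F.... It should have been named rawurlencode_parts_of_URL, but they didn't ask me :) and, in their defense, to quote Phil Karlton:

There are only two hard things in Computer Science: cache invalidation and naming things.

So convert the slashes and colons back to their original form: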

$url = str_replace('%3A',':',str_replace('%2F','/',$url));
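As an alternative to encoding everything and then restoring the colons and slashes, the same idea can be sketched as a helper that percent-encodes only the bytes outside printable ASCII and leaves the rest of the URL alone. The function name encode_non_ascii and the example.com URL are assumptions made for illustration; this is not a full URL normalizer.

```php
<?php
// Hypothetical helper: percent-encode every byte outside printable ASCII (0x21-0x7E).
// The pattern has no /u flag, so it matches one byte at a time, and rawurlencode()
// turns each matched byte into its %XX form.
function encode_non_ascii($url) {
    return preg_replace_callback(
        '/[^\x21-\x7E]/',
        function ($match) {
            return rawurlencode($match[0]);
        },
        $url
    );
}

// "\xD8\xAA\xD9\x82" is the UTF-8 byte sequence for the first two letters of the Arabic slug.
echo encode_non_ascii("http://example.com/ar/\xD8\xAA\xD9\x82") . PHP_EOL;
// prints http://example.com/ar/%D8%AA%D9%82
```

Because % is itself printable ASCII, a URL that is already percent-encoded passes through unchanged, so the helper does not double-encode. Reserved characters such as ?, & and # are also left as they are, which may or may not be what you want for a given URL.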

Finally, the last thing you need to do is send a header to your clients to let them know what kind of character encoding to expect.

header("content-type: text/html; charset=utf-8");

Otherwise, your clients may be reading some gobbledygook that looks like this:

ØªÙ，Ø±ÙŠØ±Ø§Ø³ØªØ®Ø¨Ø§Ø±ÙŠØ§Ù...ÙŠØ±ÙƒÙÙ：Ø§Ù“Ù，Ø§ØØØ¯Ø©ØªØ³ÙŠØ·Ø±Ø¹Ù”Ù‰ØºØ± Ø¨Ø§Ù“Ø¹Ø±Ø§Ù

Final product

<?php

// let's see error output on screen while in development
// remove these lines for production, and use log files only
error_reporting(-1);
ini_set('display_errors', 'On');

$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';

// URL encode special chars
$url = rawurlencode($url);

// fix colons and slashes for valid URL
$url = str_replace('%3A',':',str_replace('%2F','/',$url));

// make request
$webpage = file_get_contents($url);

$og_entry_title = "";
$og_entry_content = "";

$doc = new DOMDocument;
$doc->loadHTML($webpage);

$meta_tags = $doc->getElementsByTagName('meta');

foreach ($meta_tags as $meta_tag) {

    if ($meta_tag->getAttribute('property') == 'og:title') {
        $og_entry_title = $meta_tag->getAttribute('content');
    }

    if ($meta_tag->getAttribute('property') == 'og:description') {
        $og_entry_content = $meta_tag->getAttribute('content');
    }

}

// set the character set for the client
header("content-type: text/html; charset=utf-8");

// print the results
echo
'$og_entry_title: ' . $og_entry_title
.PHP_EOL.
'$og_entry_content: ' . $og_entry_content;

This outputs:

$og_entry_title: تقرير استخباري اميركي: القاعدة تسيطر على غرب العراق | أخبار | DW.COM | 28.11.2006
$og_entry_content:

Appendix

If you are looking at your error logs and see a stream of warnings:

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 4 in ...
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 5 in ...
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 6 in ...
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 7 in ...
Warning: DOMDocument::loadHTML(): ID topMetaInner already defined in Entity, line: 300 in ...
Warning: DOMDocument::loadHTML(): ID langSelectTrigger already defined in Entity, line: 315 in ...
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 546 in ...
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 546 in ...
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 548 in ...
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 548 in ...

This is because you are trying to use the DOMDocument class with invalid HTML and not well-formed XML documents. But that is a topic for a different question.
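For completeness, one common way to keep those parser warnings out of the output is to switch libxml to internal error handling while the document loads. The sketch below is an assumption about how you might structure it: og_properties is a hypothetical helper, and the sample HTML is made up to include an empty og:description and a bare & that would normally trigger a warning.

```php
<?php
// Hypothetical helper: collect all og:* meta properties from an HTML string
// without emitting libxml parser warnings.
function og_properties($html) {
    // Collect parser errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            $found[$property] = $meta->getAttribute('content');
        }
    }
    return $found;
}

$html = '<html><head>'
      . '<meta property="og:title" content="Example title" />'
      . '<meta property="og:description" content="" />'
      . '</head><body><p>Fish & chips</p></body></html>';

$og = og_properties($html);
var_dump($og['og:description'] === ''); // bool(true): the empty content is preserved
```

This also shows that DOMDocument itself returns an empty string for an empty content attribute; the unexpected fallback text in the question came from the redirect, not from the parser.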

Empty attribute in DOM returns unexpected fallback value
