Question

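I need to fetch a remote page, modify some elements (using the 'PHP Simple HTML DOM Parser' library), and output the modified content.

The problem is that the remote page's source code does not use full URLs, so CSS and images fail to load. That does not stop me from modifying elements, of course, but the output looks bad.

For example, open https://www.raspberrypi.org/downloads/

However, if you use this code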

$html = file_get_html('http://www.raspberrypi.org/downloads');
echo $html;

it looks bad. I tried a simple hack, which helps a little:

$html = file_get_html('http://www.raspberrypi.org/downloads');
$html=str_ireplace("</head>", "<base href='http://www.raspberrypi.org'></head>", $html);
echo $html;

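Is there a way to "instruct" the script to resolve all links in the $html variable against 'http://www.raspberrypi.org'? In other words, how do I make raspberrypi.org the "main" source from which all images and CSS elements are fetched?

I don't know how to explain it better, but I believe you get the idea.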

Answer 1

I just tried this locally, and I noticed (in the source code) that the link tags in the HTML look like this:

<link rel='stylesheet' href='/wp-content/themes/mind-control/js/qtip/jquery.qtip.min.css' />

This obviously asks for a file that would have to be in my local directory (such as localhost/wp-content/etc.../). The href of the link tag has to be something like

<link rel='stylesheet' href='https://www.raspberrypi.org/wp-content/themes/mind-control/js/qtip/jquery.qtip.min.css' />

So what you probably want to do is find all link tags and prepend "https://www.raspberrypi.org/" to their href attribute.

Edit: Hey, I actually got the styles working. Try this code:

$html = file_get_html('http://www.raspberrypi.org/downloads');
// Prepend the site root to the href of every <link> tag.
// This assumes the hrefs are root-relative, like the one shown above.
foreach ($html->find('link') as $element)
{
    $element->href = 'http://www.raspberrypi.org' . $element->href;
}
echo $html;
die;
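
The loop above prepends the site root to every href, including any that are already absolute. A minimal sketch of a guard for that case follows. The helper name `absolutize_url` is hypothetical (it is not part of Simple HTML DOM), and as a simplification it resolves plain relative paths against the site root rather than against the directory of the current page.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): make a URL found in a
// page absolute, leaving URLs that are already absolute untouched.
function absolutize_url($url, $root_url)
{
    $url = trim($url);

    // Empty, fragment-only, or already carrying a scheme (http:, https:, data:, mailto:, ...): keep as-is
    if ($url === '' || $url[0] === '#' || preg_match('/^[a-z][a-z0-9+.\-]*:/i', $url)) {
        return $url;
    }

    // Protocol-relative (//cdn.example.com/x.css): borrow the scheme of the root URL
    if (substr($url, 0, 2) === '//') {
        $scheme = parse_url($root_url, PHP_URL_SCHEME);
        return ($scheme ?: 'http') . ':' . $url;
    }

    // Root-relative (/path) or plain relative (path): join with exactly one slash.
    // Simplification: plain relative paths are resolved against the site root.
    return rtrim($root_url, '/') . '/' . ltrim($url, '/');
}
```

Inside the loop, the assignment would then become `$element->href = absolutize_url($element->href, 'http://www.raspberrypi.org');`.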

Answer 2

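Since only Nikolay Ganovski offered a solution, I wrote some code that turns a partial page into a complete one by finding incomplete css/img/form tags and completing their URLs. In case anyone needs it, here is the code: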

// Finalizes a remote page by completing incomplete css/img/form URLs
// (/path/file.css becomes http://somedomain.com/path/file.css, etc.)
function finalize_remote_page($content, $root_url)
{
    // Strip the scheme, "www." and a trailing slash, so that http://domain.com
    // and https://www.domain.com/ are treated as the same site
    $root_url_without_scheme = preg_replace('/^(?:https?:\/\/)?(?:www\.)?(.*?)\/?$/i', '$1', $root_url);

    $content_object = str_get_html($content);
    if (is_object($content_object))
    {
        foreach ($content_object->find('link[rel=stylesheet]') as $entry) // find css
        {
            // skip protocol-relative URLs (//domain.com/...) and URLs that already contain the root domain
            if (substr($entry->href, 0, 2) != "//" && stristr($entry->href, $root_url_without_scheme) === FALSE)
            {
                $entry->href = $root_url . $entry->href;
            }
        }

        foreach ($content_object->find('img') as $entry) // find img
        {
            if (substr($entry->src, 0, 2) != "//" && stristr($entry->src, $root_url_without_scheme) === FALSE)
            {
                $entry->src = $root_url . $entry->src;
            }
        }

        foreach ($content_object->find('form') as $entry) // find form
        {
            if (substr($entry->action, 0, 2) != "//" && stristr($entry->action, $root_url_without_scheme) === FALSE)
            {
                $entry->action = $root_url . $entry->action;
            }
        }
    }

    // The parsed page object (or whatever str_get_html() returned if parsing failed); echo it to print the page
    return $content_object;
}
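
For comparison, here is a rough sketch of the same idea using PHP's bundled DOM extension (DOMDocument) instead of Simple HTML DOM. This is a swapped-in technique, not the code from the answer above. The function name `absolutize_page` and the tag/attribute map are illustrative assumptions, and the sketch only rewrites root-relative URLs (those starting with a single "/"). It also expects non-empty input.

```php
<?php
// Hypothetical sketch: rewrite root-relative URLs with PHP's built-in
// DOMDocument rather than Simple HTML DOM.
function absolutize_page($content, $root_url)
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings silently
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tag => attribute that carries a URL (illustrative, not exhaustive)
    $targets = array('link' => 'href', 'img' => 'src', 'script' => 'src', 'form' => 'action');
    $root = rtrim($root_url, '/');

    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            // Only touch root-relative values: "/path", but not "//host/path"
            if (preg_match('#^/(?!/)#', $value)) {
                $node->setAttribute($attr, $root . $value);
            }
        }
    }

    return $doc->saveHTML();
}
```

A call such as `echo absolutize_page($page_source, 'http://www.raspberrypi.org');` would print the rewritten markup, where `$page_source` is assumed to hold the HTML fetched from the remote page.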

PHP - Displaying the full content of a remote page
