Display the page source of a web page in PHP

Asked: 2012-07-13 13:58:39

Tags: php xpath web-scraping

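How can I retrieve the entire page source of a specific web page into a string variable and echo it in PHP? I am new to PHP and don't know how to do this. Could anyone give me complete source code? Here is my source code: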

<?php
// Load and parse the remote page.
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.google.com');

$xpath = new DOMXPath($dom);

// Look for an <input> element whose name attribute is "session_id".
$elements = $xpath->query('//input[@name="session_id"]');
if ($elements->length) {
    echo "found: ", $elements->item(0)->getAttribute('value');
} else {
    echo "not found";
}
?>

I replaced the code above with a call to file_get_contents() and changed the URL to:

'http://www.flipkart.com/professional-android-2-application-development-8126525894/p/itmdytmwpjzyhade?pid=9788126525898&ref=8a47bf68-7558-43ce-a9b2-17c1ac119e84'

But it gives the following error: Warning: file_get_contents(http://www.flipkart.com/professional-android-2-application-development-8126525894/p/itmdytmwpjzyhade?pid=9788126525898&ref=8a47bf68-7558-43ce-a9b2-17c1ac119e84) [function.file-get-contents]: failed to open stream: HTTP request failed! in C:\wamp\www\displaycontentswebpage.php on line 2

Expected result (the page source):

<title>Professional Android 2 Application Development 8126525894: Book: Reto Meier (9788126525898) | Flipkart.com</title>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
<!--<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" /> -->
<meta property="fb:page_id" content="102988293558" />
<meta property="fb:admins" content="658873552,1412400758,624500995,100000233612389"/>
<meta name="Keywords" content="professional android 2 application development, buy professional android 2 application development, professional android 2 application development india, professional android 2 application development review, reto meier, 8126525894, 9788126525898" />
<meta name="Description" content="Professional Android 2 Application Development by Reto Meier. Rs.449, Save 25%. Buy Professional Android 2 Application Development, All India Free Home Delivery. 8126525894, 9788126525898 |" />


    <link rel="canonical" href="http://www.flipkart.com/professional-android-2-application-development-8126525894/p/itmdytmwpjzyhade" />
<link rel='shortcut icon' href='http://img5.flixcart.com/www/prod/images/favicon-18354.ico' />................something something..........................




src="http://googleads.g.doubleclick.net/pagead/viewthroughconversion/1017598645/?value=0&amp;label=9tgBCLOv-QIQtaWd5QM&amp;guid=ON&amp;script=0"/>
                </div>
            </noscript></div>

Please help.

3 Answers:

Answer 0: (score: 1)

Once the document is loaded into the $dom variable, you can do the following:

echo htmlspecialchars($dom->saveHTML());

See saveHTML in the manual.

I am using htmlspecialchars so that the HTML is displayed as text rather than rendered by the browser.
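As a minimal sketch of the steps above, the following parses markup, serializes it back with saveHTML(), and escapes the result. The inline $html string is a hypothetical stand-in for a fetched page, so the snippet runs without network access; the variable names are assumptions for illustration only.

```php
<?php
// Hypothetical inline markup standing in for a fetched page.
$html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>';

// Parse the markup into a DOM tree.
$dom = new DOMDocument;
$dom->loadHTML($html);

// Serialize the parsed document back into a string.
$source = $dom->saveHTML();

// Escape the markup so a browser shows it as text instead of rendering it.
$escaped = htmlspecialchars($source);
echo $escaped;
```

Note that saveHTML() returns the document as the parser rebuilt it, which may differ from the bytes the server originally sent (for example, a doctype may be added).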

Answer 1: (score: 0)

This can be done with a single PHP function, file_get_contents() (http://php.net/file-get-contents), which returns the contents of a file as a string.

// print source to current output
echo file_get_contents( 'http://www.google.com' );

// print the content in a readable (escaped) form
echo htmlspecialchars( file_get_contents( 'http://www.google.com' ), ENT_SUBSTITUTE );

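The ENT_SUBSTITUTE (or ENT_IGNORE) flag is needed for htmlspecialchars() if the input contains an invalid code unit sequence in the given encoding; without it, the function returns an empty string. See http://php.net/htmlspecialchars#refsect1-function.htmlspecialchars-returnvalues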
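To illustrate the flag's effect, here is a small sketch that needs no network access. The $bad string is a hypothetical input containing a lone "\xE9" byte, which is not valid UTF-8; the explicit 'UTF-8' argument and the ENT_QUOTES flag are choices made for this example, not part of the answer's code.

```php
<?php
// "\xE9" on its own is not a valid UTF-8 sequence (it is "é" in ISO-8859-1).
$bad = "<p>caf\xE9</p>";

// Without ENT_SUBSTITUTE or ENT_IGNORE, invalid input yields an empty string.
$plain = htmlspecialchars($bad, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE, the invalid byte is replaced by U+FFFD instead.
$substituted = htmlspecialchars($bad, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($plain === '');
echo $substituted, "\n";
```

ENT_SUBSTITUTE is generally the safer of the two flags, since ENT_IGNORE silently drops the offending bytes.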

Answer 2: (score: 0)

$dom = new DOMDocument ('1.0');

@$dom->loadHTMLfile ('https://mp3skull.cr');

$thisi=$dom->saveHTML();

echo htmlentities($thisi);

This will print the HTML source of the page.
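The @ operator above hides the warnings that malformed markup triggers during parsing. One alternative, sketched here as an assumption rather than as part of the answer, is libxml_use_internal_errors(), which collects the parse errors instead of emitting them. The inline $html string is a hypothetical stand-in for a fetched page.

```php
<?php
// Hypothetical, deliberately sloppy markup standing in for a fetched page.
$html = '<div><p>unclosed paragraph</div>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument('1.0');
$dom->loadHTML($html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Escape the serialized document so it is shown as text.
$escaped = htmlentities($dom->saveHTML());
echo $escaped;
```

If the parse errors are of interest, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.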