Question

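I need to get the HTML content of a page using JavaScript. The page may also be on another domain; in other words, what would a wget look like in JavaScript? I want to use it for a kind of web crawler.

Using JavaScript, how can I fetch the content of a page, given its URL, and get it as a string?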

Answer 1

The general way to load content over HTTP from JavaScript is to use the XMLHttpRequest object. It is subject to the same origin policy, so to access content on other domains you must circumvent it.
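A minimal sketch of the XMLHttpRequest approach, assuming the code runs in a browser and the target page is on the same origin (or allows cross-origin access). The name getPageAsString and the two callbacks are illustrative, not from the original answer.

```javascript
// Fetch a page and hand its HTML to a callback as a string.
// getPageAsString is a hypothetical helper name; browser environment assumed.
function getPageAsString(url, onSuccess, onError) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", url, true); // true = asynchronous request
  xhr.onload = function () {
    if (xhr.status >= 200 && xhr.status < 300) {
      onSuccess(xhr.responseText); // the page's HTML as a string
    } else {
      onError(new Error("HTTP " + xhr.status));
    }
  };
  // Fires on network failure, and also when the same origin policy blocks the request.
  xhr.onerror = function () {
    onError(new Error("Request failed or was blocked"));
  };
  xhr.send();
}
```

Usage would look something like getPageAsString("/some/page.html", function (html) { console.log(html.length); }, console.error); where the path is just a placeholder.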

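This assumes that you are running the JS in a web browser (implied by "the page may also be on another domain"). If you aren't, other options are open to you. For example, with nodejs you could use the http client.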
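For the nodejs case, here is a rough sketch using the built-in http module. fetchPage is an illustrative name; this version only handles plain http:// URLs and does not follow redirects, so treat it as a starting point rather than a finished crawler.

```javascript
const http = require("http");

// fetchPage is a hypothetical helper: it calls callback(err, body)
// with the response body collected into a single string.
function fetchPage(url, callback) {
  http.get(url, function (res) {
    res.setEncoding("utf8"); // deliver chunks as strings instead of Buffers
    let body = "";
    res.on("data", function (chunk) { body += chunk; });
    res.on("end", function () { callback(null, body); });
  }).on("error", function (err) {
    callback(err);
  });
}
```

Since no browser is involved here, the same origin policy does not apply and any reachable host can be requested.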

Answer 2

Try this:

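// JSONP callback: YQL calls cbfunc with an object whose results array holds the page markup.
function cbfunc(html) { alert(html.results[0]); }
// url is the address of the page to fetch; it must be defined before this runs.
$.getScript('http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22' +
    encodeURIComponent(url) + '%22&format=xml&diagnostics=true&callback=cbfunc');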
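To make the request URL in the snippet above easier to read, here is a small sketch that builds the same string step by step. buildYqlUrl and YQL_PREFIX are illustrative names, not part of jQuery or YQL. As far as I know, Yahoo has since retired the public YQL endpoint, so this only shows how the query is put together.

```javascript
// Everything up to the opening quote of: select * from html where url="
var YQL_PREFIX =
  "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22";

// buildYqlUrl is a hypothetical helper name used for illustration only.
function buildYqlUrl(pageUrl, callbackName) {
  return YQL_PREFIX +
    encodeURIComponent(pageUrl) +        // escapes :, /, ?, & and spaces in the target URL
    "%22&format=xml&diagnostics=true&callback=" +
    encodeURIComponent(callbackName);    // name of the global JSONP callback function
}
```

The target URL has to be escaped because it is embedded inside another URL's query string; without encodeURIComponent, any & or ? in it would be read as part of the outer query.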


More about YQL

Answer 3

If you also want to capture the html tags, you can concatenate them onto the HTML like this:

function getPageHTML() {
    return "<html>" + $("html").html() + "</html>";
}
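A possible jQuery-free variant, sketched under the assumption that a standard DOM document is available: documentElement.outerHTML returns the html element including its own tags and attributes, so nothing has to be concatenated by hand. getPageHTMLPlain and its optional doc parameter are illustrative names.

```javascript
// getPageHTMLPlain is a hypothetical helper name.
// doc defaults to the current page's document when not supplied.
function getPageHTMLPlain(doc) {
  doc = doc || document;
  // outerHTML includes the <html ...> opening tag with its attributes
  // and the closing </html>; the doctype is not part of it.
  return doc.documentElement.outerHTML;
}
```

Note that both versions serialize the page currently loaded in the browser, not an arbitrary URL.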

How do I get the entire page's HTML with jQuery?

How do I get a web page as a string using JavaScript?
