Screen scraping a web server using its IP address instead of its domain name

Time: 2015-03-25 21:22:14

Tags: javascript jquery screen-scraping

Is this possible? It works when baseUrl = "http://mashable.com", but not when I give it an IP address.

<script src='https://raw.github.com/padolsey/jQuery-Plugins/master/cross-domain-ajax/jquery.xdomainajax.js'></script>
<script>
$(document).ready(function () {
    // Target server addressed by IP instead of by domain name
    var baseUrl = "https://12.34.56.78:8000/";
    $.ajax({
        url: baseUrl,
        type: "get",
        dataType: "",
        success: function (data) {
            alert("Yeah we are om jere");
        }
    });
});
</script>

1 answer:

Answer 0 (score: 3)

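This is going to be difficult, because many websites may be hosted on the same server and therefore share the same IP address. It works with a domain name because your client sends that name in the Host header along with the GET request.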
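To make the Host header's role concrete, here is a minimal sketch of how a server hosting several sites on one IP might choose between them. The `sites` table, the `.test` host names, and the `route` function are hypothetical and only illustrate the idea; real servers have their own configuration for this.

```javascript
// Hypothetical virtual-host table: two sites served from one IP address.
const sites = {
  "example-a.test": "Site A home page",
  "example-b.test": "Site B home page",
};

// Pick a response based on the Host header, as a name-based virtual host would.
function route(hostHeader) {
  // Normalise case and strip an optional port ("example-a.test:8000" -> "example-a.test").
  const host = String(hostHeader || "").toLowerCase().replace(/:\d+$/, "");
  if (Object.prototype.hasOwnProperty.call(sites, host)) {
    return { status: 200, body: sites[host] };
  }
  // A bare IP address matches none of the configured names.
  return { status: 404, body: "Not Found" };
}
```

With this table, `route("example-a.test")` finds a site, while `route("12.34.56.78:8000")` falls through to the 404 branch, because the IP address alone does not say which site is wanted.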

See this curl output for Stack Overflow:

C:\Users\Yeah>curl --head -i -v stackoverflow.com/
* Hostname was NOT found in DNS cache
*   Trying 198.252.206.140...
* Connected to stackoverflow.com (198.252.206.140) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: stackoverflow.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

You can see the domain name being passed as a header. By contrast, if I query using the IP address found above, I get a 404 error:

C:\Users\Yeah>curl --head -i -v 198.252.206.140/
* Hostname was NOT found in DNS cache
*   Trying 198.252.206.140...
* Connected to 198.252.206.140 (198.252.206.140) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: 198.252.206.140
> Accept: */*
>
< HTTP/1.1 404 Not Found
HTTP/1.1 404 Not Found
< [...]
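One way around this, sketched below under some assumptions, is to connect to the IP address while still sending the site's name in the Host header. Browser scripts are not allowed to set Host themselves, so this sketch assumes a Node.js environment and its built-in `http` module; `optionsForIp` and `headViaIp` are hypothetical helper names, and whether the server answers as hoped still depends on its configuration.

```javascript
const http = require("http");

// Build request options that connect to `ip` but name `hostname` in the Host header.
function optionsForIp(ip, hostname, port = 80) {
  return {
    host: ip,                    // where the TCP connection goes
    port: port,
    method: "HEAD",
    path: "/",
    headers: { Host: hostname }, // which site we ask the server for
  };
}

// Send the request and hand the status code (or an error) to `done`.
function headViaIp(ip, hostname, done) {
  const req = http.request(optionsForIp(ip, hostname), (res) => {
    res.resume();                // discard any body so the socket is released
    done(null, res.statusCode);
  });
  req.on("error", done);
  req.end();
}
```

A call such as `headViaIp("198.252.206.140", "stackoverflow.com", (err, status) => console.log(err || status))` would repeat the request above with the domain name restored in the header.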

As a counterexample, if I try something similar with the Facebook website, I get this:

C:\Users\Yeah>curl --head -i -v --insecure -L https://www.facebook.com/
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 443 (#0)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

If I try using the IP address above:

C:\Users\Yeah>curl --head -i -v --insecure -L https://31.13.93.3/
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to 31.13.93.3 (31.13.93.3) port 443 (#0)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: 31.13.93.3
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
HTTP/1.1 301 Moved Permanently
< Location: http://www.facebook.com/
Location: http://www.facebook.com/
< [...]

<
* Connection #0 to host 31.13.93.3 left intact
* Issue another request to this URL: 'http://www.facebook.com/'
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 80 (#1)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
HTTP/1.1 301 Moved Permanently
< [...]

<
* Connection #1 to host www.facebook.com left intact
* Issue another request to this URL: 'https://www.facebook.com/'
* Found bundle for host www.facebook.com: 0x6097814fe0
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 443 (#2)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

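Here, -L (follow redirects) and --insecure (accept any certificate) are needed for curl to eventually reach the Facebook site, but these are common client (i.e. browser) actions.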
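For a script rather than curl, the two flags could be approximated roughly as follows. This is a sketch assuming Node.js, where the returned options object would be passed to `https.request`; `insecureOptionsForIp` and `nextUrl` are hypothetical helper names, and the list of redirect status codes is my own choice.

```javascript
// Options roughly matching `curl --insecure https://<ip>/` while still naming the site.
function insecureOptionsForIp(ip, hostname) {
  return {
    host: ip,
    port: 443,
    method: "HEAD",
    path: "/",
    headers: { Host: hostname },
    servername: hostname,      // SNI: the name sent during the TLS handshake
    rejectUnauthorized: false, // like --insecure: skips certificate verification
  };
}

// Like -L: work out the next URL from a redirect response, or null if there is none.
function nextUrl(currentUrl, statusCode, location) {
  const isRedirect = [301, 302, 303, 307, 308].includes(statusCode);
  if (!isRedirect || !location) return null;
  // Resolves relative Location values against the current URL as well.
  return new URL(location, currentUrl).href;
}
```

A scraper would call `nextUrl` after each response and repeat the request until it returns null, which mirrors the chain of 301 responses in the transcript above.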

So it really depends on the particular website you want to screen-scrape and on how its server is configured.