Question

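I am trying to download an XML file from a Polish website. It worked on the first day, but now I cannot download the file to my server (although I can still open and download it on my own computer). The file on my server, which should contain XML, contains HTML instead, telling me that I have been blocked.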

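I contacted the administrator of the website I want to get the XML from, and he told me that I am not blocked by IP address. So the question is: what should I send in the headers, or do differently, to download this file?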

The code that downloads the XML file is below. This is the XML I want to download: http://www.polskatimes.pl/rss/fakty_kraj.xml

$headers[]  = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
$headers[]  = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$headers[]  = "Accept-Language:pl-PL,pl;q=0.8";
$headers[]  = "Accept-Encoding:gzip,deflate,sdch";
$headers[]  = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$headers[]  = "Keep-Alive:115";
$headers[]  = "Connection:keep-alive";
$headers[]  = "Cache-Control:max-age=0";

// $xml holds the feed URL; the response body ends up in $xml_data.
$xml_data = file_get_contents($xml, false, stream_context_create(
    array("http" => array('header' => $headers))));
file_put_contents($xml_md5, $xml_data); // save the body to the path in $xml_md5

Requesting the URL in verbose mode (-v):

* About to connect() to www.polskatimes.pl port 80 (#0)
*   Trying 195.8.99.38... connected
* Connected to www.polskatimes.pl (195.8.99.38) port 80 (#0)
> GET /rss/fakty_kraj.xml HTTP/1.1
> User-Agent: curl/7.21.0 (x86_64-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.15 libssh2/1.2.6
> Host: www.polskatimes.pl
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Thu, 18 Apr 2013 10:40:15 GMT
< Content-Type: text/html; charset=utf8
< Transfer-Encoding: chunked
< Connection: close
< Vary: Accept-Encoding
< Expires: Thu, 18 Apr 2013 10:40:15 GMT
< Cache-Control: max-age=0
(HTML page with a message that I am temporarily blocked)
* Closing connection #0

Answer 1

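To see what is happening behind the scenes (and which headers you actually need), you have to do some analysis. It is nothing fancy: you can do it on the command line with a program called curl, which is available for many (maybe even all?) computer platforms.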

The most common first step is to request the URL in verbose mode (-v):

$ curl -v http://www.polskatimes.pl/rss/fakty_kraj.xml
* About to connect() to www.polskatimes.pl port 80 (#0)
*   Trying 195.8.99.38... connected
* Connected to www.polskatimes.pl (195.8.99.38) port 80 (#0)
> GET /rss/fakty_kraj.xml HTTP/1.1
> User-Agent: curl/7.21.1 (i686-pc-mingw32) libcurl/7.21.1 OpenSSL/0.9.8r zlib/1.2.3
> Host: www.polskatimes.pl
> Accept: */*
>
< HTTP/1.1 302 Found
< Date: Wed, 17 Apr 2013 17:39:51 GMT
< Server: Apache
< Set-Cookie: sprawdz_cookie=1; expires=Thu, 17-Apr-2014 17:39:51 GMT
< Location: http://www.polskatimes.pl/rss/fakty_kraj.xml?cookie=1
< Vary: Accept-Encoding
< Content-Length: 0
< Connection: close
< Content-Type: text/html; charset=iso-8859-2
<
* Closing connection #0

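It shows you the request headers (prefixed with >), the response headers (prefixed with <) and the response body (empty in this case). As you can see, the status is 302 Found. A 3xx status means a redirect, and the Location header tells us where to: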

Location: http://www.polskatimes.pl/rss/fakty_kraj.xml?cookie=1

As the query parameter suggests, this is a cookie check. The cookie itself is set in the same response:

Set-Cookie: sprawdz_cookie=1; expires=Thu, 17-Apr-2014 17:39:51 GMT
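The same inspection can be done from PHP. The following is a minimal sketch, assuming the response header lines are available as an array of strings, which is the shape PHP exposes in `$http_response_header` after an HTTP `file_get_contents()` call. The function name `summarize_response_headers` is hypothetical, not something PHP provides. If several responses are present in the array (for example after a followed redirect), the last status line wins and cookies accumulate.

```php
<?php
/**
 * Summarize raw response header lines: status code, redirect target,
 * and the name=value pairs of any cookies the server set.
 * Hypothetical helper, not part of PHP.
 */
function summarize_response_headers(array $lines): array
{
    $summary = ['status' => null, 'location' => null, 'cookies' => []];

    foreach ($lines as $line) {
        // Status line, e.g. "HTTP/1.1 302 Found"
        if (preg_match('~^HTTP/\S+\s+(\d{3})~', $line, $m)) {
            $summary['status'] = (int) $m[1];
            continue;
        }
        // Skip anything that is not a "Name: value" header line
        if (strpos($line, ':') === false) {
            continue;
        }
        [$name, $value] = explode(':', $line, 2);
        $name  = strtolower(trim($name));
        $value = trim($value);

        if ($name === 'location') {
            $summary['location'] = $value;
        } elseif ($name === 'set-cookie') {
            // Keep only "name=value"; drop attributes such as expires or path
            $pair = trim(explode(';', $value, 2)[0]);
            if (strpos($pair, '=') !== false) {
                [$cookieName, $cookieValue] = explode('=', $pair, 2);
                $summary['cookies'][trim($cookieName)] = trim($cookieValue);
            }
        }
    }

    return $summary;
}

// Header lines copied from the 302 response shown above
$summary = summarize_response_headers([
    'HTTP/1.1 302 Found',
    'Set-Cookie: sprawdz_cookie=1; expires=Thu, 17-Apr-2014 17:39:51 GMT',
    'Location: http://www.polskatimes.pl/rss/fakty_kraj.xml?cookie=1',
]);
print_r($summary);
```

The `cookies` entry of the result is exactly what has to be sent back in a `Cookie:` request header on the next request.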

So in the next step we replay the last command, but this time with the cookie set, which can be done with the -b argument:

$ curl -v -b sprawdz_cookie=1 http://www.polskatimes.pl/rss/fakty_kraj.xml
* About to connect() to www.polskatimes.pl port 80 (#0)
*   Trying 195.8.99.38... connected
* Connected to www.polskatimes.pl (195.8.99.38) port 80 (#0)
> GET /rss/fakty_kraj.xml HTTP/1.1
> User-Agent: curl/7.21.1 (i686-pc-mingw32) libcurl/7.21.1 OpenSSL/0.9.8r zlib/1.2.3
> Host: www.polskatimes.pl
> Accept: */*
> Cookie: sprawdz_cookie=1
>
< HTTP/1.1 200 OK
< Date: Wed, 17 Apr 2013 17:43:52 GMT
< Server: Apache
< Set-Cookie: sesja_gratka=e38fa0eb93705c8de7ae906198494439; expires=Wed, 24-Apr-2013 17:43:52 GMT; path=/; domain=polskatimes.pl
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Vary: Accept-Encoding
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/xml; charset=utf-8
<
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title><![CDATA[Fakty - Kraj]]></title>
    <link>http://www.polskatimes.pl/fakty/kraj/</link>
    <atom:link href="http://www.polskatimes.pl/rss/fakty_kraj.xml" rel="self" type="application/rss+xml"/>
    <description><![CDATA[Materiały z działu Kraj]]></description>
... (cut)

So this worked right away. Now the really nice part: you know that you need to set the cookie for the request, and curl shows you all the headers it used:

> GET /rss/fakty_kraj.xml HTTP/1.1
> User-Agent: curl/7.21.1 (i686-pc-mingw32) libcurl/7.21.1 OpenSSL/0.9.8r zlib/1.2.3
> Host: www.polskatimes.pl
> Accept: */*
> Cookie: sprawdz_cookie=1

With file_get_contents you do not need to care about most of them: neither the first line nor the Host: and Accept: lines.

The User-Agent: header does not seem to play a role either, since curl's own user agent was accepted.

So what is left is the Cookie: header. Let's try it with PHP:

$ php -r "echo file_get_contents('http://www.polskatimes.pl/rss/fakty_kraj.xml', false, 
stream_context_create(['http'=>['header'=>['Cookie: sprawdz_cookie=1']]]));"
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title><![CDATA[Fakty - Kraj]]></title>
    <link>http://www.polskatimes.pl/fakty/kraj/</link>
    <atom:link href="http://www.polskatimes.pl/rss/fakty_kraj.xml" rel="self" 
    type="application/rss+xml"/>
... (cut)

This is a quick test; it shows that only the Cookie: sprawdz_cookie=1 request header is needed.
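Applied to the download script from the question, this could look roughly like the sketch below. It assumes the server still accepts the cookie named in the Set-Cookie response header shown earlier (`sprawdz_cookie`). The helper names `cookie_context`, `looks_like_xml` and `download_xml` are hypothetical, and the XML check is deliberately rough: it only tests for a leading XML declaration, which this feed has.

```php
<?php
/**
 * Build a stream context that sends the given cookies with the request.
 * Hypothetical helper; $cookies maps cookie names to values.
 */
function cookie_context(array $cookies)
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }

    return stream_context_create([
        'http' => [
            'header' => ['Cookie: ' . implode('; ', $pairs)],
        ],
    ]);
}

/**
 * Rough check that a body is XML rather than an HTML "blocked" page:
 * it only looks for a leading XML declaration.
 */
function looks_like_xml(string $body): bool
{
    return strncmp(ltrim($body), '<?xml', 5) === 0;
}

/**
 * Download $url with the given cookies and save it to $target,
 * but only if the body looks like XML. Returns true on success.
 */
function download_xml(string $url, string $target, array $cookies): bool
{
    $body = file_get_contents($url, false, cookie_context($cookies));
    if ($body === false || !looks_like_xml($body)) {
        return false; // request failed or the HTML block page came back
    }

    return file_put_contents($target, $body) !== false;
}
```

A call such as `download_xml('http://www.polskatimes.pl/rss/fakty_kraj.xml', 'fakty_kraj.xml', ['sprawdz_cookie' => '1'])` would then write the feed only when the response is not the block page; the target file name here is just an example.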
