I want to crawl a web page like this.
It seems I am getting a 405 error:
2018-04-09 11:18:40.930 c.d.s.b.FetcherBolt FetcherThread #2 [INFO] [Fetcher #3] Fetched https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge/incrpc/topprod with status 405 in msec 53
The page seems to have some kind of crawler protection. Is it possible to crawl it using StormCrawler together with Selenium?
Answer (score: 1)
The site does not allow bots and returns a 405 when the user agent does not look like a browser's. You can reproduce the problem with curl:
curl -I "https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge"
HTTP/1.1 405 Method Not Allowed
Accept-Ranges: bytes
Content-Type: text/html
Server: nginx
Surrogate-Control: no-store, bypass-cache
X-Distil-CS: BYPASS
Expires: Mon, 09 Apr 2018 10:48:02 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Mon, 09 Apr 2018 10:48:02 GMT
Connection: keep-alive
curl -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36" -I "https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge"
HTTP/1.1 200 OK
Content-Type: text/html
Server: nginx
Surrogate-Control: no-store, bypass-cache
Expires: Mon, 09 Apr 2018 10:48:26 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Mon, 09 Apr 2018 10:48:26 GMT
Connection: keep-alive
A workaround could be to use Selenium as you suggested, or simply to change the user agent so that it mimics what a browser would send. Not ideal, since it is always better to be open about what your crawler is, but in this particular case the site would have blocked crawlers in its robots.txt if that had been their intention.
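If you go the Selenium route, a rough sketch of what pointing StormCrawler at a remote WebDriver could look like is below; the protocol class name, the selenium.* keys and the WebDriver address are assumptions you would need to check against your StormCrawler version:
# crawler-conf.yaml (sketch, verify keys and class name for your version)
config:
  # use the Selenium-based protocol implementation instead of the default HTTP one
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  # address of a remote WebDriver (e.g. a standalone chromedriver); hypothetical value
  selenium.addresses:
    - "http://localhost:9515"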
You can change the user agent via the configuration in StormCrawler.
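A minimal sketch of the relevant part of the crawler configuration, assuming the standard crawler-conf.yaml layout; StormCrawler assembles the user agent string from the http.agent.* parts, so putting the full browser string in the name and blanking the other parts is a hacky but workable approach (exact assembly may differ between versions):
# crawler-conf.yaml (sketch) - make the agent string look like a browser
config:
  # full browser UA as the agent name; the remaining parts left empty
  http.agent.name: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
  http.agent.version: ""
  http.agent.description: ""
  http.agent.url: ""
  http.agent.email: ""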