Question

Scrapy can request URLs that contain GET parameters and let you browse the response interactively:

scrapy shell "https://duckduckgo.com/?q=foo"

For some sites, however, my request gets a 301 redirect, and the URL parameters are dropped:

DEBUG: Redirecting (301) to <GET http://foo.com/mypage/> 
  from <GET http://foo.com/mypage/?bar=baz>
DEBUG: Crawled (200) <GET http://foo.com/mypage/> (referer: None)

When I visit http://foo.com/mypage/?bar=baz normally in a browser, I am not redirected and the GET parameters stay in place.

Can anyone suggest how I can avoid being redirected?

Answer 1

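Inspired by @paultrmbrth's answer in the comments, here is how to solve this with user agent spoofing.

First, find your browser's user agent string (I used http://www.whatsmyuseragent.com/ for this, but there are probably other ways).

Mine was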

Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0

Now add the following line to project_name/settings.py:

USER_AGENT = "whatever the user agent string was"

and scrapy shell "http://foo.com/mypage/?bar=baz" will then work as expected.
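
To check outside Scrapy whether the user agent really is what triggers the redirect, a minimal standard-library sketch along these lines may help. The helper names (build_request, final_url, dropped_params) are hypothetical and not part of Scrapy, and foo.com is only the placeholder host from the question.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request, urlopen

# User agent string quoted in this answer; substitute your own browser's.
BROWSER_UA = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that sends the given User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def final_url(url, user_agent=BROWSER_UA):
    """Fetch url (needs network access) and return the URL after any redirects."""
    with urlopen(build_request(url, user_agent)) as response:
        return response.geturl()


def dropped_params(original_url, redirected_url):
    """Return the names of query parameters lost between the two URLs."""
    before = parse_qs(urlsplit(original_url).query, keep_blank_values=True)
    after = parse_qs(urlsplit(redirected_url).query, keep_blank_values=True)
    return sorted(set(before) - set(after))


# Offline check using the two URLs from the log in the question.
print(dropped_params("http://foo.com/mypage/?bar=baz", "http://foo.com/mypage/"))
```

Calling final_url once with a non-browser user agent and once with the browser's string, then passing each result to dropped_params, would show whether the server treats the two differently.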
