Scrapy: 403 error on every request

Date: 2018-05-26 09:51:42

Tags: python python-3.x scrapy scrapy-spider

My Scrapy crawler uses random proxies and works fine on my own computer. But when I run it on a VPS, every request comes back with a 403 error.

2018-05-26 09:43:18 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:18 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.29:2716>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.173:5195>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.93:3410>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden

I manually checked the proxies in Firefox on the VPS and I can access the site without any errors.

These are my settings, and they are identical to the ones on my computer:

DOWNLOADER_MIDDLEWARES = {
   # 'monitor.middlewares.MonitorDownloaderMiddleware': 543,
   # Proxies
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # Proxies end
    # Useragent
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400,
    # Useragent end
}

# Random useragent list
USER_AGENT_LIST = r"C:\Users\Administrator\Desktop\useragents.txt"

# Retry many times since proxies often fail
RETRY_TIMES = 5
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
PROXY_LIST = r"C:\Users\Administrator\Desktop\proxies.txt"

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

2 Answers:

Answer 0 (score: 0)

I'm not sure what the problem is, but I've seen a lot of people run into issues with scrapy_proxies. I'm using scrapy-rotating-proxies instead. It's maintained by kmike, who also maintains the Scrapy framework, so I think it's the better option.
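For reference, a minimal sketch of how scrapy-rotating-proxies is wired into settings.py, following its README; the proxy file path below is just a placeholder:

# settings.py -- minimal scrapy-rotating-proxies setup (sketch, path is a placeholder)
ROTATING_PROXY_LIST_PATH = '/path/to/proxies.txt'   # one proxy per line

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}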

Answer 1 (score: 0)

Sometimes you get a 403 because robots.txt forbids bots on the whole site you want to scrape, or on parts of it.

In that case, first set ROBOTSTXT_OBEY = False in settings.py. I don't see it in your settings.
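That is, a single line in your project's settings.py:

# Do not honour robots.txt when crawling (use responsibly)
ROBOTSTXT_OBEY = False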

Don't assume that handling robots.txt alone is enough. You also have to set the user agent in settings.py to that of a regular browser, for example: USER_AGENT='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7'. Better still, create a list of user agents in your settings, for example:

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    ...,
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10'
]

You seem to have done that already. Then make the choice random on each request, which you also appear to be doing.
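For illustration only, a per-request random user-agent middleware boils down to something like the sketch below; the class name and the way it reads USER_AGENT_LIST are illustrative, not the internals of the random_useragent package:

# middlewares.py -- minimal sketch of a per-request random User-Agent middleware
import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENT_LIST setting (a Python list, as shown above)
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Assign a fresh User-Agent header to every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)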

Finally, this one is optional, but see whether it helps you: set DOWNLOAD_DELAY = 3 in settings.py, with a value of at least 1. Ideally, make that delay random as well. It makes your spider behave more like a browser. As far as I know, too short a download delay lets the site work out that it is dealing with a bot hiding behind a fake user agent. And if the webmaster is experienced, he will have rules and plenty of obstacles in place to protect his site against bots.
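In settings.py that could look like the lines below. RANDOMIZE_DOWNLOAD_DELAY is a built-in Scrapy setting (on by default) that spreads each wait between 0.5 and 1.5 times DOWNLOAD_DELAY, which covers the "make it random" part:

# settings.py -- slow the crawl down so it behaves less like a bot
DOWNLOAD_DELAY = 3                 # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # default; waits 0.5x to 1.5x DOWNLOAD_DELAY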

I tested the same problem as yours in my shell this morning. Hope it works for you.