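I would like to know how to access the user agent that is currently in use.
For example, I want to output something like
the current user agent is Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1
to the terminal with print(), or to a log file. How can I retrieve it?
Version: Scrapy 1.5.2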
Answer 0 (score: 2)
If the user agent is set as described in this solution, one can use the following.
settings.py:
...
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',...,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36 OPR/48.0.2685.52'
]
...
DOWNLOADER_MIDDLEWARES = {
    'chevaux_p_t.middlewares.RandomUserAgentMiddleware': 400,
    # Disable the built-in user agent middleware (module path used since
    # Scrapy 1.0; the older scrapy.contrib path is deprecated)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    ...
}
middlewares.py:
...
import logging
import random

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Read the list from the settings attached to the running spider
        ua = random.choice(spider.settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # One can do this to get the current user agent (a bytes value)
            print("current user-agent:{}".format(request.headers[b'User-Agent']))
            logging.debug("current user-agent:{}".format(request.headers[b'User-Agent']))
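The middleware reads the header back from the request instead of printing ua directly. One likely reason is that setdefault only writes the header when none is present, so the value actually sent can differ from the random pick. The following is a minimal sketch of that behavior, assuming a plain dict as a stand-in for request.headers so that it runs without Scrapy; apply_user_agent is a hypothetical helper name, not part of Scrapy.

```python
import random


def apply_user_agent(headers, candidates, rng=random):
    """Mimic the middleware logic on a plain dict (stand-in for request.headers).

    Picks a random candidate, stores it only if no User-Agent is present,
    and returns the value that ends up in the headers.
    """
    ua = rng.choice(candidates)
    if ua:
        # setdefault keeps any value that is already there
        headers.setdefault('User-Agent', ua)
    # Read the value back: this is what would actually be sent
    return headers.get('User-Agent')


if __name__ == '__main__':
    agents = ['agent-A', 'agent-B']
    # No header yet: one of the candidates is used
    print("current user-agent:{}".format(apply_user_agent({}, agents)))
    # Header already set elsewhere: the existing value wins
    print("current user-agent:{}".format(
        apply_user_agent({'User-Agent': 'preset'}, agents)))
```

In a real project the same read-back is what request.headers[b'User-Agent'] does in the middleware above.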
Whether or not you use this solution, you can display the user agent in any method of your Spider class like this:
import logging
import scrapy

class Spider(scrapy.Spider):
    def a_method(self, response):
        # response.request is the request that produced this response
        print("current user-agent:{}".format(response.request.headers['User-Agent']))
        logging.debug("current user-agent:{}".format(response.request.headers['User-Agent']))
Edit: added the change that uses response.
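The question also asks about writing the value to a log file. Scrapy stores header values as bytes, so formatting them directly prints something like b'Mozilla/5.0 ...'. Below is a rough sketch, using only the standard library, of decoding the value and sending the debug message to a file. The helper names user_agent_text and make_file_logger and the file name spider.log are assumptions made for illustration. Within a Scrapy project, the LOG_FILE setting is the more usual way to redirect log output.

```python
import logging
import os
import tempfile


def user_agent_text(value):
    """Return a header value as str, decoding it when it is bytes."""
    if isinstance(value, bytes):
        return value.decode('utf-8', errors='replace')
    return value


def make_file_logger(path, name='user_agent_demo'):
    """Build a logger that writes DEBUG messages to the file at `path`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(path, encoding='utf-8')
    handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
    logger.addHandler(handler)
    return logger, handler


if __name__ == '__main__':
    # Hypothetical log location in a temporary directory
    log_path = os.path.join(tempfile.mkdtemp(), 'spider.log')
    logger, handler = make_file_logger(log_path)
    # In a spider this value would come from response.request.headers['User-Agent']
    logger.debug('current user-agent: %s', user_agent_text(b'Mozilla/5.0'))
    handler.close()
    with open(log_path, encoding='utf-8') as log_file:
        print(log_file.read())
```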