What is the correct way to add headers (a User-Agent) when scraping with urllib and BeautifulSoup in Python 3?

Asked: 2017-06-14 07:37:24

Tags: python python-3.x beautifulsoup urllib

I am trying to add a user agent to the requests I make with urllib and BeautifulSoup in Python 3. Here is my code:

import bs4 as bs
import urllib.request
import urllib.parse
from random import choice
from time import sleep
import os

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19'
]

# urlopen() requires a full URL including the scheme (http:// or https://)
allUrlData = ['http://www.bbc.co.uk/news', 'http://www.bbc.co.uk/news/world']
r = range(2, 4)

for url in allUrlData:
    sleep(choice(r))               # pause 2-3 seconds between requests
    version = choice(user_agents)  # pick a random user agent string
    headers = {'User-Agent': version}
    req = urllib.request.Request(url, None, headers)
    htmlText = urllib.request.urlopen(req).read()
    soup = bs.BeautifulSoup(htmlText, 'lxml')
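As an aside, a common alternative pattern is to attach the header to an opener, so every request made through that opener carries it automatically. A minimal sketch (the UA string is just one taken from the list above; the commented-out line would perform a real network call):

```python
import urllib.request

ua = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 '
      '(KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19')

# Every request made via this opener will send the User-Agent header
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', ua)]

# html = opener.open('http://www.bbc.co.uk/news').read()  # network call
print(opener.addheaders)
```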

What confuses me is whether the User-Agent header is still included when I pass the req object to urlopen().

Does this code run correctly and actually send the user agent?

Or do I need to call Request.add_header(key, val) for it to work?
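A quick offline check (a sketch using only the stock urllib.request API, no network access) suggests the two approaches end up storing the same header. Note that urllib capitalizes header keys internally, so 'User-Agent' is stored and retrieved as 'User-agent':

```python
import urllib.request

ua = 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19'

# Approach 1: pass a headers dict to the Request constructor
req1 = urllib.request.Request('http://www.bbc.co.uk/news',
                              headers={'User-Agent': ua})

# Approach 2: call add_header() after construction
req2 = urllib.request.Request('http://www.bbc.co.uk/news')
req2.add_header('User-Agent', ua)

# Both store the header under the key 'User-agent'
print(req1.get_header('User-agent'))
print(req1.get_header('User-agent') == req2.get_header('User-agent'))  # True
```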

Any help is much appreciated.

0 answers:

There are no answers yet.