urllib.error.HTTPError: HTTP Error 403: Forbidden

Date: 2017-01-07 23:08:57

Tags: python http urllib

I get the error "urllib.error.HTTPError: HTTP Error 403: Forbidden" when scraping certain pages, and I understand that adding a browser-like User-Agent to the request headers is the way to solve this.

However, I can't get it to work when the URLs I'm trying to scrape come from a separate source file. How/where can I add the User-Agent to the code below?

hdr = {'User-Agent': 'Mozilla/5.0'}
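A minimal sketch of wiring that header into urllib itself (the URL here is a placeholder): pass the dict to urllib.request.Request and open the resulting request object instead of the bare URL string.

```python
import urllib.request

# Attach the header to a Request object; urlopen(req) would then send
# this User-Agent instead of the default "Python-urllib/3.x".
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request('https://example.com/page', headers=hdr)

# html = urllib.request.urlopen(req).read()  # performs the actual fetch
```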

Thanks :)

1 Answer:

Answer 0 (score: 3)

You can achieve the same thing with requests:
import requests
from bs4 import BeautifulSoup

hdrs = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}

for url in line_in_list:
    resp = requests.get(url, headers=hdrs)
    soup = BeautifulSoup(resp.content, 'html.parser')
    name = soup.find(attrs={'class': 'name'})
    description = soup.find(attrs={'class': 'description'})
    print(name.get_text(), ';', description.get_text())
#    time.sleep(5)

Hope it helps!
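As an aside, the 403 behaviour is easy to reproduce locally with the standard library alone. The sketch below (server and URL are invented for the demo) rejects any request whose User-Agent doesn't look like a browser, which is essentially what the scraped site is doing, and then shows the same request succeeding once the header is set.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Mimic a site that blocks non-browser clients.
        if 'Mozilla' in self.headers.get('User-Agent', ''):
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain')
            self.end_headers()
            self.wfile.write(b'ok')
        else:
            self.send_error(403)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

# Default urllib User-Agent ("Python-urllib/3.x") -> rejected with 403.
try:
    urllib.request.urlopen(url)
    status = 200
except urllib.error.HTTPError as e:
    status = e.code

# Browser-like User-Agent -> request succeeds.
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
body = urllib.request.urlopen(req).read()

server.shutdown()
print(status, body)
```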