I'm trying to implement a proxy in a Python scraper. However, it seems I can't pass a proxies argument to urlopen(), as suggested in the tutorials I've seen (maybe a version thing?!):
proxy = {'http' : 'http://example:8080' }
req = urllib.request.Request(Site,headers=hdr, proxies=proxy)
resp = urllib.request.urlopen(req).read()
So I tried to be smart and follow the documentation instead, which suggests building an opener. But an opener has no headers parameter; the docs propose something like opener.addheaders = [] (roughly the sketch below).
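This is roughly the documented pattern I was following (a minimal sketch without the proxy part; example.com stands in for the real site):

import urllib.request

## Build an opener, attach default headers as (name, value) tuples,
## then fetch pages through opener.open() instead of urlopen()
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
resp = opener.open('https://example.com').read()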
None of what I tried works. (Testing the proxy IP on its own does work.)
The approach below looks like the best one to me, but it throws a "cannot find the file" error, and I'm not sure why.
It would be great if you could show me how to use a proxy together with a full set of headers.
Code:
import bs4 as bs
import urllib.request
import ssl
import re
from pprint import pprint ## for printing out a readable dict. can be deleted afterwards
#########################################################
## Parsing with beautiful soup
#########################################################
ssl._create_default_https_context = ssl._create_unverified_context
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
Site = 'https://example.com'
proxy = {'http' : 'http://example:8080' }
def openPage(Site, hdr):
    ## IP check
    print('Actual IP', urllib.request.urlopen('http://httpbin.org/ip').read())
    req = urllib.request.Request(Site, headers=hdr)
    opener = urllib.request.FancyURLopener(proxy)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    ## IP check
    print('Fake IP', opener.open('http://httpbin.org/ip').read())
    resp = opener.open(req).read()
    ## soup = bs.BeautifulSoup(resp,'lxml')
    ## return(soup)

soup = openPage(Site,hdr)....
Error:
Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1990, in open_local_file
    stats = os.stat(localname)
FileNotFoundError: [WinError 2] The system cannot find the file specified: '<urllib.request.Request object at 0x000001D94816A908>'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 72, in <module>
    mainNav()
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 40, in mainNav
    soup = openPage(Site,hdr,ean)
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 32, in openPage
    resp = opener.open(req).read()
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1762, in open
    return getattr(self, name)(url)
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1981, in open_file
    return self.open_local_file(url)
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1992, in open_local_file
    raise URLError(e.strerror, e.filename)
urllib.error.URLError: <urlopen error The system cannot find the file specified>
Answer 0 (score: 0)
The following code worked for me. I switched from FancyURLopener to installing my own opener built from a ProxyHandler that uses the previously defined proxy dict. The headers are added afterwards.
def openPage(site, hdr, proxy):
    ## Create an opener that routes requests through the proxy
    proxy_support = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)
    ## addheaders expects a list of (name, value) tuples, not a dict
    opener.addheaders = list(hdr.items())
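For completeness, a minimal end-to-end sketch of this approach (the proxy address and example.com are placeholders; the https proxy entry and the BeautifulSoup call are assumptions based on the question's imports):

import urllib.request
import bs4 as bs

hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept-Language': 'en-US,en;q=0.8'}
## An https entry is needed as well if the target site is served over https
proxy = {'http': 'http://example:8080',
         'https': 'http://example:8080'}

def openPage(site, hdr, proxy):
    ## Route all requests made by this opener through the proxy
    proxy_support = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)  ## optional: makes plain urlopen() use it too
    ## addheaders takes a list of (name, value) tuples
    opener.addheaders = list(hdr.items())
    ## Check which IP the target server sees
    print('Fake IP', opener.open('http://httpbin.org/ip').read())
    resp = opener.open(site).read()
    return bs.BeautifulSoup(resp, 'lxml')

soup = openPage('https://example.com', hdr, proxy)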