Question

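Hello everyone, I am a beginner Python programmer and I need some help.

This is my Python code. It raises an error; please help me fix it:

urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>

Python code: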

# -*- coding: utf-8 -*-

import urllib.request
from lxml.html import parse

WEBSITE = 'http://allrecipes.com'

URL_PAGE = 'http://allrecipes.com/recipes/110/appetizers-and-snacks/deviled-eggs/?page='

START_PAGE = 1
END_PAGE = 5

def correct_str(s):
    return s.encode('utf-8').decode('ascii', 'ignore').strip()

for i in range(START_PAGE, END_PAGE+1):
    URL = URL_PAGE + str(i)
    HTML = urllib.request.urlopen(URL)

    page = parse(HTML).getroot()

    for elem in page.xpath('//*[@id="grid"]/article[not(contains(@class, "video-card"))]/a[1]'):
        href = WEBSITE + elem.get('href')
        title = correct_str(elem.find('h3').text)

        recipe_page = parse(urllib.request.urlopen(href)).getroot()
        print(correct_str(href))
        photo_url = recipe_page.xpath('//img[@class="rec-photo"]')[0].get('src')

        print('\nName:  |', title)
        print('Photo: |', photo_url)

When I run it from the command prompt with python, I get this error:

Traceback (most recent call last):
http://allrecipes.com/recipe/236225/crab-stuffed-deviled-eggs/
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1240, in do_open

    h.request(req.get_method(), req.selector, req.data, headers)
Name:  | Crab-Stuffed Deviled Eggs
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request
Photo: | http://images.media-allrecipes.com/userphotos/720x405/1091564.jpg
    self._send_request(method, url, body, headers)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1128, in _send_request
    self.endheaders(body)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1079, in endheaders
    self._send_output(message_body)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 911, in _send_output
    self.send(msg)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 854, in send
    self.connect()
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 826, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\socket.py", line 693, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\socket.py", line 732, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Ivan/Dropbox/parser/test.py", line 27, in <module>
    recipe_page = parse(urllib.request.urlopen(href)).getroot()
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
    response = self._open(req, data)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
    '_open', req)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
    result = func(*args)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1242, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>

Process finished with exit code 1

Answer 1

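I'll try to explain three main approaches for digging into a programming problem:

(1) Use a debugger. You can step through the code and inspect variables before they are used and before an exception is thrown. Python ships with pdb. For this problem, you would step through the code and print href before the call to urlopen().

(2) Assertions. Use Python's assert to state the assumptions your code makes. For example, you could assert not href.startswith('http')

(3) Logging. Log the relevant variables before they are used. This is the approach I used:

I added the following to your code...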

href = WEBSITE + elem.get('href')                                       
print(href)

and got...

Photo: | http://images.media-allrecipes.com/userphotos/720x405/1091564.jpg
http://allrecipes.comhttp://dish.allrecipes.com/how-to-boil-an-egg/

From this you can see where the getaddrinfo problem comes from: your system is trying to open a URL on a host named allrecipes.comhttp.
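To see why the lookup fails, the concatenated URL from the output above can be parsed with urllib.parse. This is only a minimal sketch; the variable names bad_url and bad_host are illustrative and do not appear in the original script.

```python
from urllib.parse import urlparse

# The URL printed above: WEBSITE prepended to an href that was already absolute.
bad_url = 'http://allrecipes.comhttp://dish.allrecipes.com/how-to-boil-an-egg/'

# Everything between '//' and the next '/' is treated as the network location,
# so the second 'http' ends up glued onto the host name.
bad_host = urlparse(bad_url).hostname
print(bad_host)
```

No DNS record exists for that name, which is consistent with getaddrinfo failing.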

The problem appears to stem from your assumption that WEBSITE must be prepended to every href you extract from the HTML.
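The assertion from point (2) would have flagged this assumption as soon as the first absolute href appeared. A minimal sketch, assuming a hypothetical helper named build_url that is not part of the original script:

```python
WEBSITE = 'http://allrecipes.com'

def build_url(href):
    # Hypothetical helper: makes the "href is always relative" assumption explicit.
    assert not href.startswith('http'), 'href is already absolute: ' + href
    return WEBSITE + href

# A relative href passes the assertion and is prefixed as before.
print(build_url('/recipe/236225/crab-stuffed-deviled-eggs/'))

# An absolute href trips the assertion instead of producing a broken URL.
try:
    build_url('http://dish.allrecipes.com/how-to-boil-an-egg/')
except AssertionError as exc:
    print('Assumption violated:', exc)
```

Keep in mind that assert statements are skipped when Python runs with the -O option, so this is a debugging aid rather than input validation.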

You can handle absolute versus relative hrefs with something like the following, which uses a function to determine if the url is absolute:

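from urllib.parse import urlparse  # Python 3; in Python 2 this was the urlparse module

def is_absolute(url):
    # See https://stackoverflow.com/questions/8357098/how-can-i-check-if-a-url-is-absolute-using-python
    return bool(urlparse(url).netloc)

href = elem.get('href')
# Prepend the site root only when the href has no host of its own.
if not is_absolute(href):
    href = WEBSITE + href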

Answer 2

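A better approach is to use urljoin from urllib.parse:

    # Import urljoin by name so that lxml's parse() is not shadowed.
    from urllib.parse import urljoin
    href = urljoin(WEBSITE, href)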

If href is not a complete URL, this returns the complete URL for it. An href that is already absolute keeps its own host.
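As a quick check of that behavior, here is a small sketch that runs urljoin on the two kinds of href seen in the question's output. The names relative_href and absolute_href are illustrative; the relative path is inferred from the recipe URL the script printed.

```python
from urllib.parse import urljoin

WEBSITE = 'http://allrecipes.com'

# One site-relative href and one href that already carries its own host.
relative_href = '/recipe/236225/crab-stuffed-deviled-eggs/'
absolute_href = 'http://dish.allrecipes.com/how-to-boil-an-egg/'

resolved_relative = urljoin(WEBSITE, relative_href)
resolved_absolute = urljoin(WEBSITE, absolute_href)

print(resolved_relative)
print(resolved_absolute)
```

The relative href is resolved against WEBSITE, while the absolute one still points at dish.allrecipes.com, so no separate is-absolute check is needed.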
