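Hello everyone, I'm a beginner Python programmer and I need help.
Here is my Python code. It raises the error below; please help me fix it.
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Python code: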
# -*- coding: utf-8 -*-
import urllib.request
from lxml.html import parse

WEBSITE = 'http://allrecipes.com'
URL_PAGE = 'http://allrecipes.com/recipes/110/appetizers-and-snacks/deviled-eggs/?page='
START_PAGE = 1
END_PAGE = 5

def correct_str(s):
    return s.encode('utf-8').decode('ascii', 'ignore').strip()

for i in range(START_PAGE, END_PAGE+1):
    URL = URL_PAGE + str(i)
    HTML = urllib.request.urlopen(URL)
    page = parse(HTML).getroot()
    for elem in page.xpath('//*[@id="grid"]/article[not(contains(@class, "video-card"))]/a[1]'):
        href = WEBSITE + elem.get('href')
        title = correct_str(elem.find('h3').text)
        recipe_page = parse(urllib.request.urlopen(href)).getroot()
        print(correct_str(href))
        photo_url = recipe_page.xpath('//img[@class="rec-photo"]')[0].get('src')
        print('\nName: |', title)
        print('Photo: |', photo_url)
When I run it with python from the command prompt, I get this error:
Traceback (most recent call last):
http://allrecipes.com/recipe/236225/crab-stuffed-deviled-eggs/
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1240, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
Name: | Crab-Stuffed Deviled Eggs
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request
Photo: | http://images.media-allrecipes.com/userphotos/720x405/1091564.jpg
self._send_request(method, url, body, headers)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1128, in _send_request
self.endheaders(body)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1079, in endheaders
self._send_output(message_body)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 911, in _send_output
self.send(msg)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 854, in send
self.connect()
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 826, in connect
(self.host,self.port), self.timeout, self.source_address)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\socket.py", line 693, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\socket.py", line 732, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/Ivan/Dropbox/parser/test.py", line 27, in <module>
recipe_page = parse(urllib.request.urlopen(href)).getroot()
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
response = self._open(req, data)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
'_open', req)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
result = func(*args)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1268, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\Ivan\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1242, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Process finished with exit code 1
Answer 0 (score: 2)
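I'll try to explain three main approaches to digging into a programming problem:
(1) Use a debugger. You can step through the code and inspect variables before they are used and before an exception is thrown. Python ships with pdb. For this problem, you would step through the code and print href just before the call to urlopen().
(2) Assertions. Use Python's assert to state the assumptions your code makes. For example, you could assert not href.startswith('http').
(3) Logging. Log the relevant variables before they are used. This is the approach I used: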
I added the print call below to your code...
href = WEBSITE + elem.get('href')
print(href)
and got...
Photo: | http://images.media-allrecipes.com/userphotos/720x405/1091564.jpg
http://allrecipes.comhttp://dish.allrecipes.com/how-to-boil-an-egg/
From this output you can see the cause of the getaddrinfo failure: your system is trying to open a URL on a host named allrecipes.comhttp.
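To see where that hostname comes from, here is a minimal sketch that rebuilds the failing URL and parses it with the standard library. The two string values are taken from the output above; the variable names bad_url and parts are only illustrative.

```python
from urllib.parse import urlparse

WEBSITE = 'http://allrecipes.com'
# This href is already absolute, so prefixing WEBSITE produces a broken URL.
href = 'http://dish.allrecipes.com/how-to-boil-an-egg/'

bad_url = WEBSITE + href        # same concatenation as in the question
parts = urlparse(bad_url)

print(bad_url)                  # the URL that urlopen() was asked to fetch
print(parts.hostname)           # the host name that DNS lookup is attempted for
```

No such host exists, so the lookup fails before any HTTP request is sent.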
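The problem seems to be your assumption that WEBSITE must be prepended to every href you extract from the HTML.
You can handle absolute versus relative hrefs with something like the following, which uses a function to determine whether the URL is absolute: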
from urllib.parse import urlparse  # Python 3; in Python 2 the module was named urlparse

def is_absolute(url):
    # See https://stackoverflow.com/questions/8357098/how-can-i-check-if-a-url-is-absolute-using-python
    return bool(urlparse(url).netloc)

href = elem.get('href')
if not is_absolute(href):
    # Only relative hrefs need the site prefix
    href = WEBSITE + href
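As a rough, self-contained sketch, the check above could be combined with approaches (2) and (3) from the list. The helper name resolve_href and the logger name are hypothetical, and the two sample hrefs are the ones visible in the output of the question; treat this as one possible arrangement rather than the only one.

```python
import logging
from urllib.parse import urlparse

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("scraper")  # hypothetical logger name

WEBSITE = 'http://allrecipes.com'


def is_absolute(url):
    """Return True when the URL carries its own host (netloc)."""
    return bool(urlparse(url).netloc)


def resolve_href(href):
    """Hypothetical helper: prefix WEBSITE only for relative hrefs."""
    log.debug("raw href: %r", href)  # approach (3): log before use
    if not is_absolute(href):
        href = WEBSITE + href
    # Approach (2): assert the assumption that the result has exactly one scheme
    assert href.count('://') == 1, href
    return href


print(resolve_href('/recipe/236225/crab-stuffed-deviled-eggs/'))
print(resolve_href('http://dish.allrecipes.com/how-to-boil-an-egg/'))
```

With the original unconditional concatenation, the assertion would have failed on the second href and pointed straight at the bad value.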
Answer 1 (score: 0)
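A better approach is to use urljoin from urllib.parse:
# Import urljoin directly: "from urllib import parse" would shadow the parse imported from lxml.html in the question
from urllib.parse import urljoin
href = urljoin(base_url, href)
If href is not a complete URL, this returns the complete URL; an href that is already absolute is returned unchanged. Here base_url would be the site's base URL, for example WEBSITE in the question's code.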
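A small sketch of how this behaves on the two kinds of href seen in the question. It assumes base_url is the site root (the same value as WEBSITE in the question); the names relative and absolute are only illustrative.

```python
from urllib.parse import urljoin

# Assumption: base_url is the site root, as WEBSITE is in the question.
base_url = 'http://allrecipes.com'

relative = '/recipe/236225/crab-stuffed-deviled-eggs/'
absolute = 'http://dish.allrecipes.com/how-to-boil-an-egg/'

# A relative href is resolved against base_url.
print(urljoin(base_url, relative))
# An absolute href already has a scheme and host, so it is returned as-is.
print(urljoin(base_url, absolute))
```

Because urljoin covers both cases, no separate is-absolute check is needed in the scraping loop.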