I am trying to open a URL with urlopen() and search its HTML for a word (like a crawler). But when I call urlopen() on a link after passing it through urljoin(), it fails. Is there any way to do this?
from urllib import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup

# urls (list of start pages), url1 (base URL) and file are defined earlier
while len(urls) > 0:
    htmltext = urlopen(urls[0]).read()
    soup = BeautifulSoup(htmltext)
    for tag in soup.findAll('a', href=True):
        # resolve relative links against the base URL
        tag['href'] = urljoin(url1, tag['href'])
        in_code = urlopen(tag['href'])  # <-- this is the line that raises the error
        print(in_code)
        # keep only links whose URL mentions 'student' (find returns -1 if absent)
        if tag['href'].find('student') != -1:
            file.write(tag['href'] + '\n')
    urls.pop()
file.close()
This is the error I get:
c:\Python27\crawler>webcrw.py
Traceback (most recent call last):
File "C:\Python27\crawler\webcrw.py", line 21, in <module>
in_code = urlopen(tag['href'])
File "C:\Python27\lib\urllib.py", line 86, in urlopen
return opener.open(url)
File "C:\Python27\lib\urllib.py", line 207, in open
return getattr(self, name)(url)
File "C:\Python27\lib\urllib.py", line 344, in open_http
h.endheaders(data)
File "C:\Python27\lib\httplib.py", line 954, in endheaders
self._send_output(message_body)
File "C:\Python27\lib\httplib.py", line 814, in _send_output
self.send(msg)
File "C:\Python27\lib\httplib.py", line 776, in send
self.connect()
File "C:\Python27\lib\httplib.py", line 757, in connect
self.timeout, self.source_address)
File "C:\Python27\lib\socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
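[Errno 11004] getaddrinfo failed means the hostname in the URL handed to urlopen() could not be resolved by DNS. A likely cause: some of the page's <a> hrefs are not fetchable HTTP links at all (mailto:, javascript:, plain fragments), and urljoin() passes them through unchanged, so urlopen() ends up trying to resolve a bogus host. A minimal sketch of a guard, assuming Python 2's urllib/urlparse as in the traceback; fetch_if_http is a hypothetical helper name, not part of any library:

from urllib import urlopen
from urlparse import urljoin, urlparse

def fetch_if_http(base, href):
    """Resolve href against base and fetch it only if it is an HTTP(S) URL."""
    full = urljoin(base, href)
    if urlparse(full).scheme not in ('http', 'https'):
        return None  # skip mailto:, javascript:, fragments, etc.
    try:
        return urlopen(full).read()
    except IOError as e:  # urllib raises IOError on DNS/socket failures
        print('skipping %s: %s' % (full, e))
        return None

Calling something like this instead of the bare urlopen(tag['href']) lets the loop skip unresolvable links instead of crashing on the first one.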