I am trying to open a URL with urlopen() and search its HTML for a word (like a crawler). But when I call urlopen() on a link after passing it through urljoin(), it fails. Is there any way to do this?
from urllib import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup

# urls (list of start pages), url1 (base URL) and file are defined earlier
while len(urls) > 0:
    htmltext = urlopen(urls[0]).read()
    soup = BeautifulSoup(htmltext)
    for tag in soup.findAll('a', href=True):
        # resolve relative links against the base URL
        tag['href'] = urljoin(url1, tag['href'])
        in_code = urlopen(tag['href'])  # <-- this is the line that raises the error
        print(in_code)
        # keep only links whose URL mentions 'student' (find returns -1 if absent)
        if tag['href'].find('student') != -1:
            file.write(tag['href'] + '\n')
    urls.pop()
file.close()
This is the error I get:
c:\Python27\crawler>webcrw.py
Traceback (most recent call last):
File "C:\Python27\crawler\webcrw.py", line 21, in <module>
in_code = urlopen(tag['href'])
File "C:\Python27\lib\urllib.py", line 86, in urlopen
return opener.open(url)
File "C:\Python27\lib\urllib.py", line 207, in open
return getattr(self, name)(url)
File "C:\Python27\lib\urllib.py", line 344, in open_http
h.endheaders(data)
File "C:\Python27\lib\httplib.py", line 954, in endheaders
self._send_output(message_body)
File "C:\Python27\lib\httplib.py", line 814, in _send_output
self.send(msg)
File "C:\Python27\lib\httplib.py", line 776, in send
self.connect()
File "C:\Python27\lib\httplib.py", line 757, in connect
self.timeout, self.source_address)
File "C:\Python27\lib\socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
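[Errno 11004] getaddrinfo failed means the hostname in the URL handed to urlopen() could not be resolved by DNS. A likely cause: some of the page's <a> hrefs are not fetchable HTTP links at all (mailto:, javascript:, plain fragments), and urljoin() passes them through unchanged, so urlopen() ends up trying to resolve a bogus host. A minimal sketch of a guard, assuming Python 2's urllib/urlparse as in the traceback; fetch_if_http is a hypothetical helper name, not part of any library:

from urllib import urlopen
from urlparse import urljoin, urlparse

def fetch_if_http(base, href):
    """Resolve href against base and fetch it only if it is an HTTP(S) URL."""
    full = urljoin(base, href)
    if urlparse(full).scheme not in ('http', 'https'):
        return None  # skip mailto:, javascript:, fragments, etc.
    try:
        return urlopen(full).read()
    except IOError as e:  # urllib raises IOError on DNS/socket failures
        print('skipping %s: %s' % (full, e))
        return None

Calling something like this instead of the bare urlopen(tag['href']) lets the loop skip unresolvable links instead of crashing on the first one.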