Python3 cannot establish a connection: socket.gaierror: Name or service not known

Time: 2019-11-11 09:31:38

Tags: python python-3.x python-requests

I am trying to run an email harvester. When I enter the URL manually, without the loop, I do not get any connection errors.

import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup


def email_harvest(starting_url):
    # starting url. replace google with your own url.
    #starting_url = 'http://www.miet.ac.in'
    print ('this is the starting url '+starting_url)   
    #starting_url = website_url[i]
#   i += 1
    # a queue of urls to be crawled
    unprocessed_urls = deque([starting_url])

    # set of already crawled urls for email
    processed_urls = set()

    # a set of fetched emails
    emails = set()

    # process urls one by one from unprocessed_url queue until queue is empty
    while len(unprocessed_urls):

        # move next url from the queue to the set of processed urls
        url = unprocessed_urls.popleft()
        processed_urls.add(url)

        # extract base url to resolve relative links
        parts = urlsplit(url)
        base_url = "{0.scheme}://{0.netloc}".format(parts)
        path = url[:url.rfind('/')+1] if '/' in parts.path else url
        print (url)
        # get url's content
        #print("Crawling URL %s" % url)
        try:
            response = requests.get(url)
            print (response.status_code)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            # ignore pages with errors and continue with next url
            print("error crawling %s" % url)
            continue

        # extract all email addresses and add them into the resulting set
        # You may edit the regular expression as per your requirement
        new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
        emails.update(new_emails)
        print(emails)
        # create a beautiful soup for the html document
        soup = BeautifulSoup(response.text, 'lxml')

        # Once this document is parsed and processed, now find and process all the anchors i.e. linked urls in this document
        for anchor in soup.find_all("a"):
            # extract link url from the anchor
            link = anchor.attrs["href"] if "href" in anchor.attrs else ''
            # resolve relative links (starting with /)
            if link.startswith('/'):
                link = base_url + link
            elif not link.startswith('http'):
                link = path + link
            # add the new url to the queue if it was not in unprocessed list nor in processed list yet
            if not link in unprocessed_urls and not link in processed_urls:
                unprocessed_urls.append(link)


website_url = tuple(open('text.txt','r'))
i = 0
while i < len(website_url):
    print (i)
    starting_url = 'http://'+ website_url[i]
    email_harvest(starting_url)
    i +=1

However, when I load the URLs from a file, I get the following "Name or service not known" error:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 166, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x...>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.miet.ac.in%0a', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x...>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "editog.py", line 39, in email_harvest
    response = requests.get(url)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.miet.ac.in%0a', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x...>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Notes:

  1. I am not behind any proxy, and there is no filtering.
  2. My internet connection is stable.
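One quick way to see what is actually being requested (a debugging sketch, not part of the question's code) is to print the repr() of the URL, which makes invisible characters such as a trailing newline visible:

```python
# Debugging sketch with a hypothetical value: repr() exposes characters
# that print() hides, such as the trailing "\n" each line of a file keeps.
line = 'www.miet.ac.in\n'           # what tuple(open('text.txt')) yields per line
starting_url = 'http://' + line
print(repr(starting_url))           # 'http://www.miet.ac.in\n'
```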

2 Answers:

Answer 0 (score: 2):

It looks like the connection is trying to reach an invalid URL.

HTTPConnectionPool(host='www.miet.ac.in%0a', port=80)

Is this URL ('www.miet.ac.in%0a') valid? I can reach 'www.miet.ac.in', but not 'www.miet.ac.in%0a'.

If it is valid, could you also add what you ran without the loop?
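The %0a in the host name is the percent-encoding of a newline character, so the URL cannot be valid. A minimal check (note that urllib.parse.quote emits uppercase hex, while urllib3 lowercases host names, hence %0a in the traceback):

```python
from urllib.parse import quote

# A newline percent-encodes to %0A, which is what shows up
# (lowercased by urllib3) as www.miet.ac.in%0a in the traceback.
host = 'www.miet.ac.in\n'
print(quote(host))  # www.miet.ac.in%0A
```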

Answer 1 (score: 1):

host='www.miet.ac.in%0a', port=80

The problem is in how you build the URL string: each line returned by tuple(open('text.txt','r')) keeps its trailing newline, which gets percent-encoded as %0a in the host name. Strip the newline before concatenating.
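A minimal sketch of the fix (load_urls is a hypothetical helper, not from the question's code): strip each line before concatenating, so no newline reaches the host name.

```python
def load_urls(path):
    # Lines read from a file keep their trailing "\n"; requests then
    # percent-encodes it into the host (%0a) and DNS resolution fails.
    # strip() removes it; blank lines are skipped.
    with open(path) as f:
        return ['http://' + line.strip() for line in f if line.strip()]
```

Used as: for starting_url in load_urls('text.txt'): email_harvest(starting_url).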