I am trying to run an email harvester. When I enter the URL manually, without the loop, I don't get any connection errors.
import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
def email_harvest(starting_url):
    # starting url. replace google with your own url.
    #starting_url = 'http://www.miet.ac.in'
    print('this is the starting url ' + starting_url)
    #starting_url = website_url[i]
    # i += 1

    # a queue of urls to be crawled
    unprocessed_urls = deque([starting_url])
    # set of already crawled urls for email
    processed_urls = set()
    # a set of fetched emails
    emails = set()

    # process urls one by one from the unprocessed_urls queue until the queue is empty
    while len(unprocessed_urls):
        # move next url from the queue to the set of processed urls
        url = unprocessed_urls.popleft()
        processed_urls.add(url)

        # extract base url to resolve relative links
        parts = urlsplit(url)
        base_url = "{0.scheme}://{0.netloc}".format(parts)
        path = url[:url.rfind('/') + 1] if '/' in parts.path else url
        print(url)

        # get url's content
        #print("Crawling URL %s" % url)
        try:
            response = requests.get(url)
            print(response.status_code)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            # ignore pages with errors and continue with next url
            print("error crawling %s" % url)
            continue

        # extract all email addresses and add them into the resulting set
        # You may edit the regular expression as per your requirement
        new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
        emails.update(new_emails)
        print(emails)

        # create a beautiful soup for the html document
        soup = BeautifulSoup(response.text, 'lxml')

        # once this document is parsed and processed, find and process all the anchors, i.e. linked urls, in this document
        for anchor in soup.find_all("a"):
            # extract link url from the anchor
            link = anchor.attrs["href"] if "href" in anchor.attrs else ''
            # resolve relative links (starting with /)
            if link.startswith('/'):
                link = base_url + link
            elif not link.startswith('http'):
                link = path + link
            # add the new url to the queue if it is in neither the unprocessed list nor the processed list yet
            if link not in unprocessed_urls and link not in processed_urls:
                unprocessed_urls.append(link)
website_url = tuple(open('text.txt', 'r'))
i = 0
while i < (len(website_url) + 1):
    print(i)
    starting_url = 'http://' + website_url[i]
    email_harvest(starting_url)
    i += 1
However, when I load the URLs from a file I get the following "Name or service not known" error:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 166, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <...>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.miet.ac.in%0a', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<...>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "editog.py", line 39, in email_harvest
    response = requests.get(url)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.miet.ac.in%0a', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<...>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Answer 0 (score: 2)
It looks like the request is being made against an invalid URL:
HTTPConnectionPool(host='www.miet.ac.in%0a', port=80)
Is this URL ('www.miet.ac.in%0a') valid? I can reach 'www.miet.ac.in', but not 'www.miet.ac.in%0a'.
If it is valid, could you also post the version you ran without the loop?
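For reference, the '%0a' suffix is just a percent-encoded newline. A minimal standard-library check (the host string here is only an illustration) shows how a line read from a file, with its trailing '\n' still attached, ends up looking like the host in the error:

    from urllib.parse import quote

    host = 'www.miet.ac.in\n'   # a line read from a file keeps its trailing newline
    print(quote(host))          # www.miet.ac.in%0A -- the encoded '\n', the same suffix as in the failing host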
Answer 1 (score: 1)
host='www.miet.ac.in%0a', port=80
The problem is in your string interpolation: each line read from text.txt still ends with a newline character, and that '\n' ends up in the host name, where it is percent-encoded as '%0a'.
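A minimal sketch of one possible fix (assuming the file is text.txt, as in the question): strip each line before building the URL, and iterate over the list directly, which also avoids the off-by-one index in the original while loop:

    # read hosts from the file, dropping trailing newlines and blank lines
    with open('text.txt', 'r') as f:
        website_urls = [line.strip() for line in f if line.strip()]

    for host in website_urls:
        starting_url = 'http://' + host   # no stray '\n', so no '%0a' in the host name
        email_harvest(starting_url)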