I am trying to run an email harvester. When I enter the URL manually, without the loop, I don't get any connection errors.
import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
def email_harvest(starting_url):
    # starting url. replace google with your own url.
    #starting_url = 'http://www.miet.ac.in'
    print('this is the starting url ' + starting_url)
    #starting_url = website_url[i]
    # i += 1

    # a queue of urls to be crawled
    unprocessed_urls = deque([starting_url])
    # set of already crawled urls for email
    processed_urls = set()
    # a set of fetched emails
    emails = set()

    # process urls one by one from the unprocessed_urls queue until the queue is empty
    while len(unprocessed_urls):
        # move next url from the queue to the set of processed urls
        url = unprocessed_urls.popleft()
        processed_urls.add(url)

        # extract base url to resolve relative links
        parts = urlsplit(url)
        base_url = "{0.scheme}://{0.netloc}".format(parts)
        path = url[:url.rfind('/') + 1] if '/' in parts.path else url
        print(url)

        # get url's content
        #print("Crawling URL %s" % url)
        try:
            response = requests.get(url)
            print(response.status_code)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            # ignore pages with errors and continue with next url
            print("error crawling %s" % url)
            continue

        # extract all email addresses and add them into the resulting set
        # You may edit the regular expression as per your requirement
        new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
        emails.update(new_emails)
        print(emails)

        # create a beautiful soup for the html document
        soup = BeautifulSoup(response.text, 'lxml')

        # once this document is parsed and processed, find and process all the anchors, i.e. linked urls, in this document
        for anchor in soup.find_all("a"):
            # extract link url from the anchor
            link = anchor.attrs["href"] if "href" in anchor.attrs else ''
            # resolve relative links (starting with /)
            if link.startswith('/'):
                link = base_url + link
            elif not link.startswith('http'):
                link = path + link
            # add the new url to the queue if it is in neither the unprocessed list nor the processed list yet
            if link not in unprocessed_urls and link not in processed_urls:
                unprocessed_urls.append(link)
website_url = tuple(open('text.txt', 'r'))
i = 0
while i < (len(website_url) + 1):
    print(i)
    starting_url = 'http://' + website_url[i]
    email_harvest(starting_url)
    i += 1
However, when I load the URLs from a file I get the following "Name or service not known" error:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 166, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <...>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.miet.ac.in%0a', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<...>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "editog.py", line 39, in email_harvest
    response = requests.get(url)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.miet.ac.in%0a', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<...>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Answer 0 (score: 2)
It looks like the request is being made against an invalid URL:
HTTPConnectionPool(host='www.miet.ac.in%0a', port=80)
Is this URL ('www.miet.ac.in%0a') valid? I can reach 'www.miet.ac.in', but not 'www.miet.ac.in%0a'.
If it is valid, could you also post the version you ran without the loop?
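For reference, the '%0a' suffix is just a percent-encoded newline. A minimal standard-library check (the host string here is only an illustration) shows how a line read from a file, with its trailing '\n' still attached, ends up looking like the host in the error:

    from urllib.parse import quote

    host = 'www.miet.ac.in\n'   # a line read from a file keeps its trailing newline
    print(quote(host))          # www.miet.ac.in%0A -- the encoded '\n', the same suffix as in the failing host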
Answer 1 (score: 1)
host='www.miet.ac.in%0a', port=80
The problem is in your string interpolation: each line read from text.txt still ends with a newline character, and that '\n' ends up in the host name, where it is percent-encoded as '%0a'.
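A minimal sketch of one possible fix (assuming the file is text.txt, as in the question): strip each line before building the URL, and iterate over the list directly, which also avoids the off-by-one index in the original while loop:

    # read hosts from the file, dropping trailing newlines and blank lines
    with open('text.txt', 'r') as f:
        website_urls = [line.strip() for line in f if line.strip()]

    for host in website_urls:
        starting_url = 'http://' + host   # no stray '\n', so no '%0a' in the host name
        email_harvest(starting_url)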