On Windows 10 with Python 3.6, I am trying to evaluate a column of URLs. I have a csv file with a single column of URLs, some of which are shortened. If you want to reproduce the results, you can create a one-column csv file containing these URLs:
external_urls
http##://rviv.ly/NdL..
http##://rviv.ly/kDH..
http##://rviv.ly/GA7..
http##://rviv.ly/zCZ...
http##://rviv.ly/46HW...
http://bit####ly/2GzanWa # replace the '###' with '.' Links to https://www.careerarc.com/job-search/linquest-corporation-jobs.html?listing_not_found=true
https##://www.sec.gov/news/press-release/2018-41
My actual table is very large, with roughly 100,000+ URLs to evaluate. The code below seems to crash inconsistently (I still need to verify this, but I swear I got different failures last night). It gives the following error traceback:
\lib\socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed
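For what it's worth, `getaddrinfo` is the hostname-to-address lookup step, so this error means the host string reaching `HTTPConnection` could not be resolved. One way I can reproduce an empty host (shown with a made-up URL) is a scheme-less URL, which leaves `urlparse` with an empty `netloc`:

```python
from urllib.parse import urlparse

# Without a scheme, everything lands in .path and .netloc stays empty,
# so HTTPConnection('') fails its getaddrinfo lookup.
print(urlparse('rviv.ly/NdL').netloc)         # ''
print(urlparse('http://rviv.ly/NdL').netloc)  # 'rviv.ly'
```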
My code:
import http.client
import pandas as pd
from urllib.parse import urlparse

def unshorten_url(url):
    parsed = urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)  # error traces to this line
    response = h.getresponse()
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url

print("calculating most frequent url domains")
df = pd.read_csv(PATH_TO_Hq_CSV)
clean_url_lst = []
domain_lst = []
domain_dict = {}
for urls_ in df['external_url']:
    print(urls_)
    if str(urls_) == "nan":
        continue
    else:
        o = unshorten_url(str(urls_))
        print("URL: \t", str(o))  # still prints the shortened url
What does this error mean?
I think that unless I can find a general workaround for this error, I won't be able to unshorten URLs in Python 3.
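Edit: a sketch of a workaround I'm considering (not yet tested at 100k scale): skip URLs whose `netloc` is empty, and guard the HEAD request with a try/except around `socket.gaierror` so an unresolvable host just falls back to the original URL. The 5-second timeout is an arbitrary choice:

```python
import socket
import http.client
from urllib.parse import urlparse

def unshorten_url(url):
    """Follow one redirect hop; fall back to the input on resolution errors."""
    parsed = urlparse(url)
    if not parsed.netloc:  # malformed/scheme-less URL: nothing to resolve
        return url
    try:
        h = http.client.HTTPConnection(parsed.netloc, timeout=5)
        h.request('HEAD', parsed.path or '/')
        response = h.getresponse()
    except (socket.gaierror, OSError):
        return url  # DNS failure, timeout, refused connection: keep original
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    return url

print(unshorten_url('rviv.ly/NdL'))  # no scheme -> returned unchanged
```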
Answer 0 (score: 0)
Why not try the requests package?
import requests

url = 'http://fb.com'
try:
    response = requests.get(url)
except Exception as e:
    print('Bad url {url}. {e}'.format(url=url, e=e))
else:
    # only reached on success, so `response` is always defined here
    print(response.url)
    print([redirect.url for redirect in response.history])
"""
# Output
>> https://www.facebook.com/
>> ['http://fb.com/', 'https://fb.com/']
"""