Unshortening URLs from a large dataset in Python 3.x, socket.gaierror, getaddrinfo failed

Date: 2018-07-18 15:20:19

Tags: python url python-requests urllib

Using Python 3.6 on Windows 10, I am trying to evaluate a column of URLs. I have a csv file with a single column of URLs, some of which are shortened. If you want to reproduce the results, you can create a one-column csv file from these URLs:

external_urls
http##://rviv.ly/NdL..
http##://rviv.ly/kDH..
http##://rviv.ly/GA7..
http##://rviv.ly/zCZ...
http##://rviv.ly/46HW...
http://bit####ly/2GzanWa # replace the '####' with '.' Links to https://www.careerarc.com/job-search/linquest-corporation-jobs.html?listing_not_found=true
https##://www.sec.gov/news/press-release/2018-41

My actual table is very large, with roughly 100,000+ URLs to evaluate. The code below seems to crash inconsistently (I still need to verify this, but I swear I got different failures last night). It gives the following error:

Error traceback:

\lib\socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed
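
As far as I can tell, [Errno 11001] means getaddrinfo could not resolve the hostname via DNS, so the failure happens before any HTTP request is even sent. A minimal reproduction, assuming a hostname that is guaranteed not to resolve:

import socket

# getaddrinfo is the DNS-lookup step that HTTPConnection performs internally;
# an unresolvable host fails here, before any bytes go over the wire
try:
    socket.getaddrinfo("no-such-host.invalid", 80)
except socket.gaierror as e:
    print(e)  # on Windows: [Errno 11001] getaddrinfo failed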

My code:

import http.client
import pandas as pd
from urllib.parse import urlparse

def unshorten_url(url):
    parsed = urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)  # error traces to this line
    response = h.getresponse()
    # floor division: any 3xx status with a Location header is a redirect
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url

print("calculating most frequent url domains")
df = pd.read_csv(PATH_TO_Hq_CSV)  # placeholder for the csv described above
clean_url_lst = []
domain_lst = []
domain_dict = {}
for urls_ in df['external_urls']:  # column name from the sample csv
    print(urls_)
    if str(urls_) == "nan":  # skip missing values
        continue
    else:
        o = unshorten_url(str(urls_))
        print("URL: \t", str(o))  # still prints the shortened url

What does this error mean?

I don't think I will be able to unshorten URLs in Python 3 unless I can find a general workaround for this error.
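
A sketch of the kind of workaround I have in mind, built around the unshorten_url function above (the safe_unshorten name is hypothetical, and I am not sure this is the right exception set):

import socket

# hypothetical wrapper: skip rows whose host cannot be resolved instead of
# letting one bad URL abort the whole 100,000-row loop
def safe_unshorten(url):
    try:
        return unshorten_url(url)
    except OSError as e:  # socket.gaierror is a subclass of OSError
        print("skipping", url, ":", e)
        return None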

1 Answer:

Answer 0 (score: 0)

Why not try the requests package?

import requests

url = 'http://fb.com'
try:
    response = requests.get(url)
except Exception as e:
    print('Bad url {url}. {e}'.format(url=url, e=e))
else:
    # response.url is the final url after redirects;
    # response.history holds the redirect chain that led there
    print(response.url)
    print([redirect.url for redirect in response.history])

"""
# Output
>> https://www.facebook.com/ 
>> ['http://fb.com/', 'https://fb.com/']

"""