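I want to be able to take a shortened or non-shortened URL and return its unshortened form. How can I write a Python program to do this?
Additional clarification:
e.g. bit.ly/silly in the input array
should be google.com in the output array
e.g. google.com in the input array
should be google.com in the output array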
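Answer 0 (score: 35)
Send an HTTP HEAD request to the URL and look at the response code. If the code is 30x, look at the Location header to get the unshortened URL. Otherwise, if the code is 20x, the URL is not redirected; you probably also want to handle error codes (4xx and 5xx) in some fashion. For example: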
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
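For Python 3, the port that the comment above describes might look like the following. This is only a sketch under a few assumptions that are not in the original answer: the `timeout` parameter, the fallback to "/" for an empty path, keeping the query string, HTTPS support, and resolving a relative Location header with `urljoin`. Like the original, it follows a single redirect hop and expects the URL to include a scheme.

```python
import http.client
from urllib.parse import urlparse, urljoin


def unshorten_url(url, timeout=10):
    """Return the redirect target of `url`, or `url` itself if it does not redirect."""
    parsed = urlparse(url)
    if parsed.scheme == "https":
        conn = http.client.HTTPSConnection(parsed.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parsed.netloc, timeout=timeout)
    # An empty path is not a valid request target, so fall back to "/".
    target = parsed.path or "/"
    if parsed.query:
        target += "?" + parsed.query
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status // 100 == 3 and location:
            # Location may be relative; resolve it against the requested URL.
            return urljoin(url, location)
        return url
    finally:
        conn.close()
```

Calling `unshorten_url("http://bit.ly/silly")` would return whatever the shortener puts in its Location header, which may itself be another redirect.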
Answer 1 (score: 20)
Using requests:
import requests
session = requests.Session() # so connections are recycled
resp = session.head(url, allow_redirects=True)
print(resp.url)
Answer 2 (score: 5)
Unshorten.me has an API that lets you send a JSON or XML request and get the full URL back.
Answer 3 (score: 4)
Open the URL and see what it resolves to:
>>> import urllib2
>>> a = urllib2.urlopen('http://bit.ly/cXEInp')
>>> print a.url
http://www.flickr.com/photos/26432908@N00/346615997/sizes/l/
>>> a = urllib2.urlopen('http://google.com')
>>> print a.url
http://www.google.com/
Answer 4 (score: 2)
If you are using Python 3.5+, you can use the Unshortenit module, which makes this very simple:
from unshortenit import UnshortenIt
unshortener = UnshortenIt()
uri = unshortener.unshorten('https://href.li/?https://example.com')
Answer 5 (score: 1)
http://github.com/stef/urlclean
sudo pip install urlclean
import urlclean
urlclean.unshorten(url)
Answer 6 (score: 1)
Here is source code that takes almost all of the useful corner cases into account.
The source code is on GitHub at https://github.com/amirkrifa/UnShortenUrl
Comments are welcome...
# Python 2 code (httplib, urlparse)
import logging
import traceback
import urlparse
import httplib

logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            # Keep the query string; some shorteners put the target in it
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except:
                traceback.print_exc()
                return url
            logging.info('Response status: %d' % response.status)
            if response.status/100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                # Stop if the redirect points straight back at the previous URL
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except:
            traceback.print_exc()
            return None
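The recursive class above only notices a redirect that points straight back at the previous URL. A Python 3 variant of the same idea, written as a loop with a hop limit and a set of visited URLs, might look like the sketch below. The names `head_location`, `unshorten` and `max_hops` are made up for illustration and do not come from the linked repository; the User-Agent header is copied from the answer.

```python
import http.client
from urllib.parse import urlparse, urljoin


def head_location(url, timeout=10):
    """Send one HEAD request; return the resolved redirect target, or None."""
    parsed = urlparse(url)
    conn_cls = (http.client.HTTPSConnection if parsed.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parsed.netloc, timeout=timeout)
    target = (parsed.path or "/") + ("?" + parsed.query if parsed.query else "")
    try:
        conn.request("HEAD", target, headers={"User-Agent": "curl/7.38.0"})
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status // 100 == 3 and location:
            # Location may be relative; resolve it against the requested URL.
            return urljoin(url, location)
        return None
    finally:
        conn.close()


def unshorten(url, max_hops=10):
    """Follow redirects until a non-redirect response, a loop, or max_hops."""
    seen = {url}
    for _ in range(max_hops):
        try:
            next_url = head_location(url)
        except (OSError, http.client.HTTPException):
            return url  # request failed: keep the last URL we reached
        if next_url is None or next_url in seen:
            return url  # final destination, or a redirect loop
        seen.add(next_url)
        url = next_url
    return url
```

Unlike the original, this version returns the last URL reached on a failure instead of None; whether that is preferable depends on how the caller wants to treat errors.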
Answer 7 (score: 1)
To unshorten a URL, you can use requests. This is a simple solution that works for me.
import requests
url = "http://foo.com"
site = requests.get(url)
print(site.url)
Answer 8 (score: 1)
You can use geturl():
from urllib.request import urlopen
url = "http://bit.ly/silly"  # urlopen needs an explicit scheme
unshortened_url = urlopen(url).geturl()
print(unshortened_url)
# google.com
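Since the question asks for an array in and an array out, and its example inputs have no scheme, the geturl() approach could be wrapped roughly as follows. This is a sketch, not a tested production helper: `ensure_scheme`, `resolve`, `unshorten_all` and the `resolver` parameter are names invented here, and defaulting to "http" for scheme-less input is an assumption.

```python
from urllib.request import urlopen


def ensure_scheme(url, default="http"):
    """Prefix scheme-less input such as 'bit.ly/silly' with a default scheme."""
    return url if "://" in url else f"{default}://{url}"


def resolve(url, timeout=10):
    """Return the final URL after urlopen has followed any redirects."""
    with urlopen(ensure_scheme(url), timeout=timeout) as response:
        return response.geturl()


def unshorten_all(urls, resolver=resolve):
    """Map each input URL to its final URL; keep the input if resolving fails."""
    results = []
    for url in urls:
        try:
            results.append(resolver(url))
        except (OSError, ValueError):
            # URLError and HTTPError are OSError subclasses; ValueError covers bad URLs.
            results.append(url)
    return results
```

The `resolver` argument is there so that a different strategy (for example a HEAD-based function) can be swapped in without changing the loop.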