Getting the full URL behind a shortened URL with Python

Date: 2014-08-11 14:22:22

Tags: python

I have a list of URLs,

l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']

I just want to see the full URL behind each short URL in that list.

Here is my approach:

import urllib2

for i in l:
    # urlopen follows redirects, so .url gives the final expanded URL;
    # the scheme must be prepended because the list entries lack it
    print urllib2.urlopen("http://" + i).url

But when the list contains thousands of URLs, the program takes a very long time.

My question: is there a way to reduce the execution time, or some other approach I should follow?

2 answers:

Answer 0 (score: 7)

First approach

As suggested, one way to accomplish the task is to use the official api to bitly, which however is subject to limits (e.g., no more than 15 shortUrls per request).
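For illustration, here is a minimal sketch of that batched approach against bitly's (now legacy) v3 /expand endpoint. It assumes you have registered for an API access token; the token value, the expand_batch helper, and the chunking loop are illustrative, not part of the original answer:

import requests

ACCESS_TOKEN = "YOUR_BITLY_ACCESS_TOKEN"  # placeholder: obtain a real token from bitly

def expand_batch(short_urls):
    # /v3/expand accepts the shortUrl parameter repeated up to 15 times;
    # requests encodes a list value as repeated query parameters
    resp = requests.get("https://api-ssl.bitly.com/v3/expand",
                        params={"access_token": ACCESS_TOKEN,
                                "shortUrl": ["http://" + u for u in short_urls]})
    return [entry.get("long_url") for entry in resp.json()["data"]["expand"]]

l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc']

# walk the list in chunks of 15 to respect the per-request limit
for start in range(0, len(l), 15):
    for long_url in expand_batch(l[start:start+15]):
        print long_url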

Second approach

As an alternative, one could avoid fetching the content at all, e.g. by using the HEAD HTTP method instead of GET. Here is just a sample code that makes use of the excellent requests package:

import requests

l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']

for i in l:
    # requests does not follow redirects for HEAD by default, so the
    # Location header of the 301 response carries the expanded URL
    print requests.head("http://"+i).headers['location']
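The loop above still issues one HEAD request at a time, so with thousands of URLs the total time is dominated by sequential network round trips. Here is a rough sketch of overlapping them with a thread pool from the standard library's multiprocessing.dummy; the pool size of 20, the 5-second timeout, and the expand helper are arbitrary illustrative choices:

import requests
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API

def expand(short):
    try:
        # HEAD is not redirected automatically, so Location holds the long URL
        return requests.head("http://" + short, timeout=5).headers.get('location')
    except requests.RequestException as e:
        return str(e)

l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc']

pool = Pool(20)  # 20 worker threads; tune to taste
for short, full in zip(l, pool.map(expand, l)):
    print short, full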

Answer 1 (score: 0)

I would try twisted's asynchronous web client. Be careful, though: it does no rate limiting at all.

#!/usr/bin/python2.7

from twisted.internet import reactor
from twisted.internet.defer import DeferredList, DeferredLock
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import Agent, HTTPConnectionPool
from pprint import pprint
from collections import defaultdict
from urlparse import urlparse
from random import randrange
import fileinput

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 16
agent = Agent(reactor, pool)
locks = defaultdict(DeferredLock)
locations = {}

def getLock(url, simultaneous=1):
    # One DeferredLock per (host, slot) pair caps concurrency per host
    return locks[urlparse(url).netloc, randrange(simultaneous)]

@inlineCallbacks
def getMapping(url):
    # Limit ourselves to 4 simultaneous connections per host.
    # Tweak this as desired, but make sure that it is no larger than
    # pool.maxPersistentPerHost
    lock = getLock(url, 4)
    yield lock.acquire()
    try:
        # Agent does not follow redirects by default, so the Location
        # header of the redirect response holds the expanded URL
        resp = yield agent.request('HEAD', url)
        locations[url] = resp.headers.getRawHeaders('location', [None])[0]
    except Exception as e:
        locations[url] = str(e)
    finally:
        lock.release()


# fileinput yields one URL per line from stdin or the files named on the command line
dl = DeferredList([getMapping(url.strip()) for url in fileinput.input()])
dl.addCallback(lambda _: reactor.stop())

reactor.run()
pprint(locations)
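For reference, the script expects one full URL (including the http:// scheme) per line, so a hypothetical invocation would look like this (expand.py and urls.txt are placeholder names):

python expand.py urls.txt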