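I have the following URL: http://bit.ly/cDdh1c
When you paste that URL into a browser and press Enter, it redirects to http://www.kennystopproducts.info/Top/?hop=arnishad
But when I use a Python program (code below) to find the base URL for the same URL, http://bit.ly/cDdh1c, that is, the final URL after following all redirects, I get http://www.cbtrends.com/ as the base URL instead. See the log file below.
Why does the same URL behave differently in the browser and in the Python program? What should I change in the Python program so that it ends up at the correct URL? I would like to understand how this odd behavior comes about.
Other URLs where I observed similar behavior:
http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509 (via the Python program)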
    maxattempts = 5
    turl = url
    while (maxattempts > 0) :
        host,path = urlparse.urlsplit(turl)[1:3]
        if len(host.strip()) == 0 :
            return None
        try:
            connection = httplib.HTTPConnection(host,timeout=10)
            connection.request("HEAD", path)
            resp = connection.getresponse()
        except:
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            self.logger.debug("The present %s is a redirection one" %turl)
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299) :
            self.logger.debug("The present url %s is a proper one" %turl)
            return turl
        else :
            #some problem with this url
            return None
    return None
Log file for your reference:
2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/
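Answer 0: (score: 1)
Your problem is that when you call urlsplit, your path variable contains only the path and is missing the query string, so the request you send is for a different resource than the one the redirect pointed to.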
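To make that concrete, here is a small sketch of what urlsplit returns for the second URL in the log. It is written for Python 3, where the function lives in urllib.parse rather than urlparse (that is my assumption about the interpreter; the call itself behaves the same), and the URL is a shortened version of the logged one with the productid parameter left out.

```python
from urllib.parse import urlsplit

# Shortened version of the second URL from the log (productid omitted).
url = "http://www.cbtrends.com/get-product.html?affid=arnishad&tid=arnishad"

parts = urlsplit(url)
host, path = parts[1:3]   # what the question's code extracts
query = parts[3]          # what it leaves behind

print(host)   # www.cbtrends.com
print(path)   # /get-product.html
print(query)  # affid=arnishad&tid=arnishad
```

Requesting only path therefore asks the server for /get-product.html with no parameters at all.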
So, try:
    import httplib
    import urlparse

    def getUrl(url):
        maxattempts = 10
        turl = url
        while maxattempts > 0:
            host, path, query = urlparse.urlsplit(turl)[1:4]
            if len(host.strip()) == 0:
                return None
            # Build the request target from the path AND the query string.
            target = path or '/'
            if query:
                target += '?' + query
            try:
                connection = httplib.HTTPConnection(host, timeout=10)
                connection.request("GET", target)
                resp = connection.getresponse()
            except Exception:
                return None
            maxattempts = maxattempts - 1
            if 300 <= resp.status <= 399:
                # Redirect: follow the Location header on the next pass.
                turl = resp.getheader('location')
            elif 200 <= resp.status <= 299:
                return turl
            else:
                # some problem with this url
                return None
        return None

    print getUrl('http://bit.ly/cDdh1c')
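For anyone on Python 3, the same loop can be sketched roughly as follows. This is only an illustration under a few assumptions of mine: httplib and urlparse are replaced by http.client and urllib.parse, the names fetch_status and resolve are made up, and the fetch step is passed in as a parameter so the redirect logic can be exercised without touching the network. It also resolves relative Location headers with urljoin, which the code above does not do.

```python
import http.client
from urllib.parse import urlsplit, urljoin


def fetch_status(url, timeout=10):
    """Issue one GET and return (status, Location header or None)."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query   # keep the query string
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve(url, fetch=fetch_status, max_hops=10):
    """Follow redirects from url; return the final URL or None on failure."""
    current = url
    for _ in range(max_hops):
        if not urlsplit(current).netloc:
            return None
        try:
            status, location = fetch(current)
        except (OSError, http.client.HTTPException):
            return None
        if 300 <= status <= 399 and location:
            # Location may be relative, so resolve it against the current URL.
            current = urljoin(current, location)
        elif 200 <= status <= 299:
            return current
        else:
            return None
    return None
```

Calling resolve('http://bit.ly/cDdh1c') would go through the real network path. No output is shown here because the links from the question may no longer resolve the same way.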
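Answer 1: (score: 1)
Your problem comes from this line:
    host,path = urlparse.urlsplit(turl)[1:3]
You are leaving out the query string. So, in the example log you provided, the second HEAD request you make goes to http://www.cbtrends.com/get-product.html without the GET parameters. Open that URL in a browser and you will see that it redirects to http://www.cbtrends.com/.
You have to use the remaining elements of the tuple returned by urlsplit, not just the path, to compute the path you request.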
    parts = urlparse.urlsplit(turl)
    host = parts[1]
    # parts[2:5] is (path, query, fragment)
    path = "%s?%s#%s" % parts[2:5]
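One caveat with the format string above: it always emits the ? and #, even when the query and fragment are empty, and it includes the fragment, which browsers do not send to the server. A slightly more defensive variant might look like the sketch below. It is written for Python 3's urllib.parse, and request_target is a name I made up for illustration.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return (host, target) where target is the path plus query string."""
    parts = urlsplit(url)
    target = parts.path or "/"        # an empty path means the site root
    if parts.query:
        target += "?" + parts.query   # keep the GET parameters
    # parts.fragment is deliberately dropped: it is not sent to the server.
    return parts.netloc, target


print(request_target("http://www.cbtrends.com/get-product.html?affid=arnishad#top"))
# ('www.cbtrends.com', '/get-product.html?affid=arnishad')
```

The returned target can be passed directly as the second argument to connection.request.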