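I want to parse a set of URLs, so I need to concatenate an integer into the URL at the point where the page ID changes. I put %count% in the middle of the URL, but it does not seem to work. How can I concatenate it?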
count=2
while (count < pages):
    mech = Browser()
    url = 'http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
    url = int(raw_input(url))
    mech = Browser()
    page = mech.open(url)
    soup = BeautifulSoup(page)
    print url
    for thediv in soup.findAll('li',{'class':' ilo2'}):
        links = thediv.find('a')
        links = links['href']
        print links
    count = count+1
I get this error:
TypeError: not all arguments converted during string formatting
The final URL should look like this:
http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491
Answer 0 (score: 2)
That is not how the % operator works in Python. Here is how you should use it:
url = 'http://....../ref=sr_pg_%s?rh=.............' % (count, )
Since your URL already contains % signs, you should double them so that Python does not treat them as placeholders:
url = 'http://www.amazon.com/s/ref=sr_pg_%s?rh=n%%3A2858778011%%2Cp_drm_rights%%3APurchase%%7CRental%%2Cn%%3A2858905011%%2Cp_n_date%%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491' % (count, )
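As a quick check of the doubled-% approach, here is a minimal sketch written for Python 3 (the question's code is Python 2, but % formatting behaves the same way here). The names TEMPLATE_PERCENT and build_url_percent are made up for illustration and are not part of the original post.

```python
# Hypothetical helper for illustration only (Python 3).
# Every literal % from the URL encoding is written as %%,
# so the single %s is the only placeholder left in the template.
TEMPLATE_PERCENT = (
    'http://www.amazon.com/s/ref=sr_pg_%s'
    '?rh=n%%3A2858778011%%2Cp_drm_rights%%3APurchase%%7CRental'
    '%%2Cn%%3A2858905011%%2Cp_n_date%%3A2693527011'
    '&page=3&sort=csrank&ie=UTF8&qid=1403073491'
)


def build_url_percent(count):
    # %s receives count; each %% collapses to one literal %.
    return TEMPLATE_PERCENT % (count,)


if __name__ == '__main__':
    print(build_url_percent(2))
```

With count set to 2 this should reproduce the final URL format given in the question.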
That said, Python has a module dedicated to parsing and building URLs, called urllib. You can find its documentation here: https://docs.python.org/3.3/library/urllib.parse.html
Answer 1 (score: 0)
Your string contains URL-encoded entities (%3A and so on). You could try the {} syntax instead:
url = 'http://.....{}...{}...'.format(first_arg, second_arg)
That way you will also spot any other problems in the string.
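A minimal sketch of the {} approach applied to the question's URL, written for Python 3. The names TEMPLATE_FORMAT and build_url_format are hypothetical. Because str.format only treats braces as special, the URL-encoded %3A, %2C and %7C sequences can stay exactly as they are; a URL containing literal braces would need them doubled as {{ and }}.

```python
# Hypothetical helper for illustration only (Python 3).
# The lone {} marks where the page number goes; % signs need no escaping.
TEMPLATE_FORMAT = (
    'http://www.amazon.com/s/ref=sr_pg_{}'
    '?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental'
    '%2Cn%3A2858905011%2Cp_n_date%3A2693527011'
    '&page=3&sort=csrank&ie=UTF8&qid=1403073491'
)


def build_url_format(count):
    # str.format substitutes count for {} and leaves % untouched.
    return TEMPLATE_FORMAT.format(count)


if __name__ == '__main__':
    for count in range(2, 5):
        print(build_url_format(count))
```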
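Answer 2 (score: 0)
If you want to keep the string as it is (without inserting a variable's value into it), the problem may be that you are using single quotes ' to delimit a string that itself contains quotes. You can use double quotes instead: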
url = "http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491"
A better solution is to escape the quotes:
url = 'http://www.amazon.com/s/ref=sr_pg_%s\'% count %\'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
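Answer 3 (score: 0)
Rather than parsing or editing a URL with raw string operations, you should use a dedicated module: urllib2 (or urllib, depending on the Python version).
Here is a simple example using the OP's URL: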
# Python 2: urllib2 happens to expose the urlparse module.
# In Python 3 the same functions live in urllib.parse.
from urllib2 import urlparse

original_url = (
    """http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2"""
    """Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date"""
    """%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491""")
parsed = urlparse.urlparse(original_url)
This returns something like:
ParseResult(
    scheme='http', netloc='www.amazon.com', path='/s/ref=sr_pg_2',
    params='',
    query='rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491', fragment='')
Then we edit the path part of the URL:
scheme, netloc, path, params, query, fragment = parsed
path = '/s/ref=sr_pg_%d' % (count, )
We "unparse" the URL:
new_url = urlparse.urlunparse((scheme, netloc, path, params, query, fragment))
And we have a new URL with the modified path:
'http://www.amazon.com/s/ref=sr_pg_423?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
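For readers on Python 3, the same parse, edit, unparse steps might look like the sketch below, where urlparse and urlunparse come from urllib.parse. The helper name url_for_page is hypothetical, and using the named tuple's _replace method in place of unpacking all six fields is a convenience choice, not something the answer above prescribes.

```python
from urllib.parse import urlparse, urlunparse

ORIGINAL_URL = (
    'http://www.amazon.com/s/ref=sr_pg_2'
    '?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental'
    '%2Cn%3A2858905011%2Cp_n_date%3A2693527011'
    '&page=3&sort=csrank&ie=UTF8&qid=1403073491'
)


def url_for_page(url, count):
    """Return url with its path replaced by the path for page `count`.

    Hypothetical helper: the '/s/ref=sr_pg_<n>' path pattern is taken
    from the question's URL.
    """
    parsed = urlparse(url)
    # ParseResult is a named tuple, so _replace returns a copy
    # with only the path field changed; the query is left as-is.
    updated = parsed._replace(path='/s/ref=sr_pg_%d' % count)
    return urlunparse(updated)


if __name__ == '__main__':
    for count in range(2, 5):
        print(url_for_page(ORIGINAL_URL, count))
```

Only the path is rebuilt with % formatting here, so the percent-encoded query string never passes through a format operation and needs no escaping.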