Concatenating an integer into a URL in Python gives an error

Time: 2014-06-18 08:16:47

Tags: python

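I want to parse a set of URLs, so I need to insert an integer into the URL at the point where the page ID changes, like this.

I put % count % in the middle of the URL, but it does not seem to work. How can I concatenate it?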

count=2
while (count < pages):
    mech = Browser()
    url = 'http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
    url = int(raw_input(url))

    mech = Browser()

    page = mech.open(url)

    soup = BeautifulSoup(page)
    print url
    for thediv in soup.findAll('li',{'class':' ilo2'}):
        links = thediv.find('a')
        links = links['href']
        print links
    count = count+1

I get this error:

TypeError: not all arguments converted during string formatting

The final URL should look like this:

http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491

4 Answers:

Answer 0 (score: 2)

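The % operator does not work that way in Python. It takes a format string on the left and the values to substitute on the right; it cannot be written in the middle of a string literal.

Here is how you should use it: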

url = 'http://....../ref=sr_pg_%s?rh=.............' % (count, )

Since your URL already contains % signs, you should double them so that Python does not treat them as placeholders:

url = 'http://www.amazon.com/s/ref=sr_pg_%s?rh=n%%3A2858778011%%2Cp_drm_rights%%3APurchase%%7CRental%%2Cn%%3A2858905011%%2Cp_n_date%%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491' % (count, )
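To make the doubling rule concrete, here is a minimal sketch written for Python 3 (the question itself uses Python 2). The names `TEMPLATE` and `build_url` are hypothetical and only serve the illustration; the URL is the one from the question.

```python
# Each "%%" in the template collapses to one literal "%" after formatting,
# so only the single "%s" is treated as a placeholder.
TEMPLATE = (
    'http://www.amazon.com/s/ref=sr_pg_%s'
    '?rh=n%%3A2858778011%%2Cp_drm_rights%%3APurchase%%7CRental'
    '%%2Cn%%3A2858905011%%2Cp_n_date%%3A2693527011'
    '&page=3&sort=csrank&ie=UTF8&qid=1403073491'
)


def build_url(count):
    """Return the search URL with the page number substituted."""
    return TEMPLATE % (count,)


print(build_url(2))
```

In the question's code, the string literal ends right after the first %s, so Python evaluates `'...sr_pg_%s' % count` first and then applies % again to a result that no longer contains any placeholder. That second step is what raises the TypeError.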

That said, Python has a module dedicated to parsing and building URLs, urllib (urllib.parse in Python 3). You can find its documentation here: https://docs.python.org/3.3/library/urllib.parse.html
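As a rough sketch of that approach, assuming Python 3.7 or later (so that dict order is preserved), the query string can be generated with `urllib.parse.urlencode` instead of being written by hand. The helper name `build_search_url` is hypothetical, and the parameter values are simply the percent-decoded form of the query in the question's URL.

```python
from urllib.parse import urlencode


def build_search_url(count):
    """Build the search URL for one page; urlencode does the escaping."""
    # Decoded query values taken from the question's URL.
    params = {
        'rh': 'n:2858778011,p_drm_rights:Purchase|Rental,'
              'n:2858905011,p_n_date:2693527011',
        'page': 3,
        'sort': 'csrank',
        'ie': 'UTF8',
        'qid': 1403073491,
    }
    # No "%" appears in the format string itself, so nothing needs doubling.
    return 'http://www.amazon.com/s/ref=sr_pg_%d?%s' % (count, urlencode(params))


print(build_search_url(2))
```

With this layout the % signs only appear in the output, so they never collide with the placeholder syntax.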

Answer 1 (score: 0)

Your string contains URL-encoded entities (%3A and so on), which clash with % formatting. You can try the {} syntax instead:

url = 'http://.....{}...{}...'.format(first_arg, second_arg)

Then any other problems in the string will show up as well.
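A short sketch of the {} approach applied to the question's URL, written for Python 3. `TEMPLATE` and `page_urls` are hypothetical names used only here. Because `str.format` gives no special meaning to %, the encoded entities can stay exactly as they are.

```python
# The only placeholder is "{}"; every "%3A", "%2C" etc. is left untouched.
TEMPLATE = (
    'http://www.amazon.com/s/ref=sr_pg_{}'
    '?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental'
    '%2Cn%3A2858905011%2Cp_n_date%3A2693527011'
    '&page=3&sort=csrank&ie=UTF8&qid=1403073491'
)


def page_urls(first, last):
    """Return one URL per page number from first to last inclusive."""
    return [TEMPLATE.format(n) for n in range(first, last + 1)]


for url in page_urls(2, 4):
    print(url)
```

Note that this relies on the URL containing no literal braces; if it did, they would have to be doubled ({{ and }}) in the same way % is doubled for the % operator.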

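Answer 2 (score: 0)

If you want to keep the string as it is (without inserting a variable's value into it), the problem may be that you are using single quotes ' to delimit a string that itself contains quotes. You can use double quotes instead: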

url = "http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491"

A better solution is to escape the quotes:

url = 'http://www.amazon.com/s/ref=sr_pg_%s\'% count %\'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
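A small check of what these two forms actually produce, using a shortened fragment of the URL for readability (the shortening and the variable names are only for illustration). Both spellings yield the same string, and the text % count % stays in it literally, so the value of count is never substituted.

```python
count = 2  # defined but never used by either string below

# Double-quoted form: the inner single quotes are ordinary characters.
double_quoted = "sr_pg_%s'% count %'%s?rh=n%3A2858778011"

# Single-quoted form with the inner quotes escaped.
escaped = 'sr_pg_%s\'% count %\'%s?rh=n%3A2858778011'

print(double_quoted == escaped)   # the two literals are the same string
print('% count %' in escaped)     # the placeholder text is kept verbatim
```

This only helps if a literal string is really what is wanted; to insert the page number, one of the formatting approaches from the other answers is still needed.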

Answer 3 (score: 0)

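Rather than parsing or editing URLs with raw string operations, you should use a dedicated module: urllib2 (or urllib, depending on your Python version), which gives access to the standard URL parser (urlparse in Python 2, urllib.parse in Python 3).

Here is a simple example using the OP's URL: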

from urllib2 import urlparse
original_url = (
    """http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2"""
    """Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date"""
    """%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491""")
parsed = urlparse.urlparse(original_url)

This returns something like:

ParseResult(
    scheme='http', netloc='www.amazon.com', path='/s/ref=sr_pg_2',
    params='',
    query='rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491', fragment='')

Then we edit the path part of the URL:

scheme, netloc, path, params, query, fragment = parsed
path = '/s/ref=sr_pg_%d' % (count, )

We "unparse" the URL:

new_url = urlparse.urlunparse((scheme, netloc, path, params, query, fragment))

And we have a new URL with the modified path (shown here for count = 423):

'http://www.amazon.com/s/ref=sr_pg_423?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
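For readers on Python 3, the same steps might look like the sketch below. It assumes the parser lives in `urllib.parse` (its Python 3 location); `ORIGINAL_URL` and `with_page` are hypothetical names. It uses the named tuple's `_replace` method in place of unpacking all six fields.

```python
from urllib.parse import urlparse, urlunparse

ORIGINAL_URL = (
    'http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2'
    'Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date'
    '%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
)


def with_page(url, count):
    """Return url with the page number in its path replaced by count."""
    parsed = urlparse(url)
    new_path = '/s/ref=sr_pg_%d' % (count,)
    # Only the path changes; the query string is passed through untouched.
    return urlunparse(parsed._replace(path=new_path))


# One URL per page, as in the question's loop.
for count in range(2, 5):
    print(with_page(ORIGINAL_URL, count))
```

Because the query string is never run through a format operation, its percent-encoded entities need no escaping at all.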