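Changing the query in a URL with Python

Time: 2017-07-09 19:48:43

Tags: python python-2.7 url

I need to update the query part of a URL (page_index=). I tried several approaches, including the one shown below, but I've hit a wall. I'm new to Python and looking for guidance. The page index ranges from 0 to 511 (new pages are added daily), and I need to update the URL to loop through every index. The index always starts at 0.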

import urlparse

url = 'https://api.appannie.com/v1.2/apps/ios/app/331177714/reviews?start_date=2016-1-01&end_date=2017-8-26&page_index=0&countries=US'
parts = urlparse.urlparse(url)
parts = parts._replace(query = page_index [2])
parts.geturl()

I get this error:

TypeError Traceback (most recent call last)
<ipython-input-29-066332f37bb3> in <module>()
  3 url = 'https://api.appannie.com/v1.2/apps/ios/app/331177714/reviews?start_date=2016-1-01&end_date=2017-8-26&page_index=0&countries=US'
  4 parts = urlparse.urlparse(url)
----> 5 parts = parts._replace(query = page_index [2])
  6 parts.geturl()
  7
TypeError: 'function' object has no attribute '__getitem__'

2 Answers:

Answer 0: (score: 1)

The simplest way is to modify the URL directly:

base_url = "https://api.appannie.com/v1.2/apps/ios/app/331177714/reviews?start_date=2016-1-01&end_date=2017-8-26&page_index={}&countries=US"

for pi in range(512):
    this_url = base_url.format(pi)
    # now get it

A slightly more complex but more customizable way is to pass the parameters as a dict:

import requests

url = "https://api.appannie.com/v1.2/apps/ios/app/331177714/reviews"
params = {
    "start_date": "2016-1-01",
    "end_date": "2017-8-26",
    "countries": "US",
}

for pi in range(512):
    # requests builds the query string from this dict
    params["page_index"] = pi
    res = requests.get(url, params=params)
    if res.ok:
        html = res.text
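To inspect the URLs that a parameter dict produces without sending any requests, the query string can be built with the standard library's urlencode. This is only a sketch: the helper name page_urls is hypothetical, and sorting the parameters is an assumption made here to get a predictable order, not something the API is known to require.

```python
try:  # Python 3
    from urllib.parse import urlencode
except ImportError:  # Python 2.7, as used in the question
    from urllib import urlencode

BASE_URL = "https://api.appannie.com/v1.2/apps/ios/app/331177714/reviews"


def page_urls(base_url, params, pages):
    """Yield one URL per page index without mutating the caller's dict."""
    for page_index in pages:
        # Copy the shared parameters and add the page index for this URL.
        query = dict(params, page_index=page_index)
        # Sort by key so the parameter order is deterministic.
        yield base_url + "?" + urlencode(sorted(query.items()))


shared = {"start_date": "2016-1-01", "end_date": "2017-8-26", "countries": "US"}
urls = list(page_urls(BASE_URL, shared, range(3)))
```

Passing range(512) instead of range(3) would cover every index mentioned in the question.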

Answer 1: (score: 1)

You have to extract the query component of the urlparse() result, modify it, and then rebuild a new URL, like this:

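import urlparse

# url is the URL string from the question
pr = urlparse.urlparse(url)
parts = pr.query.split('&')
# page_index is the third parameter in this query string, hence index 2
parts[2] = 'page_index=2'
new_url = urlparse.urlunparse([pr.scheme, pr.netloc, pr.path, pr.params, "&".join(parts), pr.fragment])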

To cover all page numbers, loop over the last two lines for whatever range of page numbers you need.
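A variant of the same idea that does not depend on the position of page_index is to parse the query into key/value pairs and replace the value by name. The sketch below is one possible way to do that; the helper name with_page_index is hypothetical, and the import fallback is only there so the code runs on both Python 2.7 and Python 3.

```python
try:  # Python 3
    from urllib.parse import urlparse, parse_qsl, urlencode
except ImportError:  # Python 2.7, as used in the question
    from urlparse import urlparse, parse_qsl
    from urllib import urlencode


def with_page_index(url, page_index):
    """Return url with its page_index query parameter set to page_index."""
    parsed = urlparse(url)
    # parse_qsl returns the parameters as (key, value) pairs in their original order.
    pairs = parse_qsl(parsed.query, keep_blank_values=True)
    updated = [(k, str(page_index) if k == "page_index" else v) for k, v in pairs]
    # If the URL had no page_index parameter, append one.
    if not any(k == "page_index" for k, _ in pairs):
        updated.append(("page_index", str(page_index)))
    # _replace expects the query as a string, which urlencode provides.
    return parsed._replace(query=urlencode(updated)).geturl()


url = ("https://api.appannie.com/v1.2/apps/ios/app/331177714/reviews"
       "?start_date=2016-1-01&end_date=2017-8-26&page_index=0&countries=US")
new_url = with_page_index(url, 2)
```

This also shows how the _replace call from the question can work: the query argument has to be the full query string, not an index into something else.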