Getting the next page when scraping an API

Date: 2018-07-04 00:34:24

Tags: python-3.x api web-scraping

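I'm trying to use pagination and move on to the next page once the current page has been scraped. This is my first time scraping an API, so I'm a bit lost and haven't found anything helpful online.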

Question: What do I need to do to get to the next page?

API:https://games.crossfit.com/competitions/api/v1/competitions/open/2018/leaderboards?division=2&region=0&scaled=0&sort=0&occupation=0&page=1

Code (what I have so far):

import pandas as pd
import requests
import json

url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/2018/leaderboards?division=1&region=0&scaled=0&sort=0&occupation=0&page=1'

nameList = []
genderList = []
regionList = []
gymList = []
ageList = []
heightList = []
weightList = []
ordList = []
overallList = []
overallScoreList = []

response = requests.get(url)
data = response.text
parsed = json.loads(data)

year = parsed['competition']['year']
comp = parsed['competition']['competitionType']
board = parsed['leaderboardRows']
# Each row is one athlete; "row" avoids shadowing the built-in all()
for row in board:
    name = row['entrant']['competitorName']
    gender = row['entrant']['gender']
    region = row['entrant']['regionName']
    gym = row['entrant']['affiliateName']
    age = row['entrant']['age']
    overall = row['overallRank']
    overallS = row['overallScore']
    height = row['entrant']['height']
    weight = row['entrant']['weight']

    nameList.append(name)
    genderList.append(gender)
    regionList.append(region)
    gymList.append(gym)
    ageList.append(age)
    heightList.append(height)
    weightList.append(weight)
    overallList.append(overall)
    overallScoreList.append(overallS)

2 Answers:

Answer 0 (score: 2)

The CrossFit API provides all the necessary information in the pagination section of its response. It gives you something like this:

"pagination":
    {
        "currentPage":1,
        "totalPages":3440,
        "totalCompetitors":171977
    },

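To fetch a page other than 1, change the GET parameter in the URL: instead of &page=1, write &page=2. It is best to build the URL with a function that accepts the relevant parameters, so that, for example, url_for_page(20) returns https://games.crossfit.com/competitions/api/v1/competitions/open/2018/leaderboards?division=2&region=0&scaled=0&sort=0&occupation=0&page=20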
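A minimal sketch of such a helper could look like the following. Only the name url_for_page and the query parameters come from the URL above; the BASE_URL constant and the keyword arguments with their defaults are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumption: the part of the leaderboard URL before the query string.
BASE_URL = ("https://games.crossfit.com/competitions/api/v1/"
            "competitions/open/2018/leaderboards")


def url_for_page(page, division=2, region=0, scaled=0, sort=0, occupation=0):
    """Build the leaderboard URL for one page number.

    The keyword defaults mirror the values in the question's API URL;
    exposing them as arguments is an illustrative choice.
    """
    params = {
        "division": division,
        "region": region,
        "scaled": scaled,
        "sort": sort,
        "occupation": occupation,
        "page": page,
    }
    # urlencode keeps the dict's insertion order, so "page" comes last.
    return f"{BASE_URL}?{urlencode(params)}"


print(url_for_page(20))
```

Building the query string from a dict keeps the page number, division and the other filters in one place instead of scattered through a long string literal.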

Hope this helps.

Answer 1 (score: 1)

A quick and easy way looks like this:

import requests

url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/2018/leaderboards?division=1&region=0&scaled=0&sort=0&occupation=0&page={}'

for link in [url.format(page) for page in range(1,5)]:
    response = requests.get(link)
    for item in response.json()['leaderboardRows']:
        name = item['entrant']['competitorName']
        print(name)
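The loop above stops at a hard-coded page 4. If every page is wanted, one option is to read totalPages from the pagination section of the first response and loop up to that number. The sketch below is one way to do it, not a definitive implementation: iter_leaderboard_rows, fetch_page and max_pages are hypothetical names, and the fetcher is passed in so the paging logic can be tried without hitting the network.

```python
def iter_leaderboard_rows(fetch_page, max_pages=None):
    """Yield every leaderboard row, page by page.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON for that page (a hypothetical helper supplied by the caller).
    max_pages optionally caps how many pages are requested.
    """
    first = fetch_page(1)
    # The first response reports how many pages exist in total.
    total_pages = int(first["pagination"]["totalPages"])
    if max_pages is not None:
        total_pages = min(total_pages, max_pages)

    yield from first["leaderboardRows"]
    # Page 1 is already fetched, so continue from page 2.
    for page in range(2, total_pages + 1):
        yield from fetch_page(page)["leaderboardRows"]
```

With requests, the fetcher could be as simple as `lambda page: requests.get(url.format(page)).json()`, and the names are then printed with `for item in iter_leaderboard_rows(fetcher, max_pages=5): print(item['entrant']['competitorName'])`. The response shown in the other answer reports several thousand pages, so a cap or a short pause between requests is worth considering.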