When I try to fetch the number of contributors for a GitHub project in a loop, I get an index-out-of-range error. After a number of iterations (which work fine) it just throws that exception. I don't know why...
import requests
from lxml import html

for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)  # prints the correct number until the exception
Here is the exception:
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
Answer 0 (score: 0)
GitHub is blocking your repeated requests. Don't scrape sites in quick succession; many website operators actively block clients that send too many requests. As a result, the content that is returned no longer matches your XPath query.
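A quick way to confirm this is to guard against an empty XPath result and look at the HTTP status code before indexing. This is just a diagnostic sketch built around the question's own loop, not a fix:

import requests
from lxml import html

xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'

for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    matches = html.fromstring(r.text).xpath(xpath)
    if not matches:
        # The page no longer contains the expected markup; most likely a
        # rate-limit or abuse-detection page. Inspect the status code
        # instead of blindly indexing into an empty list.
        print('No match on iteration', x, '- HTTP status:', r.status_code)
        break
    print(int(matches[0].strip().replace(',', '')))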
You should use the REST API that GitHub provides to retrieve project statistics such as the contributor count, and you should implement some form of rate limiting. There is no need to retrieve the same number 100 times; the contributor count does not change that rapidly.
The API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data has actually changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse

owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....'   # optional Github basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'

with requests.session() as sess:
    # GitHub requests you use your username or appname in the header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)
    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'
    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break

        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of current
            # to get accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report on count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)

        # only get us a fresh response next time
        sess.headers['If-None-Match'] = r.headers['ETag']

        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))
The above uses a requests session object to handle repeated headers and to make sure connections are reused where possible.
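As a minimal illustration of that point (a sketch separate from the answer's code; the user-agent suffix and credentials are placeholders), a session carries default headers and authentication across every request it makes and keeps the underlying connection alive between calls:

import requests

sess = requests.Session()
# headers set once on the session are sent with every subsequent request
sess.headers['User-Agent'] += ' - my-example-script'
# sess.auth = ('my-username', 'my-token')  # hypothetical credentials

# both calls reuse the same connection pool and the same default headers
r1 = sess.get('https://api.github.com/rate_limit')
r2 = sess.get('https://api.github.com/rate_limit')
print(r1.status_code, r2.status_code)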
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
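For example, a rough sketch using github3.py, assuming its top-level repository() helper and the Repository.contributors() iterator behave as documented (this is an illustration, not part of the original answer):

import github3

# anonymous access; github3.login(token='...') gives you a larger quota
repo = github3.repository('tipsy', 'profile-summary-for-github')
# the library follows the API's pagination for you while iterating
contributor_count = sum(1 for _ in repo.contributors(anon=True))
print("Contributor count:", contributor_count)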
If you do insist on continuing to scrape the website directly, you run the risk of the site operators blocking you altogether. Try to take some responsibility and don't hammer the site continuously.
That means that, at the very least, you should honour the Retry-After header that GitHub gives you on a 429 response:
if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    # parse OK response
Answer 1 (score: 0)
This is now working fine for me, using the API. Probably the cleanest way to do it.
import requests
import json

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)

page_number = 1
while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100' + '&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
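A possible variant of the same loop, sketched here as a suggestion rather than part of the original answer, follows the Link pagination header that the GitHub API returns (requests exposes it as response.links) instead of counting pages by hand:

import requests

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?per_page=100'
commits_total = 0
while url:
    response = requests.get(url)
    response.raise_for_status()
    commits_total += len(response.json())
    # the 'next' link is absent on the last page, which ends the loop
    url = response.links.get('next', {}).get('url')
print(commits_total)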