Question

有时候，我正在使用的某个软件（例如linkchecker）的原始GitHub存储库几乎没有开发，也没有开发，但是创建了很多fork（在本例中为142，撰写本文时）。

对于每个叉子，我想知道：

具有提交功能的分支在原始master分支之前

，并针对每个此类分支：

有多少次提交比原始提交
它背后有多少承诺

GitHub has a web interface for comparing forks，但我不想为每个分叉手动执行此操作，我只想要一个包含所有分叉结果的CSV文件。如何编写脚本？ The GitHub API can list the forks，但我看不到如何将叉子与之比较。依次克隆每个fork并在本地进行比较似乎有些粗糙。

Answer 1

聚会迟到了 - 我想这是我第二次在这个 SO 帖子上结束，所以我将分享我的基于 js 的解决方案（我最终通过获取和搜索 html 页面制作了一个书签） . 您可以从中创建一个 bookmarklet，也可以简单地将整个内容粘贴到控制台中。适用于基于铬的和 Firefox：

编辑：如果页面上有超过 10 个左右的分叉，您可能会因抓取速度过快而被锁定（网络中 429 个请求过多）。改用异步/等待：

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  for (const fork of forks) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it to console */
    await fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }
})();

或者你可以批量处理，但很容易被锁定

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  getfork = (fork) => {
    return fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }

  while (forks.length) {
    await Promise.all(forks.splice(0, 2).map(getfork));
  }
})();

原始（这会立即触发所有请求，如果请求数超过 github 允许的数量，则可能会将您锁定）

javascript:(() => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  for (const fork of forks) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it to console */
    fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }
})();

将打印如下内容：

https://github.com/user1/repo: 289 commits behind original:master.
https://github.com/user2/repo: 489 commits behind original:master.
https://github.com/user2/repo: 1 commit ahead, 501 commits behind original:master.
...

到控制台。

编辑：用块注释替换注释以提高粘贴能力

Answer 2

以下书签将信息直接打印到网页上，如下所示：

作为书签添加（或粘贴到控制台）的代码：

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const aTags = [...document.querySelectorAll('div.repo a:last-of-type')].slice(1);

  for (const aTag of aTags) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it directly onto the web page */
    await fetch(aTag.href)
      .then(x => x.text())
      .then(html => aTag.outerHTML += `${html.match(/This branch is.*/).pop().replace('This branch is', '').replace(/([0-9]+ commits? ahead)/, '<font color="#0c0">$1</font>').replace(/([0-9]+ commits? behind)/, '<font color="red">$1</font>')}`)
      .catch(console.error);
  }
})();

已从 this answer 修改。

奖金

以下书签也打印 ZIP 文件的链接：

作为书签添加（或粘贴到控制台）的代码：

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const aTags = [...document.querySelectorAll('div.repo a:last-of-type')].slice(1);

  for (const aTag of aTags) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it directly onto the web page */
    await fetch(aTag.href)
      .then(x => x.text())
      .then(html => aTag.outerHTML += `${html.match(/This branch is.*/).pop().replace('This branch is', '').replace(/([0-9]+ commits? ahead)/, '<font color="#0c0">$1</font>').replace(/([0-9]+ commits? behind)/, '<font color="red">$1</font>')}` + " <a " + `${html.match(/href="[^"]*\.zip">/).pop() + "Download ZIP</a>"}`)
      .catch(console.error);
  }
})();

Answer 3

痒得完全一样，并编写了一个抓取器，将在渲染的HTML中打印的信息用于分叉：https://github.com/hbbio/forkizard

绝对不是完美的，而是一个临时解决方案。

Answer 4

active-forks并没有完全满足我的要求，但是它非常接近并且易于使用。

Answer 5

这是一个 Python 脚本，用于列出和克隆前面的所有分支。

它不使用 API。所以它不受速率限制的影响，也不需要身份验证。但如果 GitHub 网站设计发生变化，可能需要进行调整。

与其他答案中显示 ZIP 文件链接的书签不同，此脚本还保存有关提交的信息，因为它使用 git clone 并且还创建了一个带有概述的 commits.htm 文件。

import requests, re, os, sys

def content_from_url(url):
    # TODO handle internet being off and stuff
    text = requests.get(url).content
    return text

def clone_ahead_forks(forklist_url):
    forklist_htm = content_from_url(forklist_url)
    with open("forklist.htm", "w") as text_file:
        text_file.write(forklist_htm)
        
    is_root = True
    for match in re.finditer('<a class="(Link--secondary)?" href="(/([^/"]*)/[^/"]*)">', forklist_htm):
        fork_url = 'https://github.com'+match.group(2)
        fork_owner_login = match.group(3)
        fork_htm = content_from_url(fork_url)
        
        match2 = re.search('<div class="d-flex flex-auto">[^<]*?([0-9]+ commits? ahead(, [0-9]+ commits? behind)?)', fork_htm)
        # TODO if website design changes, fallback onto checking whether 'ahead'/'behind'/'even with' appear only once on the entire page - in that case they are not part of the username etc.
        
        sys.stdout.write('.')
        if match2 or is_root:
            if match2:
                aheadness = match2.group(1) # for example '1 commit ahead, 2 commits behind'
            else:
                aheadness = 'root repo'
                is_root = False # for subsequent iterations
            
            dir = fork_owner_login+' ('+aheadness+')'
            print dir
            
            os.mkdir(dir)
            os.chdir(dir)
            
            # save commits.htm
            commits_htm = content_from_url(fork_url+'/commits')            
            with open("commits.htm", "w") as text_file:
                text_file.write(commits_htm)
            
            # git clone
            os.system('git clone '+fork_url+'.git')
            print
            
            # no need to recurse into forks of forks because they are all listed on the initial page and being traversed already
                
            os.chdir('..')

forklist_url = 'https://github.com/cifkao/tonnetz-viz/network/members'

clone_ahead_forks(forklist_url)

这会创建例如以下目录，每个目录都包含各自的克隆存储库：

bakaiadam (2 commits ahead)
chumo (2 commits ahead, 4 commits behind)
cifkao (root repo)
codedot (76 commits ahead, 27 commits behind)
k-hatano (41 commits ahead)
shimafuri (11 commits ahead, 8 commits behind)

Answer 6

这是一个 Python 脚本，用于列出和克隆前面的分叉。此脚本部分使用了 API，因此它会触发速率限制（您可以通过在脚本中添加 GitHub API authentication 来扩展速率限制（不是无限），请编辑或发布）。

一开始我尝试完全使用 API，但触发速率限制太快，所以现在我使用 is_fork_ahead_HTML 而不是 is_fork_ahead_API。如果 GitHub 网站设计发生变化，这可能需要进行调整。

由于速率限制，我更喜欢我在此处发布的其他答案。

import requests, json, os, re

def obj_from_json_from_url(url):
    # TODO handle internet being off and stuff
    text = requests.get(url).content
    obj = json.loads(text)
    return obj, text

def is_fork_ahead_API(fork, default_branch_of_parent):
    """ Use the GitHub API to check whether `fork` is ahead.
     This triggers the rate limit, so prefer the non-API version below instead.
    """
    # Compare default branch of original repo with default branch of fork.
    comparison, comparison_json = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo+'/compare/'+default_branch_of_parent+'...'+fork['owner']['login']+':'+fork['default_branch'])
    if comparison['ahead_by']>0:
        return comparison_json
    else:
        return False

def is_fork_ahead_HTML(fork):
    """ Use the GitHub website to check whether `fork` is ahead.
    """
    htm = requests.get(fork['html_url']).content
    match = re.search('<div class="d-flex flex-auto">[^<]*?([0-9]+ commits? ahead(, [0-9]+ commits? behind)?)', htm)
    # TODO if website design changes, fallback onto checking whether 'ahead'/'behind'/'even with' appear only once on the entire page - in that case they are not part of the username etc.
    if match:
        return match.group(1) # for example '1 commit ahead, 114 commits behind'
    else:
        return False

def clone_ahead_forks(user,repo):
    obj, _ = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo)
    default_branch_of_parent = obj["default_branch"]
    
    page = 0
    forks = None
    while forks != [{}]:
        page += 1
        forks, _ = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo+'/forks?per_page=100&page='+str(page))

        for fork in forks:
            aheadness = is_fork_ahead_HTML(fork)
            if aheadness:
                #dir = fork['owner']['login']+' ('+str(comparison['ahead_by'])+' commits ahead, '+str(comparison['behind_by'])+'commits behind)'
                dir = fork['owner']['login']+' ('+aheadness+')'
                print dir
                os.mkdir(dir)
                os.chdir(dir)
                os.system('git clone '+fork['clone_url'])
                print
                
                # recurse into forks of forks
                if fork['forks_count']>0:
                    clone_ahead_forks(fork['owner']['login'], fork['name'])
                    
                os.chdir('..')

user = 'cifkao'
repo = 'tonnetz-viz'

clone_ahead_forks(user,repo)

如何确定GitHub上的哪些分支在前面？

6 个答案:

奖金