Question

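I want to remove large files from my git repository. However, I want to be specific about it, so I would like to see all file sizes across the repository's entire history.

I have written the following bash script, but it seems quite inadequate and probably misses files that were deleted at some point in history: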

git log --pretty=tformat:%H | while read hash; do
   git show --stat --name-only $hash | grep -P '^(?:(?!commit|Author:|Date:|Merge:|   ).)*$' | while read filename; do
      if [ ! -z "$filename" ]; then
          git show "$hash:$filename" | wc -c | while read filesize; do
             if [ $(echo "$filesize > 100000" | bc) -eq 1 ]; then
                printf "%-40s %11s %s\n" "$hash" "$filesize" "$filename"
             fi
          done
      fi
   done
done

Any suggestions for a better approach?

Answer 1

You're most of the way there, really.

git log --pretty=tformat:%H

This should probably be git rev-list <start-points>, e.g., git rev-list HEAD or git rev-list --all. You may want to add --topo-order --reverse, for reasons we'll get to in a moment.

 | while read hash; do
   git show --stat --name-only $hash

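Instead of git show --stat, you probably just want to use git ls-tree on the hash. With a recursive git ls-tree (-r), you will find every blob within the given commit (and every tree too, if you add -t), along with the corresponding path name.

Trees are probably not interesting, so we can boil this down to blobs. Note, by the way, that git ls-tree will encode at least some problematic file names unless you use -z (but that makes the entries harder to read; bash can do it, plain sh cannot).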

 | grep -P '^(?:(?!commit|Author:|Date:|Merge:|   ).)*$' | while read filename; do

Using git ls-tree we can replace this with:

git ls-tree -r $hash | while read mode type objhash path; do

and then we'll skip anything whose type is not blob:

[ "$type" = blob ] || continue

  if [ ! -z "$filename" ]; then

We don't need this at all.

      git show "$hash:$filename" | wc -c | while read filesize; do
         if [ $(echo "$filesize > 100000" | bc) -eq 1 ]; then
            printf "%-40s %11s %s\n" "$hash" "$filesize" "$filename"
         fi

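It's not clear to me why you have the while read filesize loop, nor the complicated test. In any case, the easy way to get the size of a blob object is git cat-file -s $objhash, and the result is easy to test with, for instance, [ $blobsize -gt 100000 ]:

    blobsize=$(git cat-file -s "$objhash")
    if [ "$blobsize" -gt 100000 ]; then
       echo "$hash contains $path size $blobsize"
    fi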
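
As a possible shortcut, git ls-tree also accepts -l (--long), which prints each blob's size as a fourth column, so the separate git cat-file call per blob can be avoided. The following is only a minimal sketch: big_blobs is a made-up helper name, and it treats a blob as big when its size is at least the threshold.

```shell
# big_blobs (hypothetical helper): read `git ls-tree -r -l <commit>` output on
# stdin and print "size<TAB>path" for every blob of at least $1 bytes.
big_blobs() {
    awk -v big="$1" '
        $2 == "blob" && $4 + 0 >= big + 0 {
            # the path is everything after the single TAB separator
            path = substr($0, index($0, "\t") + 1)
            printf "%s\t%s\n", $4, path
        }'
}
```

It could be used as: git ls-tree -r -l "$hash" | big_blobs 100000. Submodule entries have type commit and a size of "-", so the type check skips them.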

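But by dropping git show in favor of git ls-tree -r, we'll see every copy of every file in every commit, rather than seeing it just once, in the first commit in which it appears. For instance, if commit f00f1e adds the big file bigfile and it persists unchanged in commit baafba6, we'll see it both times. git show --stat runs a variant of git diff that compares each commit against its parent(s), so a file that is unchanged from the parent is omitted.

The slight defect (or maybe not a defect) of the diff approach is that we "re-see" a file if it comes back. For instance, if that big file is removed in the third commit and restored in the fourth, we'll see it twice.

This is where we may want --topo-order --reverse. If we use it, we get all parent commits before their children. We can then save the object hash of each file we report and suppress repeat reports. A programming language with associative arrays (hash tables) would be handy here, but we can do this in plain shell with a file or directory that holds the previously displayed object hashes: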

#! /bin/sh

# get temporary file to hold viewed object hashes
TF=$(mktemp)
trap 'rm -f "$TF"' 0 1 2 3 15

BIG=100000  # blobs smaller than this are not big (use -le below to exclude this size too)

git rev-list --all --topo-order --reverse |
while read -r commithash; do
    git ls-tree -r "$commithash" |
    while read -r mode type objhash path; do
        [ "$type" = blob ] || continue      # only look at files
        blobsize=$(git cat-file -s "$objhash")
        [ "$blobsize" -lt "$BIG" ] && continue # or -le
        # found a big file - have we seen it yet?
        grep "$objhash" "$TF" >/dev/null && continue
        echo "$blobsize byte file added at commit $commithash as $path"
        echo "$objhash" >> "$TF" # don't print again under any path name
    done
done

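Note that since we now remember big files by their hash ID, we won't re-announce them even if they re-appear under another name (e.g., after a git mv), or are removed and then re-appear, under the same name or another one.

If you prefer the diff-invoking method that git show uses, we can use that instead of our hash-saving temporary file, and still avoid clumsily stripping out commit messages, by using the appropriate plumbing command, which is git diff-tree. It is probably still wise to use --topo-order (as a general rule), although it is no longer required. So this gives: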
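
The remark above about associative arrays can be illustrated with awk, which has them built in. This is only a sketch under an assumption: the inner loop is changed to print one line per big blob in the made-up form "objhash size commithash path" instead of consulting the temporary file. first_seen is a hypothetical helper name.

```shell
# first_seen (hypothetical helper): keep only the first line for each object
# hash (field 1), so a blob is reported once however often it reappears.
first_seen() {
    awk '!seen[$1]++'
}
```

Because git rev-list --topo-order --reverse lists parents before children, the first line kept for a given hash comes from a commit that no ancestor precedes in the list.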

BIG=100000 # just as before

git rev-list --all --topo-order | while read -r commithash; do
    # --root makes the initial commit show its files as additions
    git diff-tree -r --root --name-only --diff-filter=AMT "$commithash" |
        tail -n +2 | while read -r path; do
            objsize=$(git cat-file -s "$commithash:$path")
            [ "$objsize" -lt "$BIG" ] && continue
            echo "$objsize byte file added at commit $commithash as $path"
        done
done

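git diff-tree needs -r to work recursively (same as git ls-tree), needs --name-only to print only file names, and needs --diff-filter=AMT to print only the names of files that were added, modified, or type-changed (from symbolic link to file or vice versa). Annoyingly, git diff-tree prints the commit ID again as the first line. We can suppress the ID with --no-commit-id, but then we get a blank line, so we might as well use tail -n +2 to skip the first line.

The rest of the script is the same as yours, except that we get the object's size the easy way, using git cat-file -s, and test it directly with the [ / test program.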

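Note that for merge commits, git diff-tree (like git show) can use combined diffs, showing only files that, in the merge result, match neither parent. (Unlike git show, git diff-tree may need -c or --cc to produce a combined diff; without them it may print nothing at all for a merge.) Either way this should be fine: if file huge is 4GB in the merge result but is identical to the 4GB file huge in one of the two merged commits, we'll see huge when it was added to that commit, rather than in the merge itself.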

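(If that's not desirable, you can add -m to the git diff-tree command. However, you will then need to take out the tail -n +2 and put in --no-commit-id, which behaves differently under -m. This particular behavior in Git is somewhat annoying, although it makes sense with the default output format, which is similar to git log --raw.)

(NB: the code above is not tested. I spotted and fixed $hash vs $commithash on the last re-read.)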

Answer 2

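The git ls-files command gives you a list of all files in the index (that is, the current state, not the full history). If you pass the --debug option, it outputs additional data in the following format: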

path/filename.ext
  ctime: ${timestamp}:0
  mtime: ${timestamp}:0
  dev: 16777220 ino: 62244153
  uid: 1912685926   gid: 80
  size: ${bytes}    flags: 0

You can then parse the size value from the result and compare it against the maximum you have chosen.
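
The parsing step could look roughly like the sketch below. It assumes the layout shown above (an unindented path line followed by indented detail lines) and paths that do not begin with a space; files_over is a made-up helper name. Keep in mind that the sizes come from the index's cached stat data, so this covers only files currently tracked, not earlier versions in history.

```shell
# files_over (hypothetical helper): read `git ls-files --debug` output on stdin
# and print "size<TAB>path" for entries whose size exceeds $1 bytes.
files_over() {
    awk -v max="$1" '
        /^[^ ]/ { path = $0; next }            # unindented line: the path
        $1 == "size:" && $2 + 0 > max + 0 {    # indented "size: N  flags: F" line
            printf "%s\t%s\n", $2, path
        }'
}
```

It could be used as: git ls-files --debug | files_over 100000.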

Answer 3

git log --name-only --diff-filter=d --all --pretty=format:%H \
| awk '/^$/{c=""}!c{c=$1;next}{print c":"$0,c,$0}' \
| git cat-file --batch-check=$'%(rest)\t%(objectsize)'

This shows every changed-but-not-deleted file after the commit ID, for each commit in history, reformats the list to look like

sha:path sha path

and feeds those lines to --batch-check to extract all the sizes in one pass.
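
To find the largest entries, the output of that pipeline can be sorted on the size column. A minimal sketch, assuming the "sha path<TAB>size" output produced by the --batch-check format above; largest is a made-up helper name:

```shell
# largest (hypothetical helper): read "sha path<TAB>size" lines on stdin and
# print the $1 biggest entries, largest first.
largest() {
    tab=$(printf '\t')
    sort -t "$tab" -k2,2nr | head -n "$1"
}
```

Appending | largest 20 to the pipeline would list the twenty biggest file versions. Lines that git cat-file reports as missing contain no TAB, so they sort as size 0 and end up last.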

How do I get the size of every file across the entire git history?
