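Context
We are in the middle of migrating a Subversion repository to multiple git repositories. The content of one of these repositories is smaller than 100MB, but its .git directory is over 5GB. The goal is to keep the git history while removing the large files. We do not want the .git directory to be larger than 300MB, otherwise cloning the repository takes a long time.
Current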
[user@localhost testGitMigration]$ du -h .git/
4.0K .git/refs/heads
0 .git/refs/tags
4.0K .git/refs/original/refs/heads
4.0K .git/refs/original/refs
4.0K .git/refs/original
8.0K .git/refs
0 .git/branches
40K .git/hooks
8.0K .git/info
808M .git/objects/pack
4.0K .git/objects/info
4.0K .git/objects/2d
4.0K .git/objects/14
4.0K .git/objects/a8
808M .git/objects
4.0K .git/logs/refs/heads
4.0K .git/logs/refs
8.0K .git/logs
808M .git/
Expected
[user@localhost testGitMigration]$ du -h .git/
4.0K .git/refs/heads
0 .git/refs/tags
4.0K .git/refs/original/refs/heads
4.0K .git/refs/original/refs
4.0K .git/refs/original
8.0K .git/refs
0 .git/branches
40K .git/hooks
8.0K .git/info
1M .git/objects/pack
4.0K .git/objects/info
4.0K .git/objects/2d
4.0K .git/objects/14
4.0K .git/objects/a8
1M .git/objects
4.0K .git/logs/refs/heads
4.0K .git/logs/refs
8.0K .git/logs
1M .git/
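Problem statement
As defined in the Context paragraph, the goal is to get rid of the oversized files in the .git directory. It appears that, back in the day, some people committed large files. Although the svn folders could be migrated to git repositories, the history was carried over unchanged: .git is over 5GB, while the content of the repo is smaller than 100MB. If the large files are removed from the git repo, will the history still be correct, or will it be corrupted? In short, the .git directory should not be 5GB when the content is smaller than 100MB.
Does anyone have experience with this kind of migration? The only other solution I can think of is to ignore the history and commit the files to a new repository, but then all history would be gone. The preference is to keep the history but remove the oversized files. How can these oversized files be found? The repository was created in 2013, and it is unclear which files should be removed from the history and how to do that without breaking the log.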
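Sample code and data
To reproduce this, we created a new local git repository by running mkdir testGitMigration, cd testGitMigration and git init.
With the git repository in place, two text files were created and an iso was downloaded: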
[user@localhost testGitMigration]$ du -h *
4.0K hello
825M ubuntu-16.04.3-server-amd64.iso
4.0K world
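As you can see, there is one large file, ubuntu-16.04.3-server-amd64.iso, which is larger than 800MB. In the real situation we are dealing with, several such large files may have been added over the years. Since the .git directory contains the full history, it is at least as large as that file once it has been committed, about 808MB here: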
[user@localhost testGitMigration]$ du -h .git/
4.0K .git/refs/heads
0 .git/refs/tags
4.0K .git/refs
0 .git/branches
40K .git/hooks
4.0K .git/info
808M .git/objects/pack
0 .git/objects/info
4.0K .git/objects/ce
4.0K .git/objects/b4
4.0K .git/objects/3b
4.0K .git/objects/55
4.0K .git/objects/53
4.0K .git/objects/cc
4.0K .git/objects/a1
4.0K .git/objects/5c
4.0K .git/objects/c7
4.0K .git/objects/fe
808M .git/objects
4.0K .git/logs/refs/heads
4.0K .git/logs/refs
8.0K .git/logs
808M .git/
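For anyone who wants to reproduce this state without downloading an iso, here is a minimal sketch. It assumes that a file of random bytes is an acceptable stand-in for the iso (random data does not compress, so .git grows by roughly the file size). The function name make_test_repo, the file name big.bin and the order in which the three files are committed are assumptions for illustration, not taken from the session above.

```shell
#!/bin/bash
# Sketch: build a small test repository with one large file in its history.
# make_test_repo and big.bin are hypothetical names; the commit order is assumed.
make_test_repo() {
  local dir=$1 size_mb=${2:-5}
  mkdir -p "$dir"
  git -C "$dir" init --quiet
  git -C "$dir" config user.name "user"
  git -C "$dir" config user.email "user@user.user"

  # first commit: a tiny text file
  echo "hello" > "$dir/hello"
  git -C "$dir" add hello
  git -C "$dir" commit --quiet -m "first file"

  # second commit: the large file (random bytes stand in for the iso)
  head -c "$((size_mb * 1024 * 1024))" /dev/urandom > "$dir/big.bin"
  git -C "$dir" add big.bin
  git -C "$dir" commit --quiet -m "second file"

  # third commit: another tiny text file
  echo "world" > "$dir/world"
  git -C "$dir" add world
  git -C "$dir" commit --quiet -m "third file"
}

# Example use:
#   make_test_repo /tmp/testGitMigration 5 && du -sh /tmp/testGitMigration/.git
```

With a 5MB stand-in, du on the resulting .git directory should report a little over 5MB, which mirrors the 808M seen with the real iso.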
Let's see what happens if the iso is removed:
[user@localhost testGitMigration]$ git log
commit fe7455c0eb6964772526eb848255a6eb11f2283a
Author: user <user@user.user>
Date: Wed Dec 20 21:28:00 2017 +0100
removed iso
commit 5ce0fed4ebe891accd9a1fc3f0ee8ebd3af8d7f0
Author: user <user@user.user>
Date: Wed Dec 20 21:22:54 2017 +0100
third file
commit 53dd97210f2b7b8270d66698bb0438d5071b0038
Author: user <user@user.user>
Date: Wed Dec 20 21:22:40 2017 +0100
second file
commit 3b1baf9d65f051b4fc402d7375f3ff199ddd2dab
Author: user <user@user.user>
Date: Wed Dec 20 21:19:24 2017 +0100
first file
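Although the iso has been removed, the repository is still larger than 800MB. This suggests that, in the real repository, several large files were probably added back in the day and deleted later, since the repo content itself is smaller than 300MB while the .git directory is larger than 5GB.
So how do we get rid of these large files? If we can achieve this in the test scenario, the expectation is that the repo will be smaller than 1MB, because the disk usage of the files in this repo without the iso is as follows: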
[user@localhost testGitMigration]$ du -h *
4.0K hello
4.0K world
The output of git log does not show the size of each commit. So how can the size of each commit be determined?
I found this code:
https://gist.github.com/magnetikonline/dd5837d597722c9c2d5dfa16d8efe5b9
#!/bin/bash -e
# work over each commit and append all files in tree to $tempFile
tempFile=$(mktemp)
for commitSHA1 in $(git rev-list --all); do
git ls-tree -r --long "$commitSHA1" >>"$tempFile"
done
# sort files by SHA1, de-dupe list and finally re-sort by filesize
sort --key 3 "$tempFile" | \
uniq | \
sort --key 4 --numeric-sort --reverse
# remove temp file
rm "$tempFile"
When run, it shows the following output:
[user@localhost testGitMigration]$ ./gitlistobjectbysize.sh
100644 blob 02b6feb032c58dc07eb18af81a4067fbf154cc30 865075200 ubuntu-16.04.3-server-amd64.iso
100644 blob ce013625030ba8dba906f756967f9e9ca394464a 6 hello
100644 blob cc628ccd10742baea8241c5924df992b5c019f71 6 world
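As an alternative to walking every tree of every commit, the same information can be obtained from the object database directly. The sketch below pipes git rev-list --objects --all into git cat-file --batch-check and sorts blobs by size. The function name list_largest_blobs and the default of ten entries are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: list the N largest blobs reachable from any ref.
# list_largest_blobs is a hypothetical helper name.
list_largest_blobs() {
  local count=${1:-10}
  # rev-list prints "<sha> <path>"; %(rest) carries the path through cat-file
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |
    sort --key=3 --numeric-sort --reverse |
    head -n "$count"
}

# Only run when invoked from inside a git repository
if git rev-parse --git-dir >/dev/null 2>&1; then
  list_largest_blobs 10
fi
```

Each output line has the form "blob <sha> <size in bytes> <path>", so the largest file in history appears first.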
So the large file has been found! Let's remove it from the history:
[user@localhost testGitMigration]$ git filter-branch --tree-filter 'rm -f ubuntu-16.04.3-server-amd64.iso' -- --all
Rewrite fe7455c0eb6964772526eb848255a6eb11f2283a (4/4)
Ref 'refs/heads/master' was rewritten
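A variant worth noting: git filter-branch also accepts an --index-filter, which rewrites the index without checking out every commit and is therefore usually much faster than --tree-filter on a long history. The following is only a sketch; the wrapper name remove_path_from_history is hypothetical, and it assumes a clean working tree, no existing refs/original backup from an earlier run, and a path without single quotes.

```shell
#!/bin/bash
# Sketch: remove one path from every commit on every ref using --index-filter.
# remove_path_from_history is a hypothetical helper name.
# Assumes: clean working tree, no existing refs/original backup,
# and a path that contains no single quotes.
remove_path_from_history() {
  local path=$1
  # FILTER_BRANCH_SQUELCH_WARNING skips the deprecation notice in newer git versions
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
    --index-filter "git rm --cached --ignore-unmatch --quiet -- '$path'" \
    -- --all
}

# Example use:
#   remove_path_from_history ubuntu-16.04.3-server-amd64.iso
```

Like the --tree-filter run, this keeps a backup of the rewritten refs under refs/original.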
Now the .git directory should be smaller than 1MB, right? Let's see:
[user@localhost testGitMigration]$ du -h .
4.0K ./.git/refs/heads
0 ./.git/refs/tags
4.0K ./.git/refs/original/refs/heads
4.0K ./.git/refs/original/refs
4.0K ./.git/refs/original
8.0K ./.git/refs
0 ./.git/branches
40K ./.git/hooks
8.0K ./.git/info
808M ./.git/objects/pack
4.0K ./.git/objects/info
4.0K ./.git/objects/2d
4.0K ./.git/objects/14
4.0K ./.git/objects/a8
808M ./.git/objects
4.0K ./.git/logs/refs/heads
4.0K ./.git/logs/refs
8.0K ./.git/logs
808M ./.git
808M .
Still the same? How is that possible?
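One way to investigate is to check which refs can still reach the large blob. The du output above shows a refs/original directory, which is where git filter-branch stores a backup of the refs it rewrote, so the old commits may still be referenced from there. The sketch below prints every ref from which a given blob is reachable; the function name refs_reaching_blob is an assumption for illustration.

```shell
#!/bin/bash
# Sketch: print every ref whose history still contains the given blob.
# refs_reaching_blob is a hypothetical helper name.
refs_reaching_blob() {
  local blob=$1 ref
  git for-each-ref --format='%(refname)' |
    while read -r ref; do
      # rev-list --objects lists commits, trees and blobs reachable from the ref
      if git rev-list --objects "$ref" | grep -q "^$blob"; then
        echo "$ref"
      fi
    done
}

# Example use, with the blob id from the size listing:
#   refs_reaching_blob 02b6feb032c58dc07eb18af81a4067fbf154cc30
```

If this prints a ref under refs/original, the rewritten branch no longer contains the file but the backup still does. Reflog entries are not covered by this check and can keep old commits alive as well.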
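What about removing the whole commit? That is not an option, because a commit may well contain several files. If the whole commit were removed, the oversized file would be gone, but so would all the other files in it that need to be kept.
Perhaps garbage collection should be run once the iso has been removed?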
[user@localhost testGitMigration]$ git gc
Counting objects: 14, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (14/14), done.
Total 14 (delta 5), reused 9 (delta 1)
Is the .git dir now smaller than 1MB? No:
[user@localhost testGitMigration]$ du -h .git/
0 .git/refs/heads
0 .git/refs/tags
0 .git/refs/original
0 .git/refs
0 .git/branches
40K .git/hooks
8.0K .git/info
808M .git/objects/pack
4.0K .git/objects/info
808M .git/objects
4.0K .git/logs/refs/heads
4.0K .git/logs/refs
8.0K .git/logs
808M .git/
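For reference, the git filter-branch documentation has a checklist for shrinking a repository after a rewrite: remove the backup refs under refs/original, expire all reflogs, and garbage collect with everything unreferenced pruned immediately. A minimal sketch of those steps follows. It is untested against this particular repository, the function name drop_rewritten_leftovers is hypothetical, and it should only be run on a copy, because it permanently discards the pre-rewrite history.

```shell
#!/bin/bash
# Sketch: discard what still references the pre-rewrite history, then prune.
# drop_rewritten_leftovers is a hypothetical helper name.
# WARNING: destructive; the old commits cannot be recovered afterwards.
drop_rewritten_leftovers() {
  local ref
  # 1. delete the backup refs that filter-branch left under refs/original
  git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
      git update-ref -d "$ref"
    done
  # 2. expire every reflog entry so old commits are no longer referenced
  git reflog expire --expire=now --all
  # 3. repack and prune unreachable objects immediately
  git gc --quiet --prune=now
}

# Example use, inside a copy of the rewritten repository:
#   drop_rewritten_leftovers && du -sh .git
```

The same documentation mentions cloning the rewritten repository through a file:// URL as an alternative, since such a clone copies only reachable objects.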