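I have a Git repository where I store random things: mostly scripts, text files, websites I've designed, and so on.
Over time I have deleted a number of large binary files (typically 1-5 MB). They still sit in the history and bloat the repository, and I don't need them in the revision history.
Basically, I want to be able to do something like this: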
me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old
...and then go through each result, check whether it is still needed, and remove it if not (probably using filter-branch).
This is an adaptation of the git-find-blob script I posted previously:
#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;
sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }
@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();
my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp;
sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

memoize 'walk_tree';

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}
A more compact Ruby script:
#!/usr/bin/env ruby -w
head, threshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
threshold = (threshold || 0.1).to_f * Megabyte

big_files = {}

IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= threshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end
Usage:
ruby big_file.rb [rev] [size in MB]
$ ruby big_file.rb master 0.3
3.8M    example/blah.psd    (aad2981: 4 months ago)
1.1M    another/big.file    (6e73ca2: 2 weeks ago)
A Python script that does the same thing (based on this post):
#!/usr/bin/env python
import os, sys

def getOutput(cmd):
    return os.popen(cmd).read()

if len(sys.argv) != 2:
    print("usage: %s size_in_bytes" % sys.argv[0])
else:
    maxSize = int(sys.argv[1])

    revisions = getOutput("git rev-list HEAD").split()

    bigfiles = set()
    for revision in revisions:
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            # Fields: mode, type, object SHA, size, path.
            # Limit the split so paths containing spaces stay intact.
            splitdata = file.split(None, 4)
            blob_sha = splitdata[2]
            if splitdata[3] == "-":
                # Trees and submodules have no size.
                continue
            size = int(splitdata[3])
            path = splitdata[4]
            if size > maxSize:
                bigfiles.add("%10d %s %s" % (size, blob_sha, path))

    bigfiles = sorted(bigfiles, reverse=True)

    for f in bigfiles:
        print(f)
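Ouch... that first script (Aristotle's) is pretty slow. On the git.git repo, looking for files > 100k, it chews up the CPU for about 6 minutes.
It also appears to print several wrong SHAs: often the SHA printed has nothing to do with the filename mentioned on the next line.
Here's a faster version. The output format is different, but it is very fast, and as far as I can tell, correct.
The program is a bit longer, but a lot of it is verbiage.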
#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );

my $min = shift;
$min =~ /^\d+$/ or die "need a number";

# ----------------------------------------------------------------------

my @refs = qw(HEAD);
@refs = @ARGV if @ARGV;

# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs ) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";

my ( $blob, $name );
my %name;
my %size;
while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);

# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";

my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}

my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;

say "
The size shown is the largest that file has ever attained. But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";

# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";

    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";
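For comparison, the same two-pass idea (list every object with git rev-list --objects, then ask git cat-file --batch-check for the sizes) can be sketched in Python. This is only a rough sketch, not a port of the script above: the helper names parse_rev_list, parse_batch_check, largest_blobs and run_git are made up for illustration, it assumes Python 3.7+ with git on the PATH, and it reports each large blob's abbreviated SHA rather than the most recent commit touching the file.

```python
#!/usr/bin/env python3
import subprocess
import sys


def parse_rev_list(text):
    """Map object SHA -> path from `git rev-list --objects` output."""
    names = {}
    for line in text.splitlines():
        sha, _, path = line.partition(" ")
        if path:  # commits and the top-level tree carry no path
            names[sha] = path
    return names


def parse_batch_check(text):
    """Map blob SHA -> size in bytes from `git cat-file --batch-check` output."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Default output is "<sha> <type> <size>"; unknown objects print "<sha> missing".
        if len(fields) == 3 and fields[1] == "blob":
            sizes[fields[0]] = int(fields[2])
    return sizes


def largest_blobs(names, sizes, min_size):
    """Return (size, sha, path) tuples at or above min_size, biggest first."""
    hits = [(size, sha, names[sha]) for sha, size in sizes.items()
            if size >= min_size and sha in names]
    return sorted(hits, reverse=True)


def run_git(args, stdin_text=None):
    """Run a git command and return its stdout as text."""
    result = subprocess.run(["git"] + args, input=stdin_text, check=True,
                            capture_output=True, text=True, errors="replace")
    return result.stdout


def main(argv):
    if len(argv) < 2:
        print("usage: %s min_size_in_bytes [ref ...]" % argv[0])
        return
    min_size = int(argv[1])
    refs = argv[2:] or ["HEAD"]
    names = parse_rev_list(run_git(["rev-list", "--objects"] + refs))
    # Feed every named object to cat-file in one batch; trees are filtered out later.
    sizes = parse_batch_check(run_git(["cat-file", "--batch-check"],
                                      "\n".join(names) + "\n"))
    for size, sha, path in largest_blobs(names, sizes, min_size):
        print("%d\t%s\t%s" % (size, sha[:7], path))


if __name__ == "__main__":
    main(sys.argv)
```

Because each line is split only at the first space, paths containing spaces survive intact. Note that when the same blob appears under several paths, only one path is kept per SHA.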
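You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch designed specifically for removing large files from Git repos.
Download the BFG jar (requires Java 6 or above) and run this command: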
$ java -jar bfg.jar --strip-blobs-bigger-than 1M my-repo.git
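Any files over 1M in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data: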
$ git gc --prune=now --aggressive
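To check that the cleanup actually shrank the repository, one option is to compare the numbers from git count-objects -v before and after. The snippet below is a small sketch under that assumption; parse_count_objects and repo_size_kib are hypothetical helper names, and it relies on the "size" and "size-pack" fields, which git reports in KiB.

```python
import subprocess


def parse_count_objects(text):
    """Parse `git count-objects -v` output into a dict of integer fields."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        # Skip non-numeric lines such as "alternate: <path>".
        if sep and value.strip().isdigit():
            stats[key] = int(value)
    return stats


def repo_size_kib(repo_dir="."):
    """Return loose + packed object size in KiB for the repository at repo_dir."""
    out = subprocess.run(["git", "count-objects", "-v"], cwd=repo_dir,
                         check=True, capture_output=True, text=True).stdout
    stats = parse_count_objects(out)
    return stats.get("size", 0) + stats.get("size-pack", 0)
```

Calling repo_size_kib("my-repo.git") once before running the BFG and once after git gc gives a rough before/after figure.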
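The BFG is typically 10-50x faster than running git-filter-branch, and its options are tailored around these two common use cases:
Removing crazy big files
Removing passwords, credentials and other private data
Full disclosure: I'm the author of the BFG Repo-Cleaner.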