
Finding files in a git repo over x megabytes that aren't in HEAD

Five good solutions to this question are collected below.

I have a Git repository where I store random things: mostly random scripts, text files, websites I've designed, and so on.

Over time I've deleted some large binary files (generally 1-5 MB). They still inflate the repository size, and I don't need them in the revision history.

Basically, I want to be able to do:

me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old

..and then be able to go through each result, check whether it's still needed, and remove it if not (probably with filter-branch).



1> Aristotle Pa..:

Here is an adaptation of the git-find-blob script I posted earlier:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

sub usage { die "usage: git-large-blob <size>[b|k|m] [<git log options>]\n" }

@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();

my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp; 

sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

memoize 'walk_tree';

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}


I'm having a hard time understanding this code. Any example of how to use this nice command?
Ah, never mind. No arguments needed; it just took a while before it output anything to the screen. git-large-blob 500k
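As the comment above shows, the script takes a size argument like `500k`. The size-unit handling in the Perl script (the `(\d+)([bkm]?)` match and the `2**$exp` cutoff) can be sketched in Python; `parse_size` is a hypothetical helper name, not part of the original script:

```python
import re

def parse_size(arg):
    """Parse a size argument such as '500k' or '3m' into bytes,
    mirroring the Perl script's (\\d+)([bkm]?) match and 2**exp cutoff."""
    m = re.match(r'^(\d+)([bkm]?)$', arg)
    if not m:
        raise ValueError("usage: <size>[b|k|m]")
    num, unit = int(m.group(1)), m.group(2)
    # 'b' -> 2**0, 'k' -> 2**10, 'm' or no unit -> 2**20 (megabytes by default)
    exp = {'b': 0, 'k': 10}.get(unit, 20)
    return num * 2 ** exp

print(parse_size('500k'))  # 512000
print(parse_size('2m'))    # 2097152
```

So `git-large-blob 500k` reports blobs of 512000 bytes or more.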

2> mislav..:

A more compact Ruby script:

#!/usr/bin/env ruby -w
head, treshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
treshold = (treshold || 0.1).to_f * Megabyte

big_files = {}

IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= treshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end

Usage:

ruby big_file.rb [rev] [size in MB]
$ ruby big_file.rb master 0.3
3.8M  example/blah.psd  (aad2981: 4 months ago)
1.1M  another/big.file  (6e73ca2: 2 weeks ago)
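The script's core step is splitting each `git ls-tree -rl` entry into its five fields, as `object.split(/\s+/, 5)` does in the Ruby above. A minimal Python sketch of that parsing (the SHA in the sample line is made up for illustration):

```python
def parse_ls_tree_line(line):
    """Split one 'git ls-tree -rl' entry into its five fields:
    mode, type, sha, size, path. maxsplit=4 keeps paths containing
    spaces intact, since only the first four whitespace runs split."""
    mode, otype, sha, size, path = line.split(None, 4)
    return mode, otype, sha, size, path

# sample entry; size is separated from the path by a tab
line = "100644 blob aad29819a908cc1c05c3b1102862746b 3984588\texample/blah.psd"
print(parse_ls_tree_line(line))
# ('100644', 'blob', 'aad29819a908cc1c05c3b1102862746b', '3984588', 'example/blah.psd')
```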


This is a good answer, but it does have one flaw. Large objects are stored in the `big_files` hash, which uses `sha` as the unique key. In theory that's fine: every object blob is unique, after all. In practice, though, it's quite possible to have _exactly_ the same file at several places in the repository. For example, it might be a test file that needs a different file name but not different _physical content_. **The problem arises if you see a large object at a path you don't need, without realizing that the same file also exists somewhere it is needed.**
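One way around the flaw described above is to key the results on path rather than SHA, so identical blobs reachable at several paths all get reported. A minimal sketch, where `biggest_per_path` and the sample data are illustrative, not part of the Ruby script:

```python
def biggest_per_path(entries, threshold):
    """entries: iterable of (sha, path, size, commit) tuples.
    Keep every distinct path at or over the threshold, so a blob that
    appears at several paths is reported once per path, not once per SHA."""
    big = {}
    for sha, path, size, commit in entries:
        if size >= threshold and (path not in big or size > big[path][1]):
            big[path] = (sha, size, commit)
    return big

entries = [
    ("abc123", "test/fixture_a.bin", 2_000_000, "c1"),
    ("abc123", "test/fixture_b.bin", 2_000_000, "c2"),  # same blob, second path
]
print(sorted(biggest_per_path(entries, 1_000_000)))
# ['test/fixture_a.bin', 'test/fixture_b.bin']
```

Keyed on SHA, only one of the two paths would have survived.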

3> SigTerm..:

A Python script that does the same thing (based on this post):

#!/usr/bin/env python

import os, sys

def getOutput(cmd):
    return os.popen(cmd).read()

if len(sys.argv) != 2:
    print "usage: %s size_in_bytes" % sys.argv[0]
else:
    maxSize = int(sys.argv[1])

    revisions = getOutput("git rev-list HEAD").split()

    bigfiles = set()
    for revision in revisions:
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            splitdata = file.split()
            commit = splitdata[2]
            if splitdata[3] == "-":
                continue
            size = int(splitdata[3])
            path = splitdata[4]
            if (size > maxSize):
                bigfiles.add("%10d %s %s" % (size, commit, path))

    bigfiles = sorted(bigfiles, reverse=True)

    for f in bigfiles:
        print f



4> 小智..:

Ouch... that first script (Aristotle's) is pretty slow. On the git.git repo, looking for files > 100k, it chews CPU for about 6 minutes.

It also appears to print several wrong SHAs: often the SHA printed has nothing to do with the filename mentioned on the next line.

Here is a faster version. The output format is different, but it is very fast, and it is also, as far as I can tell, correct.

The program is a bit longer, but a lot of that is verbiage.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );

my $min = shift;
$min =~ /^\d+$/ or die "need a number";

# ----------------------------------------------------------------------

my @refs = qw(HEAD);
@refs = @ARGV if @ARGV;

# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";

my ( $blob, $name );
my %name;
my %size;
while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);

# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";

my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}

my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;

say "
The size shown is the largest that file has ever attained.  But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";

# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";

    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";
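The key bookkeeping in the script above is the `%size` hash: for each filename it keeps the largest size that name has ever had across all the listed blobs, then sorts names by that size. A minimal Python sketch of that step (`max_size_per_name` and the sample data are illustrative):

```python
def max_size_per_name(blobs, min_size):
    """blobs: iterable of (name, size) pairs, one per blob version of a file.
    Record, per name, the largest size that name ever attained, skipping
    blobs under min_size; return (name, size) pairs, largest first."""
    sizes = {}
    for name, size in blobs:
        if size < min_size:
            continue
        if sizes.get(name, 0) < size:
            sizes[name] = size
    # like @names_by_size in the Perl: sort names by size, descending
    return sorted(sizes.items(), key=lambda kv: -kv[1])

blobs = [("big.psd", 4_000_000), ("big.psd", 1_500_000), ("notes.txt", 10)]
print(max_size_per_name(blobs, 1_000_000))  # [('big.psd', 4000000)]
```

As the script's own notice says, the size shown is the largest the file ever attained, which may differ from its size at the commit printed next to it.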



5> Roberto Tyle..:

You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch specifically designed for removing large files from Git repos.

Download the BFG jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar  --strip-blobs-bigger-than 1M  my-repo.git

Any file over 1 MB in size (that isn't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive

The BFG is typically 10-50x faster than running git-filter-branch, and its options are tailored around these two common use cases:

Removing crazy big files

Removing passwords, credentials and other private data

Full disclosure: I'm the author of the BFG Repo-Cleaner.


The BFG is great for removing files, but how do you find out _which_ files are going to be removed?