I'm trying to convert a large history from Perforce to Git, and one folder (now a git branch) contains a significant number of large binary files. My problem is that I'm running out of memory while running git gc --aggressive.
My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries. Compressing them another 20% would be great. 0.2% isn't worth my effort. If not, I'll have them skipped over as suggested <a href="https://stackoverflow.com/a/8686576" rel="nofollow">here</a>.
For background, I successfully used git p4 to create the repository in a state I'm happy with, but this uses git fast-import behind the scenes, so I want to optimize the repository before making it official; indeed, making any commits automatically triggered a slow gc --auto. It's currently ~35GB in a bare state.
The binaries in question seem to be, conceptually, the vendor firmware used in embedded devices. I think there are approximately 25 in the 400-700MB range and maybe a couple hundred more in the 20-50MB range. They might be disk images, but I'm unsure of that. There's a variety of versions and file types over time, and I see .zip, tgz, and .simg files frequently. As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?
These binaries are contained in one (old) branch that will be used exceedingly rarely (to the point that questioning whether it belongs in version control at all is valid, but that's out of scope). Certainly the performance of that branch does not need to be great, but I'd like the rest of the repository to be reasonable.
Other suggestions for optimal packing or memory management are welcome. I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --depth flags are doing in git repack. But the primary question is whether the repacking of the binaries themselves is doing anything meaningful.
<blockquote>My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries.</blockquote>
That depends on their contents. For the files you've outlined specifically:<blockquote>
I see .zip, tgz, and .simg files frequently.</blockquote>
Zipfiles and tgz (gzipped tar archive) files are already compressed and have terrible (i.e., high) <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)" rel="nofollow">Shannon entropy</a> values—terrible for Git, that is—and will not compress against each other. The .simg files are probably (I have to guess here) <a href="http://singularity.lbl.gov/docs-recipes" rel="nofollow">Singularity disk image files</a>; whether and how they are compressed, I don't know, but I would assume they are. (An easy test is to feed one to a compressor, e.g., gzip, and see if it shrinks.)
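Whether a given file is already compressed is easy to check empirically: run it through gzip and compare sizes. A quick sketch (using random data as a stand-in for one of the binaries, since random bytes are likewise incompressible):

```shell
# Entropy check: if gzip barely shrinks a file, it is already compressed,
# and Git's zlib/delta compression won't help either.
# 'sample.simg' is a stand-in built from random (incompressible) bytes;
# substitute one of your real firmware files.
head -c 1000000 /dev/urandom > sample.simg
before=$(wc -c < sample.simg)
after=$(gzip -c sample.simg | wc -c)
echo "before=$before after=$after"
```

If `after` is within a few percent of `before` (or even larger), repacking those files harder will buy you roughly nothing.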
<blockquote>As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?</blockquote>
Precisely. Storing them <em>uncompressed</em> in Git would thus, paradoxically, result in far greater compression in the end. (But the packing could require significant amounts of memory.)<blockquote>
If [this is probably futile], I'll have them skipped over as suggested <a href="https://stackoverflow.com/a/8686576" rel="nofollow">here</a>.</blockquote>
That would be my first impulse here. :-)<blockquote>
I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --depth flags are doing in git repack.</blockquote>
The various limits are confusing (and profuse). It's also important to realize that they don't get copied on clone, since they are in .git/config, which is not a committed file, so new clones won't pick them up. The .gitattributes file <em>is</em> copied on clone, and new clones will continue to avoid packing unpackable files, so it's the better approach here.
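Concretely, the approach from the linked answer amounts to a committed .gitattributes telling Git not to attempt delta compression on matching paths. A sketch, demonstrated in a scratch repository (the patterns are examples; adjust them to your actual file types):

```shell
# A committed .gitattributes travels with every clone, unlike .git/config.
git init -q attr-demo && cd attr-demo
cat > .gitattributes <<'EOF'
*.simg -delta
*.zip  -delta
*.tgz  -delta
EOF
git add .gitattributes
# Confirm the attribute applies to a matching path:
git check-attr delta -- firmware.simg   # prints "firmware.simg: delta: unset"
```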
(If you care to dive into the details, you will find some in <a href="https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt" rel="nofollow">the Git technical documentation</a>. This does not discuss precisely what the window sizes are about, but it has to do with how much memory Git uses to memory-map object data when selecting objects that might compress well against each other. There are two: one for each individual mmap of one pack file, and one for the total aggregate mmap of all pack files. Not mentioned on your link: core.deltaBaseCacheLimit, which is how much memory will be used to hold delta bases—but to understand this you need to grok delta compression and delta chains,<sup>1</sup> and read that same technical documentation. Note that Git will default to not attempting to pack any file object whose size exceeds core.bigFileThreshold. The various pack.* controls are a bit more complex: the packing is done multi-threaded to take advantage of all your CPUs if possible, and each thread can use a lot of memory. Limiting the number of threads limits total memory use: if one thread is going to use 256 MB, 8 threads are likely to use 8*256 = 2048 MB, i.e., 2 GB. The bitmaps mainly speed up fetching from busy servers.)
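Taken together, a lower-memory repack might look like the following. The specific values are illustrative, not recommendations, and the sketch runs in a scratch repository; run the config and repack lines in your real repository instead:

```shell
git init -q mem-demo && cd mem-demo
git config pack.threads 2                 # fewer threads => lower peak memory
git config pack.windowMemory 256m         # per-thread delta search window cap
git config pack.deltaCacheSize 128m       # cache of computed deltas
git config core.deltaBaseCacheLimit 128m  # cache of delta bases when reading
git config core.bigFileThreshold 50m      # larger objects are stored undeltified
git repack -a -d                          # plain repack, without --aggressive
```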
<sup>1</sup>They're not that complicated: a delta chain occurs when one object says "take object XYZ and apply these changes", but object XYZ itself says "take object PreXYZ and apply these changes". Object PreXYZ can also take another object, and so on. The <em>delta base</em> is the object at the bottom of this chain.
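You can see such chains in an existing pack with git verify-pack: in its verbose output, delta-compressed objects carry two extra columns, the chain depth and the SHA-1 of the base object. A sketch in a scratch repository:

```shell
# Build a tiny repo whose pack should contain a delta: two similar blobs.
git init -q chain-demo && cd chain-demo
seq 1 2000 > data.txt && git add data.txt
git -c user.name=t -c user.email=t@example.com commit -q -m v1
seq 1 2001 > data.txt && git add data.txt
git -c user.name=t -c user.email=t@example.com commit -q -m v2
git gc --quiet
# Per-object columns: SHA-1 type size packed-size offset [depth base-SHA-1];
# a summary of chain lengths follows the object list.
git verify-pack -v .git/objects/pack/*.idx
```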
<blockquote>Other suggestions for optimal packing or memory management are welcome.</blockquote>
Git 2.20 (Q4 2018) will have one: when there are too many packfiles in a repository (which is not recommended), looking up an object in these would require consulting many pack .idx files; <strong>a new mechanism to have a single file that consolidates all of these .idx files is introduced</strong>.
See <a href="https://github.com/git/git/commit/6a22d521260f86dff8fe6f23ab329cebb62ba4f0" rel="nofollow">commit 6a22d52</a>, <a href="https://github.com/git/git/commit/e9ab2ed7de33a399b44295628e587db6a57bf897" rel="nofollow">commit e9ab2ed</a>, <a href="https://github.com/git/git/commit/454ea2e4d7036862e8b2f69ef2dea640f8787510" rel="nofollow">commit 454ea2e</a>, <a href="https://github.com/git/git/commit/0bff5269d3ed7124259bb3a5b33ddf2c4080b7e7" rel="nofollow">commit 0bff526</a>, <a href="https://github.com/git/git/commit/29e2016b8f952c900b2f4ce148be5279c53fd9e3" rel="nofollow">commit 29e2016</a>, <a href="https://github.com/git/git/commit/fe86c3beb5893edd4e5648dab8cca66d6cc2e77d" rel="nofollow">commit fe86c3b</a>, <a href="https://github.com/git/git/commit/c39b02ae0ae90b9fda353f87502ace9ba36db839" rel="nofollow">commit c39b02a</a>, <a href="https://github.com/git/git/commit/2cf489a3bf75d7569c228147c3d9c559f02fd62c" rel="nofollow">commit 2cf489a</a>, <a href="https://github.com/git/git/commit/6d68e6a46174746d95373a47ab4ef4f57aa56d22" rel="nofollow">commit 6d68e6a</a> (20 Aug 2018) by <a href="https://github.com/derrickstolee" rel="nofollow">Derrick Stolee (
derrickstolee)</a>.<br /><sup>(Merged by <a href="https://github.com/gitster" rel="nofollow">Junio C Hamano --
gitster --</a> in <a href="https://github.com/git/git/commit/49f210fd5279eeb0106cd7e4383a1c4454d30428" rel="nofollow">commit 49f210f</a>, 17 Sep 2018)</sup>
<h2>pack-objects: consider packs in multi-pack-index</h2>
<blockquote>When running 'git pack-objects --local', we want to avoid packing objects that are in an alternate.<br /> Currently, we check for these objects using the packed_git_mru list, which excludes the pack-files covered by a multi-pack-index.</blockquote>
There is a new setting:<blockquote>
core.multiPackIndex:<br /> Use the multi-pack-index file to track multiple packfiles using a single index.</blockquote>
And that <a href="https://github.com/git/git/blob/49f210fd5279eeb0106cd7e4383a1c4454d30428/Documentation/git-multi-pack-index.txt" rel="nofollow">multi-pack index is explained here</a> and in <a href="https://github.com/git/git/blob/49f210fd5279eeb0106cd7e4383a1c4454d30428/Documentation/technical/multi-pack-index.txt" rel="nofollow">the multi-pack-index design document</a>:
The Git object directory contains a 'pack' directory containing:<ul><li><strong>packfiles</strong> (with suffix ".pack") and</li> <li><strong>pack-indexes</strong> (with suffix ".idx").</li> </ul><blockquote>
The pack-indexes provide a way to lookup objects and navigate to their offset within the pack, but these must come in pairs with the packfiles.<br /> This pairing depends on the file names, as the pack-index differs only in suffix with its pack-file.
While the pack-indexes provide fast lookup per packfile, this performance degrades as the number of packfiles increases, because abbreviations need to inspect every packfile and we are more likely to have a miss on our most-recently-used packfile.
<strong>For some large repositories, repacking into a single packfile is not feasible due to storage space or excessive repack times.</strong>
The <strong>multi-pack-index</strong> (<strong>MIDX</strong> for short) stores a list of objects and their offsets into multiple packfiles.<br /> It contains:<ul><li>A list of packfile names.</li> <li>A sorted list of object IDs.</li> <li>A list of metadata for the ith object ID including: <ul><li>A value j referring to the jth packfile.</li> <li>An offset within the jth packfile for the object.</li> </ul></li> <li>If large offsets are required, we use another list of large offsets similar to version 2 pack-indexes.</li> </ul>
<strong>Thus, we can provide O(log N) lookup time for any number of packfiles.</strong>
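If your Git is 2.20 or newer, you can try the feature directly. A sketch in a scratch repository (in a real repository you would just run the config and write steps):

```shell
# Write a multi-pack-index covering every pack in .git/objects/pack/.
git init -q midx-demo && cd midx-demo
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m init
git gc --quiet                        # ensure at least one packfile exists
git config core.multiPackIndex true   # let object lookups consult the midx
git multi-pack-index write            # writes .git/objects/pack/multi-pack-index
ls .git/objects/pack/multi-pack-index
```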