.SH Introduction
.PP
Checking large binary files into a source repository (Git or otherwise)
is a bad idea because repository size quickly becomes unreasonable.
Even if the working tree at any given commit stays manageable,
preserving repository integrity requires keeping every version of every
binary file in the project history.
Because binary diffs typically compress poorly, the repository
eventually becomes impractically large.
Some people recommend checking binaries into different repositories or
even not versioning them at all, but these are not satisfying solutions
for most workflows.
.SS Features of \f[C]git\-fat\f[R]
.IP \[bu] 2
clones of the source repository are small and fast because no binaries
are transferred, yet fully functional with complete metadata and
incremental retrieval (\f[C]git clone \-\-depth\f[R] has limited
granularity and couples metadata to content)
.IP \[bu] 2
\f[C]git\-fat\f[R] supports the same workflow for large binaries and
traditionally versioned files, but internally manages the \[lq]fat\[rq]
files separately
.IP \[bu] 2
\f[C]git\-bisect\f[R] works properly even when versions of the binary
files change over time
.IP \[bu] 2
selective control of which large files to pull into the local store
.IP \[bu] 2
local fat object stores can be shared between multiple clones, even by
different users
.IP \[bu] 2
can easily support fat object stores distributed across multiple hosts
.IP \[bu] 2
depends only on stock Python and rsync
.SS Related projects
.IP \[bu] 2
git\-annex (http://git-annex.branchable.com) is a far more comprehensive
solution, but with a less transparent workflow and more dependencies.
.IP \[bu] 2
git\-media (https://github.com/schacon/git-media) adopts a similar
approach to \f[C]git\-fat\f[R], but with a different synchronization
philosophy and with many Ruby dependencies.
.SH Installation and configuration
.PP
Place \f[C]git\-fat\f[R] in your \f[C]PATH\f[R].
.PP
Edit (or create) \f[C].gitattributes\f[R] to mark the desired file
extensions as fat files.
.IP
.nf
\f[C]
$ cd path\-to\-your\-repository
$ cat >> .gitattributes
*.png filter=fat \-crlf
*.jpg filter=fat \-crlf
*.gz filter=fat \-crlf
\[ha]D
\f[R]
.fi
.PP
Run \f[C]git fat init\f[R] to activate the extension.
Now add and commit as usual.
Matched files will be transparently stored externally, but will appear
complete in the working tree.
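.PP
As a rough sketch of what activation involves: the verbose output later
in this document names two filter commands,
\f[C]git\-fat filter\-clean\f[R] and \f[C]git\-fat filter\-smudge\f[R].
Assuming \f[C]git fat init\f[R] registers them as the \f[C]fat\f[R]
filter driver referenced from \f[C].gitattributes\f[R] (the exact keys
it writes are an assumption here, not taken from this document), the
resulting entry in \f[C].git/config\f[R] would look something like this:
.IP
.nf
\f[C]
[filter "fat"]
    clean = git\-fat filter\-clean
    smudge = git\-fat filter\-smudge
\f[R]
.fi
.PP
Because this lives in the per\-repository configuration, it is not
carried along by \f[C]git clone\f[R], which is why each new clone needs
its own \f[C]git fat init\f[R].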
.PP
Set a remote store for the fat objects by editing \f[C].gitfat\f[R].
.IP
.nf
\f[C]
[rsync]
remote = your.remote\-host.org:/share/fat\-store
\f[R]
.fi
.PP
This file should typically be committed to the repository so that others
will automatically have their remote set.
This remote address can use any protocol supported by rsync.
.PP
Most users will configure it to use rsync over ssh to a directory with
shared access.
To do this, set the \f[C]sshuser\f[R] and \f[C]sshport\f[R] variables in
the \f[C].gitfat\f[R] configuration file.
For example, to use rsync over ssh on the default port (22) and
authenticate as the user \[lq]\f[I]fat\f[R]\[rq], your configuration
would look like this:
.IP
.nf
\f[C]
[rsync]
remote = your.remote\-host.org:/share/fat\-store
sshuser = fat
\f[R]
.fi
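.PP
The example above leaves \f[C]sshport\f[R] unset because the default
port is used.
As a sketch, assuming the server instead listened on port 2222 (a
made\-up value for illustration), the same section would gain one more
line:
.IP
.nf
\f[C]
[rsync]
remote = your.remote\-host.org:/share/fat\-store
sshuser = fat
sshport = 2222
\f[R]
.fi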
.SH A worked example
.PP
Before we start, let\[cq]s turn on verbose reporting so we can see
what\[cq]s happening.
Without this environment variable, none of the output lines starting
with \f[C]git\-fat\f[R] are shown.
.IP
.nf
\f[C]
$ export GIT_FAT_VERBOSE=1
\f[R]
.fi
.PP
First, we create a repository and configure it for use with
\f[C]git\-fat\f[R].
.IP
.nf
\f[C]
$ git init repo
Initialized empty Git repository in /tmp/repo/.git/
$ cd repo
$ git fat init
$ cat > .gitfat
[rsync]
remote = localhost:/tmp/fat\-store
$ mkdir \-p /tmp/fat\-store # make sure the remote directory exists
$ echo \[aq]*.gz filter=fat \-crlf\[aq] > .gitattributes
$ git add .gitfat .gitattributes
$ git commit \-m\[aq]Initial repository\[aq]
[master (root\-commit) eb7facb] Initial repository
2 files changed, 3 insertions(+)
create mode 100644 .gitattributes
create mode 100644 .gitfat
\f[R]
.fi
.PP
Now we add a binary file whose name matches the pattern we set in
\f[C].gitattributes\f[R].
.IP
.nf
\f[C]
$ curl https://nodeload.github.com/jedbrown/git\-fat/tar.gz/master \-o master.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6449  100  6449    0     0   7741      0 \-\-:\-\-:\-\- \-\-:\-\-:\-\- \-\-:\-\-:\-\-  9786
$ git add master.tar.gz
git\-fat filter\-clean: caching to /tmp/repo/.git/fat/objects/b3489819f81603b4c04e8ed134b80bace0810324
$ git commit \-m\[aq]Added master.tar.gz\[aq]
[master b85a96f] Added master.tar.gz
git\-fat filter\-clean: caching to /tmp/repo/.git/fat/objects/b3489819f81603b4c04e8ed134b80bace0810324
1 file changed, 1 insertion(+)
create mode 100644 master.tar.gz
\f[R]
.fi
.PP
The patch itself is very simple and does not include the binary.
.IP
.nf
\f[C]
$ git show \-\-pretty=oneline HEAD
918063043a6156172c2ad66478c6edd5c7df0217 Add master.tar.gz
diff \-\-git a/master.tar.gz b/master.tar.gz
new file mode 100644
index 0000000..12f7d52
\-\-\- /dev/null
+++ b/master.tar.gz
\[at]\[at] \-0,0 +1 \[at]\[at]
+#$# git\-fat 1f218834a137f7b185b498924e7a030008aee2ae
\f[R]
.fi
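.PP
The placeholder line can be reproduced by hand, which makes the format
easier to see.
The sketch below assumes that GNU \f[C]sha1sum\f[R] is installed and
that the placeholder is simply a fixed prefix followed by the SHA1 of
the file contents; it is an illustration, not the code
\f[C]git\-fat\f[R] itself runs.
Run it in the original repository, where the working tree holds the
full file.
.IP
.nf
\f[C]
$ # 12\-character prefix + 40 hex digits + newline
$ printf \[aq]#$# git\-fat %s\[rs]n\[aq] "$(sha1sum master.tar.gz | cut \-d \[aq] \[aq] \-f 1)"
\f[R]
.fi
.PP
Under that assumption a placeholder is always 53 bytes long, whatever
the size of the file it stands for.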
.SS Pushing fat files
.PP
Now let\[cq]s push our fat files using the rsync configuration that we
set up earlier.
.IP
.nf
\f[C]
$ git fat push
Pushing to localhost:/tmp/fat\-store
building file list ...
1 file to consider

sent 61 bytes received 12 bytes 48.67 bytes/sec
total size is 6449 speedup is 88.34
\f[R]
.fi
.PP
Normally, we would now also set a git remote and push the repository
itself.
.SS Cloning and pulling
.PP
Now let\[cq]s look at what happens when we clone.
.IP
.nf
\f[C]
$ cd ..
$ git clone repo repo2
Cloning into \[aq]repo2\[aq]...
done.
$ cd repo2
$ git fat init # don\[aq]t forget
$ ls \-l # file is just a placeholder
total 4
\-rw\-r\-\-r\-\- 1 jed users 53 Nov 25 22:42 master.tar.gz
$ cat master.tar.gz # holds the SHA1 of the file
#$# git\-fat 1f218834a137f7b185b498924e7a030008aee2ae
\f[R]
.fi
.PP
We can always get a summary of what fat objects are missing in our local
cache.
.IP
.nf
\f[C]
$ git fat status
Orphan objects:
1f218834a137f7b185b498924e7a030008aee2ae
\f[R]
.fi
.PP
Now get any objects referenced by our current \f[C]HEAD\f[R].
This command also accepts the \f[C]\-\-all\f[R] option to pull full
history, or a revision to pull selected history.
.IP
.nf
\f[C]
$ git fat pull
receiving file list ...
1 file to consider
1f218834a137f7b185b498924e7a030008aee2ae
6449 100% 6.15MB/s 0:00:00 (xfer#1, to\-check=0/1)

sent 30 bytes received 6558 bytes 4392.00 bytes/sec
total size is 6449 speedup is 0.98
Restoring 1f218834a137f7b185b498924e7a030008aee2ae \-> master.tar.gz
git\-fat filter\-smudge: restoring from /tmp/repo2/.git/fat/objects/1f218834a137f7b185b498924e7a030008aee2ae
\f[R]
.fi
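.PP
The two variants mentioned above might be invoked as follows.
The tag name \f[C]v1.0\f[R] is hypothetical, and reading a revision
argument as \[lq]objects referenced by history reachable from that
revision\[rq] is an assumption rather than something this document
spells out.
.IP
.nf
\f[C]
$ git fat pull \-\-all   # fat objects for the full history
$ git fat pull v1.0    # fat objects for selected history only
\f[R]
.fi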
.PP
Everything is in place.
.IP
.nf
\f[C]
$ git status
git\-fat filter\-clean: caching to /tmp/repo2/.git/fat/objects/1f218834a137f7b185b498924e7a030008aee2ae
# On branch master
nothing to commit, working directory clean
$ ls \-l # recovered the full file
total 8
\-rw\-r\-\-r\-\- 1 jed users 6449 Nov 25 17:10 master.tar.gz
\f[R]
.fi
.SS Summary
.IP \[bu] 2
Set the \[lq]fat\[rq] file types in \f[C].gitattributes\f[R].
.IP \[bu] 2
Use normal git commands to interact with the repository without thinking
about which files are fat and which are not.
The fat files will be treated specially.
.IP \[bu] 2
Synchronize fat files with \f[C]git fat push\f[R] and
\f[C]git fat pull\f[R].
.SS Retroactive import using \f[C]git filter\-branch\f[R] [Experimental]
.PP
Sometimes large objects were added to a repository by accident or for
lack of a better place to put them.
\f[I]If\f[R] you are willing to rewrite history, forcing everyone to
reclone, you can retroactively manage those files with
\f[C]git fat\f[R].
Be sure that you understand the consequences of
\f[C]git filter\-branch\f[R] before attempting this.
This feature is experimental and irreversible, so be doubly careful with
backups.
.SS Step 1: Locate the fat files
.PP
Run \f[C]git fat find THRESH_BYTES > fat\-files\f[R] and inspect
\f[C]fat\-files\f[R] in an editor.
Lines are sorted by the maximum size of any object that has existed at
each path, and look like
.IP
.nf
\f[C]
something.big filter=fat \-text # 8154677 1
\f[R]
.fi
.PP
where the first number after the \f[C]#\f[R] is the size in bytes and
the second number is the number of modifications that path has seen.
You will normally filter out some of these paths using grep and/or an
editor.
When satisfied, remove the ends of the lines (including the \f[C]#\f[R])
and append the result to \f[C].gitattributes\f[R].
It\[cq]s best to \f[C]git add .gitattributes\f[R] and commit at this
time (likely enrolling some extant files into \f[C]git fat\f[R]).
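.PP
The trimming step can be done with \f[C]sed\f[R] instead of an editor.
This is only a sketch: the 1000000\-byte threshold is an arbitrary
example value, and the expression assumes that no path contains a
\f[C]#\f[R] character.
.IP
.nf
\f[C]
$ git fat find 1000000 > fat\-files
$ # review fat\-files and delete unwanted lines, then strip the trailing comments
$ sed \[aq]s/ *#.*$//\[aq] fat\-files >> .gitattributes
\f[R]
.fi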
.SS Step 2: \f[C]filter\-branch\f[R]
.PP
Copy \f[C].gitattributes\f[R] to \f[C]/tmp/fat\-filter\-files\f[R] and
edit to remove everything after the file name (e.g.,
\f[C]sed \[aq]s/ \[rs]+filter=fat.*$//\[aq]\f[R]).
Currently, this may only contain exact paths relative to the root of the
repository.
Finally, run
.IP
.nf
\f[C]
git filter\-branch \-\-index\-filter \[rs]
\[aq]git fat index\-filter /tmp/fat\-filter\-files \-\-manage\-gitattributes\[aq] \[rs]
\-\-tag\-name\-filter cat \-\- \-\-all
\f[R]
.fi
.PP
If you do not want every enrolled file appended to
\f[C].gitattributes\f[R], remove the \f[C]\-\-manage\-gitattributes\f[R]
option; future users would then need to use
\f[C].git/info/attributes\f[R] to have the \f[C]git fat\f[R] filters
run.
When this finishes, inspect the result to see that everything is in
order and follow the Checklist for Shrinking a Repository
(http://www.kernel.org/pub/software/scm/git/docs/git-filter-branch.html#_checklist_for_shrinking_a_repository)
in the \f[C]git filter\-branch\f[R] man page, typically
\f[C]git clone file:///path/to/repo\f[R].
Be sure to \f[C]git fat push\f[R] from the original repository.
.PP
See the script \f[C]test\-retroactive.sh\f[R] for an example of
cleaning.
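.PP
The list of paths for \f[C]/tmp/fat\-filter\-files\f[R] can also be
generated in one command.
The variant sketched here uses \f[C]sed \-n\f[R] with the \f[C]p\f[R]
flag so that lines of \f[C].gitattributes\f[R] that do not mention
\f[C]filter=fat\f[R] are dropped rather than copied through; it assumes
paths contain no spaces.
.IP
.nf
\f[C]
$ sed \-n \[aq]s/  *filter=fat.*$//p\[aq] .gitattributes > /tmp/fat\-filter\-files
\f[R]
.fi
.PP
Glob patterns such as \f[C]*.gz\f[R] would still pass through, so any
such lines have to be removed by hand to satisfy the exact\-path
requirement.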
.SS Implementation notes
.PP
The actual binary files are stored in \f[C].git/fat/objects\f[R],
leaving \f[C].git/objects\f[R] nice and small.
.IP
.nf
\f[C]
$ du \-bs .git/objects
2212 .git/objects/
$ ls \-l .git/fat/objects # This is where the file actually goes, but that\[aq]s not important
total 8
\-rw\-\-\-\-\-\-\- 1 jed users 6449 Nov 25 17:01 1f218834a137f7b185b498924e7a030008aee2ae
\f[R]
.fi
.PP
If you have multiple clones that access the same filesystem, you can
make \f[C].git/fat/objects\f[R] a symlink to a common location, in which
case all content will be available in all repositories without extra
copies.
You still need to \f[C]git fat push\f[R] to make it available to others.
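.PP
One way to set up such a shared store is sketched below.
The directory \f[C]/srv/fat\-objects\f[R] is a hypothetical location,
and the \f[C]mv\f[R] step assumes the local store already holds at least
one object (skip it if the store is empty).
.IP
.nf
\f[C]
$ mkdir \-p /srv/fat\-objects
$ mv .git/fat/objects/* /srv/fat\-objects/   # keep what is already cached
$ rmdir .git/fat/objects
$ ln \-s /srv/fat\-objects .git/fat/objects
\f[R]
.fi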
.SH Some refinements
.IP \[bu] 2
Allow pulling and pushing only select files
.IP \[bu] 2
Relate orphan objects to file system
.IP \[bu] 2
Put some more useful message in smudged (working tree) version of
missing files.
.IP \[bu] 2
More friendly configuration for multiple fat remotes
.IP \[bu] 2
Make commands safer in presence of a dirty tree.
.IP \[bu] 2
Private setting of a different remote.
.IP \[bu] 2
Gracefully handle unmanaged files when the filter is called (either
legacy files or files matching the pattern that should for some reason
not be treated as fat).