This commit changes the git fetch algorithm to only retrieve blobs
that aren't included in the previously deployed site manifest, if
git filters are supported by the remote.
It also changes how manifest entry sizes are represented, such that
both decompressed and compressed sizes are stored. This enables
computing accurate (and repeatable) sizes even after incremental
updates.
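
A minimal sketch of why storing both sizes makes the totals repeatable:
they can be recomputed from the manifest alone, whichever blobs an
incremental update actually fetched. The `Entry` type and its field
names are assumptions for illustration, not the actual manifest schema.

```go
package main

import "fmt"

// Entry is a hypothetical manifest entry; the field names are assumed.
type Entry struct {
	OriginalSize   int64 // size of the content after decompression
	CompressedSize int64 // size of the content as stored
}

// TotalSizes sums both sizes over every entry. The result depends only
// on the manifest, not on which blobs were downloaded in this update.
func TotalSizes(entries map[string]Entry) (original, compressed int64) {
	for _, e := range entries {
		original += e.OriginalSize
		compressed += e.CompressedSize
	}
	return original, compressed
}

func main() {
	manifest := map[string]Entry{
		"index.html": {OriginalSize: 2048, CompressedSize: 700},
		"style.css":  {OriginalSize: 1024, CompressedSize: 300},
	}
	original, compressed := TotalSizes(manifest)
	fmt.Println(original, compressed)
}
```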
Co-authored-by: David Leadbeater <dgl@dgl.cx>
The former metric was misnamed: it only counted NoSuchKey errors.
It was also applied *after* the cache, so it was just a count of
every request that received a 404 response from the S3 backend, and
it pooled blob and manifest requests together.
The new metric corresponds one-to-one to S3 requests and distinguishes
both the kind of error and the kind of request. Example output:
git_pages_s3_get_object_responses_count{code="NoSuchKey",kind="manifest"} 1
git_pages_s3_get_object_responses_count{code="OK",kind="blob"} 1
git_pages_s3_get_object_responses_count{code="OK",kind="manifest"} 1
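
The labeling scheme can be sketched roughly as follows. This is a
stand-in written with a plain map rather than the real metrics library;
`responseCounter`, `Observe`, and `Render` are hypothetical names, and
the sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labels identifies one time series of the counter.
type labels struct {
	code string // "OK" or an S3 error code such as "NoSuchKey"
	kind string // "blob" or "manifest"
}

// responseCounter is a hypothetical stand-in for a labeled counter.
type responseCounter map[labels]int

// Observe records exactly one response for one backend request.
func (c responseCounter) Observe(code, kind string) {
	c[labels{code: code, kind: kind}]++
}

// Render formats every series in the style of the example output,
// sorted so that the result is stable.
func (c responseCounter) Render(name string) string {
	lines := make([]string, 0, len(c))
	for l, n := range c {
		lines = append(lines, fmt.Sprintf("%s{code=%q,kind=%q} %d", name, l.code, l.kind, n))
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n")
}

func main() {
	counter := responseCounter{}
	counter.Observe("NoSuchKey", "manifest") // manifest lookup missed
	counter.Observe("OK", "manifest")        // manifest lookup succeeded
	counter.Observe("OK", "blob")            // blob lookup succeeded
	fmt.Println(counter.Render("git_pages_s3_get_object_responses_count"))
}
```

Observing at the point where the backend request is issued, rather than
after the cache, is what keeps the count one-to-one with S3 requests.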
Otherwise, an undesired degree of freedom permits a third party to
deny access to index site URLs by publishing projects with the same
name.
In the future, the _git-pages-repository TXT record format may be
extended to allow non-index sites to be specified without introducing
undesired degrees of freedom.
The HTTP endpoint is `/.git-pages/archive.tar` and it is gated behind
a feature flag `archive-site`. It serially downloads every blob and
writes it to the client in a chunked response, optionally compressed
with gzip or zstd as per `Accept-Encoding:`. It is authorized the same
way as `/.git-pages/manifest.json`, for the same reasons.
The CLI operation is `-get-archive <site-name>` and it writes a tar
archive to stdout. This could be useful for an administrator to review
the contents of a site in response to a report.
Both `_headers` and `_redirects` files are present in the output,
reconstituted from the manifest.
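
A rough sketch of the serial tar writing with optional gzip, using only
the Go standard library. The function names and the in-memory `files`
map are assumptions for illustration; zstd is left out because it is
not in the standard library, and the `Accept-Encoding` check is a plain
substring match rather than real header parsing.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sort"
	"strings"
)

// writeArchive writes every file to w as a tar stream, one entry after
// another, in sorted order so that the output is reproducible.
func writeArchive(w io.Writer, files map[string][]byte) error {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)
	tw := tar.NewWriter(w)
	for _, name := range names {
		data := files[name]
		hdr := &tar.Header{Name: name, Typeflag: tar.TypeReg, Mode: 0644, Size: int64(len(data))}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write(data); err != nil {
			return err
		}
	}
	return tw.Close()
}

// writeEncoded wraps w in gzip when acceptEncoding mentions it and
// returns the encoding that was used.
func writeEncoded(w io.Writer, acceptEncoding string, files map[string][]byte) (string, error) {
	if !strings.Contains(acceptEncoding, "gzip") {
		return "identity", writeArchive(w, files)
	}
	gz := gzip.NewWriter(w)
	if err := writeArchive(gz, files); err != nil {
		return "", err
	}
	return "gzip", gz.Close()
}

// roundTrip is a demo helper: it writes an archive into memory and
// lists the entry names read back from it.
func roundTrip(acceptEncoding string, files map[string][]byte) (string, []string, error) {
	var buf bytes.Buffer
	encoding, err := writeEncoded(&buf, acceptEncoding, files)
	if err != nil {
		return "", nil, err
	}
	var r io.Reader = &buf
	if encoding == "gzip" {
		gz, err := gzip.NewReader(&buf)
		if err != nil {
			return "", nil, err
		}
		r = gz
	}
	var names []string
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return encoding, names, nil
		}
		if err != nil {
			return "", nil, err
		}
		names = append(names, hdr.Name)
	}
}

func main() {
	files := map[string][]byte{
		"index.html": []byte("<h1>hello</h1>"),
		"_redirects": []byte("/old /new 301\n"),
	}
	encoding, names, err := roundTrip("gzip, zstd", files)
	if err != nil {
		panic(err)
	}
	fmt.Println(encoding, names)
}
```

In the sketch the whole site is held in memory for brevity; writing to
the response writer directly would keep the streaming behavior
described above.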
This matches the behavior of GitHub. It also isn't particularly useful
(and is quite confusing) to serve a file from the index repo whose
path segment is the same as the project name.
This size is not used by git-pages itself, and is not representative of
storage needs, but may be used for estimating how large a site would
be if downloaded in its entirety.
Previously, this method would match only hosts of the form:
user.host.com
This changeset allows matches on hosts of the form:
user.org.host.com
user.organization.com.host.com
This may be the pattern that tangled.org uses for its hosted instance
of git-pages.
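
A sketch of the difference between the two behaviors, assuming the
method strips a configured base domain and inspects the labels in front
of it. `matchHost`, `baseDomain`, and `multiLabel` are hypothetical
names; the actual method may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// matchHost reports whether host lies under baseDomain and returns the
// labels in front of it. With multiLabel false only a single label is
// accepted (the previous behavior); with multiLabel true any number of
// non-empty labels is accepted (the new behavior).
func matchHost(host, baseDomain string, multiLabel bool) ([]string, bool) {
	host = strings.ToLower(host)
	suffix := "." + strings.ToLower(baseDomain)
	if !strings.HasSuffix(host, suffix) {
		return nil, false
	}
	parts := strings.Split(strings.TrimSuffix(host, suffix), ".")
	for _, p := range parts {
		if p == "" {
			return nil, false // reject empty labels such as "a..host.com"
		}
	}
	if len(parts) > 1 && !multiLabel {
		return nil, false
	}
	return parts, true
}

func main() {
	hosts := []string{
		"user.host.com",
		"user.org.host.com",
		"user.organization.com.host.com",
	}
	for _, h := range hosts {
		_, before := matchHost(h, "host.com", false)
		_, after := matchHost(h, "host.com", true)
		fmt.Println(h, before, after)
	}
}
```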
Signed-off-by: oppiliappan <me@oppi.li>