Before this commit, a `_git-pages-repository.<host>` TXT record would
allow both forge DNS allowlist authorization, as well as normal DNS
allowlist authorization. This means that a site set up to have its
contents updated by a Forgejo Action could have its contents replaced
by the contents of the repository which contains the Forgejo Action,
which will effectively erase the site in most cases. This is a classic
confused deputy scenario.
To fix this, forge DNS allowlist authorization now uses a distinct
`_git-pages-forge-allowlist.<host>` TXT record, removing ambiguity
that allows this scenario to happen.
The issue was introduced in 27a6de792c
and existed in `main` for about a hour, so it is unlikely anybody
has been impacted by this.
The new authorization method combines DNS allowlist and existing forge
authorization methods: DNS records are used to determine the allowed
repository URL, and forge authorization is used to check for push
permissions to that URL.
This commit unifies most of the implementation of `AuthorizeDeletion`
and `AuthorizeUpdateFromArchive`, with the latter additionally checking
that the repository URL in the authorization grant follows the limits.
This is done in preparation of adding a second forge authorization
sub-mechanism that can handle non-wildcard domains.
Before:
- not authorized by forge (wildcard)
- cannot check repository permissions: GET https://codeberg.org/api/v1/repos/whitequark/whitequark.codeberg.page returned 401 Unauthorized
After:
- not authorized by forge (wildcard)
- no access to whitequark/whitequark.codeberg.page or invalid token
The actual Codeberg Pages v2 server uses the Forgejo default branch
for the index repository. The quirk previously used the `main` branch
unconditionally.
This is complex to implement, so per discussion with gusted we have
decided to change the default branch to `pages` so that it has parity
with non-Codeberg-specific behavior.
The git-pages webhook security model depends on there being
a 1:1 mapping between site URLs and repositories; being able to
specify multiple of them breaks this model, as anyone could switch
the published site from one to the other if both repositories exist.
This is added to aid migration from Codeberg Pages v2. Forgejo allows
both `_` and `-` in usernames, and it is necessary to be able to accept
host names like `user_name.codeberg.page` under a wildcard domain.
(It is not possible to get a TLS certificate for a host name like this,
so only a wildcard certificate will be able to cover it.)
Previously, you could issue e.g. a `GET /%2e%2e/%2e%2e` and it would
get interpreted as a parent directory path segment in the handler.
This didn't result in a path traversal vulnerability when passed to
the S3 backend because of a `path.Clean()` call indirectly done by
`makeWebRoot()`, but it's prudent to not take chances.
Previously, this would disallow all git clones except for those via
wildcard domains. This is highly unintuitive. It also meant that
disabling this function via environment variable was not possible.
Otherwise, an undesired degree of freedom permits a third party to
deny access to index site URLs by publishing projects with the same
name.
In the future, the _git-pages-repository TXT record format may be
extended to allow non-index sites to be specified without introducing
undesired degrees of freedom.
The HTTP endpoint is `/.git-pages/archive.tar` and it is gated behind
a feature flag `archive-site`. It serially downloads every blob and
writes it to the client in a chunked response, optionally compressed
with gzip or zstd as per `Accept-Encoding:`. It is authorized the same
as `/.git-pages/manifest.json`, for the same reasons.
The CLI operation is `-get-archive <site-name>` and it writes a tar
archive to stdout. This could be useful for an administrator to review
the contents of a site in response to a report.
Both `_headers` and `_redirects` files are present in the output,
reconstituted from the manifest.
This means that e.g. `https://site.tld.` will be treated the same as
`https://site.tld`. In DNS, the trailing empty label means "root domain"
and is usually ignored when present. There are some sites with links
that don't work otherwise.
This is done using reflection to avoid boilerplate and potential desync
of the two configuration interfaces. The `[[wildcards]]` section did
not fit well into the "splat every config key" paradigm, so it is
unmarshalled as a whole from a JSON payload in an environment variable.
This commit also splits up the `Config` type into small per-section
struct types and removes most references to the global `config` in
favor of passing pointers to sections around.
A new option, `-print-config-env-vars`, shows the names and types of
all of the available configuration knobs.
As the rules for serving a site get more complex, being able to see
the git-pages' view of the site structure will become increasingly
valuable.
Unauthorized clients are rejected to make enumeration more difficult.
While git-pages isn't designed to serve sensitive data, it is prudent
to recognize that someone somewhere will do it anyway.