Some of these have been around since the Datastore days and are no
longer relevant (dealing with things like Datastore foreign keys). Let's
simplify things.
This works fairly similarly to the registry lock request and
verification mechanism. The request action generates a UUID which is
emailed (in link form) to the user in question. The frontend will then send
a request to the verify action with the UUID, which should finalize the
action.
EPP password requests can be sent by anyone with edit-registrar
permissions and must be approved by an admin POC email.
Registry lock password resets can only be sent by primary contacts, and
are verified/performed by the user in question.
This prohibits all contact data in EPP create and update flows, for both
domains and contacts. It also refactors how default values on FeatureFlags work, as
it's safer to specify a single default on the flag itself rather than have to
specify it independently at a number of callsites (and potentially end up having
an inconsistent value). Domain updates on existing domains that still have
contact data will fail unless all contact data is removed, as a forcing function
to require registrars to rectify the situation prior to being able to do any
other kind of domain changes.
Contact-related flows that are still allowed after this point: Updating a domain
to remove all contacts from it, and deleting a contact object.
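As for the FeatureFlag default refactor mentioned above, here's a minimal sketch
(hypothetical names and flag; not the actual implementation) of what "a single
default on the flag itself" looks like:

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: the default lives next to the feature name itself,
// rather than being repeated (and possibly diverging) at each callsite.
enum FeatureStatus { ACTIVE, INACTIVE }

enum FeatureName {
  MINIMUM_DATASET_CONTACTS_PROHIBITED(FeatureStatus.INACTIVE); // illustrative

  private final FeatureStatus defaultStatus;

  FeatureName(FeatureStatus defaultStatus) {
    this.defaultStatus = defaultStatus;
  }

  FeatureStatus defaultStatus() {
    return defaultStatus;
  }
}

final class FeatureFlags {
  // Stand-in for the persisted flags; empty means "never configured".
  private static final Map<FeatureName, FeatureStatus> persisted = Map.of();

  static FeatureStatus getStatus(FeatureName name) {
    // Callsites never supply their own default; a missing flag always falls
    // back to the one default defined alongside the feature name.
    return Optional.ofNullable(persisted.get(name)).orElse(name.defaultStatus());
  }
}
```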
This corresponds to the Feb 2024 response profile section 1.2 and
implementation guide section 1.3 respectively, now that we comply (or are
at least closer to complying) with the Feb 2024 versions.
This should probably depend on https://github.com/google/nomulus/pull/2771,
because that includes a small change from the Feb 2024 version.
This also updates the documentation to reference the proper areas of the
specifications.
From the response profile:
2.4.6. Registrar URL - The entity with the registrar role in the RDAP response
MUST contain a links member [RFC9083]. The links object MUST contain
the elements: value, identical to the RDAP Base URL for the
Registrar as provided in the IANA “Registrar IDs” registry (i.e.,
https://www.iana.org/assignments/registrar-ids); rel:about, and href
containing the Registrar URL. Note: in cases where the Registry Operator
acts as sponsoring Registrar (e.g., IANA Registrar ID 9999), the href shall
contain a URL from the Registry.
This is necessary so that the total number of requests/responses adds up
correctly even though some fraction of them are only being recorded. It uses
stochastic rounding so that the totals add up correctly even when the reciprocal
of the ratio isn't an integer.
This is a follow-up to PR #2772.
This ratio defaults to 1.0 (i.e. all metrics will be recorded), but we will set
it much lower in sandbox and production, probably something closer to 0.01. This
will reduce recorded metrics volume and thus StackDriver cost, while still
retaining enough data for overall performance monitoring.
This is handled stochastically, so as to not require any coordination between
Java threads or GKE pods/clusters, as alternative approaches would (i.e. using a
counter and recording every Nth, or throttling to a max metrics qps).
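For illustration, here's a minimal sketch (hypothetical class, not the actual
implementation) of ratio-based recording with stochastic rounding: each event is
recorded with probability `ratio`, and the recorded increment is `1/ratio`
rounded up or down at random so its expected value is exactly `1/ratio`.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of sampled metric recording whose expected totals stay exact. */
final class SampledIncrement {
  /**
   * Returns how much to add to the metric for one event: usually 0, and
   * occasionally a stochastically rounded 1/ratio, so totals add up even
   * when 1/ratio is not an integer.
   */
  static long next(double ratio) {
    ThreadLocalRandom rng = ThreadLocalRandom.current();
    if (rng.nextDouble() >= ratio) {
      return 0; // this event is not recorded at all
    }
    double reciprocal = 1.0 / ratio; // e.g. 100.0 for ratio = 0.01
    long floor = (long) reciprocal;
    double fraction = reciprocal - floor;
    // Round up with probability equal to the fractional part.
    return rng.nextDouble() < fraction ? floor + 1 : floor;
  }
}
```

No coordination between threads or pods is needed: each call makes an
independent random decision, and the totals are correct in expectation.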
In RDAP, domain queries are the most common by a factor of roughly 40,000,
so we should optimize these as much as possible. We already have an EPP
resource / foreign key cache which does improve performance somewhat, but
looking at some sample logs, it only cuts the RDAP request times by about
40% (looking at requests for the same domain a few seconds apart).
History entries don't change often, so we should cache them to make
subsequent queries faster as well. In addition, we're only caching two
fields per repo ID (modification time, registrar ID) so we can cache
more entries than we can for the EPP resource cache (which stores large
objects).
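A sketch of the shape of such a cache (assumed loader, tuning values, and field
names; Caffeine is used for illustration), keyed by repo ID and storing only the
two needed fields:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;
import org.joda.time.DateTime;

/** Sketch: the two cached fields per domain repo ID. */
record LastHistory(DateTime modificationTime, String registrarId) {}

final class HistoryCache {
  // Small values mean we can afford far more entries than the EPP
  // resource cache, which stores entire resource objects.
  static final LoadingCache<String, LastHistory> CACHE =
      Caffeine.newBuilder()
          .expireAfterWrite(Duration.ofMinutes(10)) // assumed tuning
          .maximumSize(100_000) // assumed tuning
          .build(HistoryCache::loadFromReplica);

  private static LastHistory loadFromReplica(String repoId) {
    // Hypothetical loader: query the replica for the latest history entry's
    // modification time and registrar ID for this repo ID.
    throw new UnsupportedOperationException("stand-in for the DB query");
  }
}
```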
Pubapi actions should always use cache, regardless of the config
settings on caching.
In EppResource.java, the original `loadCached(Iterable<VKey>)`
method is renamed to `loadByCacheIfEnabled`. The original
`loadCached(VKey)` method is renamed to `loadByCache` and always
uses the cache.
In EppResourceUtils.java, the original `loadByForeignKeyCached`
method is renamed to `loadByForeignKeyByCacheIfEnabled`, and a new
`loadByForeignKeyByCache` method is added which always uses the cache.
In ForeignKeyUtils.java, the original `loadCached` method is
renamed to `loadByCacheIfEnabled`, and a new `loadByCache` method
is added which always uses the cache.
Also added a `getContactsFromReplica` method in Registrar,
for use by RDAP actions.
A future PR will add the actions that save and use this object. That
future PR will also require loading RegistrarPoc objects given the
registrar ID, hence the change in that class.
This implements two types of changes:
1. changing the link type for things like the terms of service
2. adding the request URL to every link via the "value" field.
This is a bit tricky to implement because the links are generated in
various places, but we can implement it by adding it to the results
after generation.
See b/418782147 for more information
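For illustration, a sketch (using Gson; the production response objects may be
built differently) of adding the request URL to every link after generation:

```java
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;

final class RdapLinkValues {
  /** Recursively adds "value": requestUrl to every member of every "links" array. */
  static void addValueToLinks(JsonElement node, String requestUrl) {
    if (node.isJsonObject()) {
      JsonObject obj = node.getAsJsonObject();
      if (obj.has("links") && obj.get("links").isJsonArray()) {
        for (JsonElement link : obj.getAsJsonArray("links")) {
          link.getAsJsonObject().addProperty("value", requestUrl);
        }
      }
      for (var entry : obj.entrySet()) {
        addValueToLinks(entry.getValue(), requestUrl);
      }
    } else if (node.isJsonArray()) {
      for (JsonElement child : node.getAsJsonArray()) {
        addValueToLinks(child, requestUrl);
      }
    }
  }
}
```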
Add a new DnsWritersModule for use by the component classes.
To override the set of writers installed, we can easily overwrite this
file with a private version.
This is necessary because we'll use primary-contact emails as a way of
resetting passwords.
In the UI, don't allow editing of email address for primary contacts,
and don't allow addition/removal of the primary contact field
post-creation.
In the backend, make sure that all emails previously added still exist.
We're changing the way that allocation tokens work in suboptimal (i.e. incorrect) situations in the domain check, creation, and renewal process. Currently, if a token is not applicable, in any way, to any of the operations (including when a check has multiple operations requested), we return some variation of "Allocation token not valid" for all of those operations. We wish to allow for a more lenient process, where if a token is "not applicable" rather than "invalid", we just pass through that part of the request as if the token were not there.
Types of errors that will remain catastrophic, where we'll basically return a token error immediately in all cases:
- nonexistent or null token
- token is assigned to a particular domain and the request isn't for that domain
- token is not valid for this registrar
- token is a single-use token that has already been redeemed
- token has a promotional schedule and it's no longer valid
Types of errors that will now be a silent pass-through, as if the user did not issue a token:
- token is not allowed for this TLD
- token has a discount, is not valid for premium names, and the domain name is premium
- token does not allow the provided EPP action
Currently, the last three types of errors cause that generic "token invalid" message but in the future, we'll pass the requests through as if the user did not pass in a token. This does allow for a default token to apply to these requests if available, meaning that it's possible that a single DomainCheckFlow with multiple check requests could use the provided token for some check(s), and a default token for others.
The flip side of this is that if the user passes in a catastrophically invalid token (the first five error messages above), we will return that result to any/all checks that they request, even if there are other issues with that request (e.g. the domain is reserved or already registered).
See b/315504612 for more details and background
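A sketch of how the validation might be split (hypothetical names, following
the "valid in general" vs. "valid for this domain/action" distinction): the
five catastrophic cases throw immediately, while the three applicability
cases silently make the token act as though it were absent.

```java
import java.util.Optional;

/** Sketch (hypothetical names) of the two-phase allocation token validation. */
final class TokenValidation {
  interface AllocationToken {}

  interface DomainCommand {}

  /** Thrown only for the five catastrophic cases; fails the whole request. */
  static class InvalidAllocationTokenException extends Exception {}

  /**
   * Returns the token when it applies to this command, or empty so the flow
   * proceeds as if no token were supplied (letting a default token apply).
   */
  static Optional<AllocationToken> tokenFor(AllocationToken token, DomainCommand command)
      throws InvalidAllocationTokenException {
    validateInGeneral(token); // wrong domain/registrar, already redeemed, expired, etc.
    return isApplicableTo(token, command) // TLD, premium-name, and EPP-action checks
        ? Optional.of(token)
        : Optional.empty(); // silent pass-through
  }

  private static void validateInGeneral(AllocationToken token)
      throws InvalidAllocationTokenException {
    // Stand-in for the catastrophic checks described above.
  }

  private static boolean isApplicableTo(AllocationToken token, DomainCommand command) {
    // Stand-in for the per-command applicability checks described above.
    return true;
  }
}
```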
We suspected this could be a cause of optimistic locking failures
(because long transactions would lead to optimistic locks not being
released) but this didn't end up being the case. Let's remove this to
reduce log spam.
1. This doesn't remove the SQL tables yet (this is necessary to pass
tests and also good practice just in case we need or want to look at
history for a little bit)
2. This also removes the Registrar, RegistrarPoc, and User base classes
that were only necessary because we were saving copies of those
objects in the old history classes.
We no longer need to union GKE+GAE logs since we've moved all production
traffic to GKE only.
For testing, I copied the affected *_test.sql files to Bigquery, removed
all the "-alpha" bits, and changed the dates to 20250301 and 20250331
and ran them to make sure they returned the expected data.
Now that we have effective global sessions thanks to #2734, there is no
longer a need to keep the number of pods on the EPP service static.
We are also not vulnerable to random pod restarts. K8s never guarantees
perpetual pod lifetime anyway, and not having to be at its mercy is
certainly a relief.
It turns out a period can be used in the URI, such as in
"urn:ietf:params:xml:ns:fee-0.12". I don't think the pipe character is used,
at least not according to the EPP URI namespace naming convention.
Ideally we'd use serialization, but using the default serialization runs
the risk of it being platform/JDK dependent, so a new deployment might
not be able to deserialize existing cookies. A custom serializer that
guarantees stability would have been needed.
This was added back in early 2018 to enable promotions, but
since then (and for many years) we've added the ability to run
promotions on the tokens themselves, rather than relying on custom Java
classes.
This will make the changes for b/315504612 much easier, as that will
split up token validation into "is this token valid in general?" and "is
this token valid for this domain/action?"
This can potentially help even more with serializable transaction
failures (optimistic locking exceptions, which are expected to occur
somewhat frequently).
With six attempts, we will sleep at most five times, for
100+200+400+800+1600 ms, for a total of at most 3.1 seconds (much
less than the EPP maximum, which I believe to be 30 seconds).
In addition, we add a 20% skew in an attempt to spread out
possibly-conflicting transaction retries.
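Concretely, the schedule described above looks something like this sketch
(reading the 20% skew as ±20% jitter; the exception type is a stand-in):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

final class TransactionRetrier {
  static final int MAX_ATTEMPTS = 6;

  static <T> T retry(Supplier<T> work) throws InterruptedException {
    for (int attempt = 0; ; attempt++) {
      try {
        return work.get();
      } catch (RuntimeException e) { // stand-in for the optimistic lock failure
        if (attempt == MAX_ATTEMPTS - 1) {
          throw e; // six attempts total, five sleeps
        }
        long baseMillis = 100L << attempt; // 100, 200, 400, 800, 1600
        double skew = 1.0 + 0.4 * (ThreadLocalRandom.current().nextDouble() - 0.5);
        Thread.sleep((long) (baseMillis * skew)); // ±20% jitter
      }
    }
  }
}
```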
This changes the code to only save console histories of this type. We
keep the old Java code (and, necessarily, the corresponding SQL code)
for now because there's no harm in doing so and we want to avoid hastily
deleting too much.
The SQL statement was incorrectly flooring to zero one layer too deep, which was
zeroing out all negative transaction report rows (which occur most frequently when
a domain in the autorenew grace period is deleted). I've changed it so that it
now only floors to zero at the report level, which still solves the issue
reported in http://b/290228682, but whose original fix caused the issue
http://b/344645788.
This bug was introduced in https://github.com/google/nomulus/pull/2074
I tested this by running the new query against the DB for 2024 Q4 using the
registrar that was having issues and confirmed that the total renewal numbers
for .app now match with the sum total of what we invoiced for the last three
months of 2024.
This doesn't check for correctness (we have other scripts that do that)
but just that the service is available at all (the other scripts do not
do that).
This should, and will, be configured with a scheduled trigger in GCB (for us, in
the domain-registry-dev project) and configuration to send some sort of
pub/sub notification on failure (for us, this is already set up on
domain-registry-dev and it sends messages to the "Domain Registry
Notifications" chat channel).
We've seen this issue happen more often than not recently, where the GAE
canary deployment is stuck for about 10 min and then fails. The reason
is not clear, but deleting the canary version prior to a deployment always
fixes the issue.
We use the $environment property to set the console config. If it is not
given, 'alpha' is used, which has the same effect as 'production'.
TESTED=ran :jetty:copyConsole with
-Penvironment=(sandbox|production|alpha) and checked the resulting js
file.
For GKE, all tasks should be on the backend service; BSA was on its own
service because of an egress IP constraint.
Also made it possible to specify a timeout for the Cloud Scheduler job,
with the default (3m) suitable for most tasks.
There are two session cookies: JSESSIONID, which is set by Jetty, and
GCLB, which is set by the Gateway.
In one session, every request other than the first one (the <hello>)
should have the same GCLB value, and every request after a successful
<login> should have the same JSESSIONID.
With these two pieces of metadata, we should be able to trace all requests that
*should* belong to the same session and debug issues with session
mismatch (if any).
There can be delays in releasing predicate locks when we have
long-lived transactions -- even delays in releasing predicate locks
acquired by shorter-lived transactions. Logging the transaction
duration will let us get a sense of how long transactions run during
busy times.
This can happen on public endpoints (in pubapi) where the service is
behind IAP but all users (including not-logged-in ones) are allowed. IAP
will add an OIDC token with no email field in the request header.
This removes logic from an inner helper method so that it becomes more clear
from callsites within each test exactly which behavior is expected from those
test conditions.
We now pin to PostgreSQL v17 when running tests, which means that the minor
version might increase without our intervention. This causes (at least)
the comment in the golden schema to change, failing the test as a
result.
This PR adds the ability to strip lines that we deem comments from the
comparison, so we don't have to do trivial upgrades to the golden schema
whenever there's a minor version upgrade.
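The comparison tweak amounts to something like this sketch: filter out
whole-line SQL comments on both sides before diffing.

```java
import java.util.stream.Collectors;

final class SchemaComparison {
  /** Drops whole-line SQL comments (e.g. pg_dump's version banner) before diffing. */
  static String withoutComments(String schema) {
    return schema
        .lines()
        .filter(line -> !line.strip().startsWith("--"))
        .collect(Collectors.joining("\n"));
  }
}
```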
This allows us to easily tell which tag was deployed.
Also set the gateway to use named addresses so they are stable, and so
that we can attach an IPv6 record to them. Auto-provisioned addresses are
IPv4 only.
A failed pg_dump may not leave a file, breaking the subsequent diff and
causing the verifier to return success.
The verifier should abort in this case.
GKE logs are routed to a different dataset and the table is different.
The structs to look for are also different (jsonPayload vs textPayload
or protoPayload).
TESTED=Ran the resulting query in crash.
I obtained access to an IBM s390x VM, so I thought I'd see how multi-arch
Nomulus is.
Our main application is in Java, so it is already multi-arch, but several
tests use Docker images that are x64 by default. Luckily, Postgres has an
s390x port, but Selenium does not, so I had to disable Screenshot tests
when the arch is not amd64.
This isn't a situation we'll encounter often, but if the client has an
allocation token that's valid for premium domains that gives a 0 cost,
we shouldn't require them to include the fee extension when creating the
domain. We already don't require it for standard domains.
We use this now for the DomainDeleteFlow in an attempt to figure out
what statements it's running (cross-referencing that with PSQL's own
statement logging to find slow statements).
This is kinda nonsensical, because this use case is trying to apply a
single-use token multiple times in the same domain:check request --
trying to use a single-use token for create, renew, and transfer at once
while having a $0 create price and a premium renewal price.
This change doesn't affect any actual business / costs, since SPECIFIED
token renewal prices were already set on the BillingRecurrence.
Instead of using a separate RenewalPriceInfo object, just map the
behavior (if it exists) onto the BillingRecurrence with a special
carve-out, as always, for anchor tenants (note: this shouldn't matter
much since anchor tenants *should* use NONPREMIUM renewal tokens anyway,
but we double-check just in case).
This also fixes DomainPricingLogic to treat a multiyear create as a
one-year-create + n-minus-1-year-renewal for cases where either the
creation or the renewal (or both) are nonpremium.
We are required to respond to HTTP(S) requests on port 80/443 on the
same domain where we serve port 43 WHOIS requests. The proxy already
does this by redirecting to the web WHOIS lookup page on the marketing
website.
This PR makes it so that requests to port 80/443 can be routed to the
proxy for redirect.
TESTED=tested on crash and the redirect works.
The SecretManager-based keyring was not included in the keyring bindings,
resulting in a runtime failure.
We should simplify keyring bindings. There is no use case for multiple
implementations. See b/388835696.
Known nested transact calls found in sandbox have been refactored away.
Enable logging in production to catch any missing cases. Logging is
throttled at 1 message per minute per VM.
k8s does not have a way to expose a global load balancer with TCP
endpoints, and setting up node port-based routing is a chore, even with
Terraform (which is what we did with the standalone proxy).
We will use Cloud DNS's geolocation routing policy to ensure that
clients connect to the endpoint closest to them.
- We never delete rows from DomainHistory (and even if we do in the
future, they'll be old / the references won't matter)
- This is likely creating lock contention when lots of requests come
through at once for domains with many DomainHistory entries
There are some breaking method changes in the 10.x.y versions and we're encountering exceptions when trying to run the flywayMigrate task thanks to those.
For now it only includes two options (domain deletion and domain
suspension). In the future, as necessary, we can add other actions but
this seems like a relatively simple starting point (actions like bulk
updates are much more conceptually complex).
This might help alleviate DB transaction contention on the PollMessage table. A
lower transaction isolation level is safe because acking a poll message is
idempotent: there are only two things it does, either delete a poll message or
take a recurring one from the past and set it to be a year in the future from
the date in the past. Both of these operations will always yield the same final
result even if executed multiple times simultaneously for some reason.
It doesn't need a higher transaction isolation level as it's only loading a given poll
message once, and we want to avoid putting any kind of locks on the PollMessage table
as it seems to be having contention issues. Note that the poll message request flow
is by far the most frequent code that touches the PollMessage table, as there are
many, many requests every minute from dozens of registrars, but far fewer poll
messages than that to actually ACK.
We only include the deletion time if the domain is in the 5-day
PENDING_DELETE period after the 30-day REDEMPTION period. For all other
domains, we just have an empty string for that field.
This is behind a feature flag so that we can control when it is enabled.
It was failing when any kind of transfer data was present, even completed
transfer data. Note that completed transfer data persists on a domain
indefinitely until/unless a new transfer is requested.
BUG= http://b/377328244
It seems like `/usr/bin/python` is no longer symlinked to the `python3`
binary in the `gcr.io/cloud-builders/git` image.
I've sent out a separate fix to upstream to change the shebang.
https://gerrit-review.git.corp.google.com/c/gcompute-tools/+/439501
But in the meantime, we need this temporary fix for the release to
build.
These flags are suggested by GitHub support to disable reusing caches
during Gradle build. They think that could fix the intermittent error
message:
```
Encountered a fatal error while running "/opt/hostedtoolcache/CodeQL/2.19.0/x64/codeql/codeql database finalize --finalize-dataset --threads=4 --ram=14576 --verbosity=progress++ /home/runner/work/_temp/codeql_databases/java". Exit code was 32 and last log line was: CodeQL detected code written in Java/Kotlin but could not process any of it. For more information, review our troubleshooting guide at https://gh.io/troubleshooting-code-scanning/no-source-code-seen-during-build . See the logs for more details.
```
This fails for me every day for some reason (starting about a month
ago). The same commit went through the workflow fine when the action was
triggered by a push.
I think there's no reason for us to have a cron run as the changes to the
master branch can only come from commit pushes.
Newer versions of GPG (v2.4.5 in my case) use different wording
than what's available in our build image (and in Ubuntu, I suspect). For
example, it says "rsa2048" instead of "2048-bit RSA".
Make the tests work in both cases. Admittedly we cannot check for the
string RSA/rsa easily, but I don't think it matters much for tests.
It looks like /usr/bin/python *may* no longer exist in the latest cloud
builder git image. I ran the latest image and logged into it to verify
that /usr/bin/python3 does exist on 9/25, and again on 9/26, when it
re-appeared.
I think it is generally a good idea to not rely on it being there going
forward.
* Include discount price in domain pricing
* Partial progress in logic
* Tests and logic passing
* Change pricing for multi year create
* Tests for discount pricing logic
* Token currency check
* Add some comments
* Java formatting
* Discount price to Optional
* Change discount price to be optional nullable
* Re-add deleted tests
Depending on whether a "--gke" parameter (which must be the last one) is
passed, the deployer constructs the corresponding URIs for GAE or GKE.
TESTED=Used the deployer to deploy tasks to alpha and verified that they
run on GKE.
* Drop the `transact` call in Id services
All usages already routed through `tm().allocateId()`, which is
guaranteed to be in a transaction.
* Addressing reviews
Several jars in our dependencies are now multi-release, including
dnsjava, snakeyaml, and a few more. Such jars include
JVM-version-specific classes that will only be loaded by a VM that can
handle them. All it takes is a new manifest attribute.
This change allows us to upgrade to dnsjava 3.6+: the base (Java 8) version of
this jar breaks Java 21. The correct manifest allows Java 21 to find the
classes it needs.
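The attribute in question is `Multi-Release: true` (JEP 238). In a Gradle build
it is a one-line manifest entry; shown here via `java.util.jar` purely for
illustration:

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

final class MultiReleaseManifest {
  static Manifest create() {
    Manifest manifest = new Manifest();
    Attributes main = manifest.getMainAttributes();
    main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    // Lets the JVM load the version-specific classes under
    // META-INF/versions/<N>/ when running on Java N or newer.
    main.putValue("Multi-Release", "true");
    return manifest;
  }
}
```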
This single metric currently accounts for 22.2% of our total metrics bill,
almost double the size of our EPP requests metric, while also being much less
useful. This change reduces the cardinality by removing two
parameters we don't care that much about, which should significantly reduce the
size and thus the cost. If after this change the metric is still too large, I'll
also then remove the matchCount parameter from this metric. We could possibly
even consider deleting the metric in its entirety, as we hardly ever use it.
This PR also removes unused code for premium list metrics that have never
actually been written out (and that we won't bother with at this point).
This will help us if/when we need to run the report generation multiple
times, or for past dates, when we don't want to send extra emails or
upload extra reports to ICANN.
Instead of having to parse the protoPayload.line from the request logs,
we just want to inspect the textPayload from the app logs (stored in a
separate table). This applies to the EPP metrics from the activity
reporting and the attempted-adds column for the transaction reporting.
This doesn't really add any tests, and we'll require many more additions
if we actually want to have full unit testing, but this at least makes
the tests pass when running `npm test`.
Previous PRs and token changes (see b/332928676) have made it so that
SPECIFIED renewalPriceBehavior tokens must have a renewal price. As
such, we can now use that renewalPrice when creating domains with
SPECIFIED tokens.
* Add QPS and incomplete connections metrics to load test client
* Add a failed request count
* Add todos
* Reuse contact
* Add bugs to todos
* small fix
* Clarify QPS
It's valid for the auth data to be null (although it only happens 10 times
across our entire registry), so the domain update flow should not fail out with
a NullPointerException when the existing state of the data is null and the
update isn't adding that data either.
BUG=http://b/359264787
This is the first step in the field removal (second will be removing the
column from SQL once this is deployed).
There's no point in using a UserDao versus just doing the standard
loading-from-DB that we do everywhere else. No need to special-case it.
When creating/deleting users, we need to add/remove the emails in
question to/from the console email group (if it exists). This used to be
done synchronously by calling the Groups API directly from the nomulus
tool. However #2488 made it so that in all cases where group membership
is modified, a Cloud Tasks task is created to execute the change on
the server side asynchronously (because there are multiple places where
this change needs to be done, and it is easier to make it all happen on the
server side).
Alas, as it turns out, Cloud Tasks tasks need to be created with a
service account's credential (which is trivially done on the server side
because the ADC is a service account). Nomulus command runs with a user
credential, and we need to grant the relevant user permission to
masquerade as a service account, in order to enqueue tasks from the
nomulus tool. It is therefore easier to just revert to the old behavior.
Originally, we thought that User entities were going to have mutable
email addresses, and thus would require a non-changing primary key. This
proved to not be the case. It'll simplify the User loading/saving code
if we just do everything by email address.
Obviously this doesn't change much functionality, but it prepares us for
removing the id field down the line once the changes propagate.
For instance, on sandbox this will allow us to remove our global roles
but keep our roles on the CharlestonRoad admin registrar. Then, when we view
the console, it will be as if we were a registrar user.
This is consistent with how other registries are handling RDAP and is also consistent
with overall behavior in WHOIS and domain info flows as implemented in my previous
PRs #2477 and #2490.
* Check FeatureFlag in domain flows before checking contacts
Check if phase 1 of the transition to the minimum registry dataset has begun, and if it has, do not require the presence of contacts in domain flows.
* Add tests
* Small test fixes
* rename flag
* Fix merge conflicts
* Change todo
* Add isActive methods
* Add javadocs
* small fix
As requested, for registrars participating in these tiered pricing
promos that wish to receive this type of response, we make the following
changes:
1. The pre-promotional (i.e. base tier) price is returned as the
standard domain-create fee when running a domain check.
2. The promotional (i.e. correct) price is returned as a special custom
command class with a name of "STANDARD PROMO" when running a domain
check.
3. Domain creates will return the non-promotional (i.e. incorrect) price
rather than the actual promotional price.
This PR does only number 3. See PR #2489 for the others.
As requested, for registrars participating in these tiered pricing promos
that wish to receive this type of response, we make the following
changes:
1. The non-promotional (i.e. incorrect) price is returned as the
standard domain-create fee when running a domain check.
2. The promotional (i.e. correct) price is returned as a special custom
command class with a name of "STANDARD PROMO" when running a domain
check.
3. Domain creates will return the non-promotional (i.e. incorrect) price
rather than the actual promotional price. This is not implemented in
this PR.
Our logs are getting gummed up with an indefinitely failing and retrying task to
re-save a prober domain that doesn't exist (likely because it was hard-deleted
by the delete prober data action). This makes the re-save action resilient to
that failure case: it stops assuming that every enqueued re-save actually
corresponds to an entity that exists, allowing it to fail permanently if
the entity doesn't exist. Failing permanently is the right thing to do, as if
the entity doesn't exist now there's no reason to think it will in the future,
plus all re-saves are optimistic rather than guaranteed anyway.
This should fix http://b/350530720
* Add registryTool commands for FeatureFlags
* Fix merge conflicts
* Add required parameters and inject mapper
* Use optionals in cache to negative-cache missing objects
* Fix spelling
* Change back to bulk load in cache
* Add FeatureName enum
* Change variable name
* Use FeatureName in main parameter
This doesn't yet allow them to be absent in EPP flows, but it should make the
code not break if they happen to be null in the database. This is a follow-up to
PR #2477, which ends up being a bit easier because whereas the registrant is
used in more parts of the codebase, the other contact types (admin, technical,
billing) are really only used in RDE, WHOIS, and RDAP, and because they were
already being used as a collection anyway, the handling for if that collection
contains fewer elements or is empty happened to already be mostly correct.
When using this token (which must be tied to a particular domain), the
first year price (and only the first year price, i.e. the creation
price) will be the standard price for this TLD. Future years (i.e.
renewals) will continue at the normal premium price.
This isn't the worst thing in the world but it does result in a bad
request to the server otherwise, and log/error spam. So, only load the
domains list if we have a registrar selected.
This is the first step in migrating to the minimum registration data set. Note
that our database model already permits null domain registrants, so this just
makes the code accept it as well. Note that I haven't changed any requirements
in EPP flows yet; a later step will be to check the migration schedule and then
not require the registrant to be present if in a suitable state.
This does potentially affect the output of WHOIS/RDAP, but that's a NOOP so long
as EPP commands and other tools continue to enforce the requirement of a
registrant.
This is an optional field (will be required when the renewal price
behavior is SPECIFIED). This will allow us to set arbitrary renewal
prices for domains as part of one-off negotiations.
https://b.corp.google.com/issues/332928676
This is the last remaining GAE API that we depend on. By removing it, we are able to remove all common GAE dependencies as well.
To merge this PR, we need to create console User objects that have the same email address as the RegistrarPoc objects' login_email_address and copy over the existing registry lock hashes and salts.
We are also able to simplify the code base by removing some redundant logic like AuthMethod (API is now the only supported one) and UserAuthInfo (console user is now the only supported one).
There are several behavioral changes that are worth noting:
The XsrfTokenManager now uses the console user's email address to mint and verify the token. Previously, only email addresses returned by the GAE Users service were used, whereas a blank email address would be used if the user was logged in as a console user. I believe this was an oversight that is now corrected.
The legacy console will return 401 when no user is logged in, instead of redirecting to the Users service login flow.
The logout URL in the legacy console is changed to use the IAP logout flow. It will clear the cookie and redirect the users to IAP login page (tested on QA).
The screenshot changes are mostly due to the console users lacking a display name and therefore showing the email address instead. Some changes are due to using the console user's email address as the registry lock email address, which is being fixed in "Add DB column for separate rlock email address" (#2413) and its follow-up PRs.
GKE now displays log messages correctly. There is no need for an extra
leading newline, which now results in a useless blank line for each log
entry in Log Explorer.
* Create a load testing EPP client.
This code is mostly based off of what was used for a past EPP load testing client that can be found in Google3 at https://source.corp.google.com/piper///depot/google3/experimental/users/jianglai/proxy/java/google/registry/proxy/client/
I modified the old client to be open-source friendly and use Gradle.
For now, this only performs a login and logout command; I will further expand on this in later PRs to add other EPP commands so that we can truly load test the system.
* Small changes
* Remove unnecessary build dep
* Add gradle build tasks
* Small fixes
* Add an instances setUp and cleanUp script
* More modifications to instance setup scripts
* change to ubuntu instance
* Add comment to make ssh work
For reasons unclear to me, if the stack trace is appended directly to
the message, the log entry will be lumped together with following logs
on GKE.
Also updated the GKE service account for Nomulus in the manifest so we
can use workload identity just for Nomulus, not other pods on the same
cluster.
There are a bunch of cases where we want common exception handling and
it's annoying to have to deal with the common "set failed response and
make sure to return" a bunch of times.
This allows us to break up request methods more easily, since we can now
often throw exceptions that will break all the way back up to
ConsoleApiAction. Previously, any error handling had to exist in the
primary handler method so it could return.
The user, on the front end, should not be required to specify whether or
not they're trying to verify a lock or an unlock. They should only need
the verification code. We can inspect the lock object itself (and the
domain in question) to see whether or not we're verifying a lock or an
unlock.
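A sketch of that inference (hypothetical accessor; the real RegistryLock model
may differ): the stored record itself tells us which action the verification
code belongs to.

```java
import java.util.Optional;
import org.joda.time.DateTime;

/** Sketch: verify using only the code; infer lock vs. unlock from the record. */
final class LockVerification {
  interface RegistryLock {
    Optional<DateTime> getUnlockRequestTime(); // hypothetical accessor
  }

  static boolean isUnlockVerification(RegistryLock lock) {
    // If an unlock was requested on this record, the pending action to verify
    // is the unlock; otherwise it is the original lock.
    return lock.getUnlockRequestTime().isPresent();
  }
}
```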
We've added the field in the database in a previous PR. This is only
used in the old console for now because the new console does not have
registry lock functionality yet
* Remove the createBillingCost field from Tld
* fix spacing
* Change field name of map
* Rename getter
* Fix formatting
* Fix todo
* unchange column name
* Add log traces to Nomulus service on GKE
Add request-scope log traces to Nomulus on GKE which, unlike
AppEngine and Cloud Run etc, does not generate traces for hosted
applications. This change only affects the GKE image. It does not affect
the AppEngine services.
Log traces are added to Nomulus-generated logs in request-processing
threads. Forked threads are not covered yet. The single relevant use
case (TimeLimiter) will be addressed in a followup PR.
The main change is in the logging configuration:
* Use gcp-cloud-logging's LoggingHandler
* Add gcp-cloud-logging's TraceLoggingEnhancer to the handler.
* Set a thread-local trace id through the TraceLoggingEnhancer in
ServletBase on request's entry and clear it on completion.
Also removed an unused class (`RequestLogId`).
* CR
* CR
This handles both GET and POST requests. For POST requests it doesn't
actually change anything about the domains because we will need to add a
verification action (this will be done in a future PR).
* Avoid contention over the RefreshDnsRequest table
This table can be small at times, when PSQL may use a table scan for
queries by key. At the SERIALIZABLE isolation level, updates to
unrelated rows may then conflict due to this `optimization`.
Lower the isolation level to repeatable read.
* Code review
This includes using the new switch format (though IntelliJ does not yet
understand patterns including default so those aren't used), multiline strings,
replacing some unnecessary type declarations with <>, converting some classes to
records, replacing some Guava predicates with native Java code, and some other
miscellaneous Code Inspection fixes.
This adds an index on transfer_billing_cancellation_id to Domain and superordinate_domain to Host. When tested on crash with the action limited to only delete 10,000 domains, before these indexes were added the action took about 2 hours to delete 10,000 domains. Once these indexes were added, the action was able to delete the 10,000 domains in a little under 2 minutes.
This also creates base classes for the objects contained within the
history classes, e.g. RegistrarBase. This is the same way that objects
stored in the HistoryEntry subclasses have base classes, e.g.
DomainBase.
We cannot rely on the user checking their login email, so we'll want to
send the emails to the other address if configured. This is already the
case in RegistrarPoc.
Use REPEATABLE READ for lock acquire/release operations to avoid conflicts
between locks.
PostgreSQL uses table scans on small tables, causing false sharing at
the SERIALIZABLE isolation level.
See b/333537928 for details.
Console users need IAP to inject the necessary OIDC tokens into their
request headers and therefore need to be bound to appropriate roles. Note
that in environments managed by latchkey, the bindings will need to be
present in latchkey config files as well, otherwise the changes made by
the nomulus tool will be reverted.
TESTED=ran the nomulus command against alpha and verified that the
bindings are created/removed upon console user creation/deletion.
The existing behavior was to ignore bad header names, in a way that was
counter-intuitive as a user of the Google Sheet. If a header name was bad (which
could just be someone accidentally changing it not realizing it needs to
correspond exactly to the name of the field on the Java object), then all of the
data in that column was just silently left as-is and never updated. This led to
gradually worsening sync and offset shift errors over time.
Now, it will write out an error message into every single cell in the bad
column, so it's clear that the column name is wrong and does not correspond to any
actual data in the DB.
BUG=http://b/332336068
Since this is a read-only action that tolerates staleness, locking is unnecessary.
This should help with the lock contention we are observing.
Also reduces the number of VM instances provisioned for BSA and increases
the idle timeout. This should reduce invocation delay. A longer delay may
cause AppEngine to return `Timeout` status to Cloud Scheduler even
though the cron job succeeds.
This also involved breaking out an improperly done assertThat() helper overload
method for JsonObjects into a proper Subject that doesn't further overload
assertThat().
* Check for missing BSA unblockable domains
All unblockable domains created before the last refresh run should be
reported as unblockable (registered).
All reserved domains that are not registered should be reported as
unblockable (reserved). Note that transient errors may be reported for
newly added reserved domains since we do not maintain update time for
when a reserved label is associated with a TLD. However, this scenario
is extremely rare in operations.
* Addressing review
* Add batching to DeleteProberDataAction
* Only get time once
* Add separate query for dry run
* Update queries to actually properly delete all the data
* Fix merge conflicts
* Add test for foreign key constraints
* Make transaction repeatable read
* Make queries to subtables native
* Add native query for GracePeriodHistory
* Kill job after 20 hours
* remove extra time check from read query
* Add index for domainRepoId to PollMessage and DomainHistoryHost
* Add flyway fix for Concurrent
* fix gradle.properties
* Modify lockfiles
* Update the release tool and add IF NOT EXISTS
* Test removing transactional lock from deploy script
* Add transactional lock flag to actual flyway commands in script
* Remove flag from info command
* Add configuration for integration test
* Verify unblockables are truly unblockable
Unblockable domains may become blockable due to deregistration or
removal from the reserved list. The BSA refresh job is responsible
for removing them from the database. This PR verifies that the refreshes
are correct.
Note that recent changes since the last refresh are not reflected in the
result, and inconsistencies due to recent deregistrations are ignored.
Changes in reserved status or IDN validity are not timestamped,
therefore we cannot ignore recent inconsistencies. However, these
changes are rare.
* Addressing code review
* Addressing code review
Upgrade to Jakarta EE 10 from Java EE 8 by mostly following the upgrade instructions. Only the servlet package is upgraded. Other Jakarta EE components (like the persistence package that Hibernate depends on) need to be upgraded separately.
TESTED=deployed and successfully communicated with the pubapi endpoint for web WHOIS.
Note that this currently requires packaging the App Engine runtime per the instructions here, due to GoogleCloudPlatform/appengine-java-standard#98. This PR will only be merged once the fix is deployed to production (https://rapid.corp.google.com/#/release/serverless_runtimes_run_java/java21_20240310_21_0).
We don't use the upload results feature (kokoro picks the results
artifacts directly and uploads them).
Keeping it around is a maintenance burden.
Also fixed a deprecation warning.
Labels that are not in any supported IDN are not added to the database.
Remove such labels from those loaded from the block list files before
comparing with the DB.
There is one remaining instance in JpaTransactionManagerImpl that cannot
be removed, because DetachingTypedQuery implements TypedQuery, which has
a method that expects java.util.Date.
Console tests fail for the files that are affected by the redesign. There's no point in fixing them here; I will re-enable the task after the console redesign PR is merged.
Note that Dagger currently doesn't work with the Jakarta namespace, and
we have to cap the jakarta inject package version below 2.0 so that it
still provides classes in the old namespace.
Removed the deprecation mark as it is natural to expose methods related
to a transaction like getting the entity manager or checking if one is
in a transaction through the transaction manager interface.
Also only caches/resets the original TM when in unit tests (TBH I'm not so sure
that even this is necessary, as we don't seem to call the tool from tests
that often. There is only ShellCommandTest that calls the run() function
in RegistryCli, and we could just put these tests in fragileTest and make
them run sequentially and fork every time to get around the issue with
interference).
The issue with caching is that it tries to first create the to-be-cached
TM, and when the environment given is prod/sandbox/... It will try to
retrieve SQL credentials from prod/sandbox/... secret manager. This
works fine locally as we all have access to prod/sandbox/..., but fails
in Cloud Build jobs such as sync-db-objects, where it provides its own
credential that has direct SQL access, but not access to
prod/sandbox/... secret manager.
TESTED=ran `./gradlew devTool --args="-e localhost generate_sql_er_diagram -o ../db/src/main/resources/sql/er_diagram"`
* Add a GitHub Action workflow
This allows us to create Gradle dependency graphs for Dependabot analysis (like the ones we already get for JavaScript dependencies).
* Update Java version
* Add build scan
* codeql 3
* run with gradle
* exclude jIFC
* build scan
* Finalize
Choose the default transaction manager based on RegistryEnvironment. This
makes it possible to run Nomulus on Jetty locally. Tested with the
following:
```bash
./gradlew :jetty:run -Penvironment=alpha
curl http://localhost:8080/beta.app
```
The docker image is also updated to take an argument that specifies the
environment. It runs locally as well but the container doesn't get
access to locally stored credentials, so it fails to initialize the
transaction manager.
* Add BSA validation job
Add the BsaValidateAction class with a first check (for inconsistency
between downloaded and persisted labels).
* Addressing comments
* Addressing reviews
The staging job runs at 9 AM on the 2nd day of each month, so we should set
the cursor to be after that time; otherwise we attempt to upload reports
on the 1st day of each month, before they are ready, causing an error
email to be sent to us.
Remove email classes that depend on AppEngine API. They have been
replaced by the gmail-based client.
Remove `EmailMessage.from` method, which is no longer used.
There is a fixed sender address for the entire domain, which is
set by the gmail client.
The configs remain to be cleaned up. There is a bug (b/279671974) that
tracks it.
This PR makes the runtime of most of our workload Java 21.
1. App Engine. Java 21 is in GA and it supports Java EE 8. I had to add
an environment variable so that we don't get
AppEngineCredentials by default (we have been using
ComputeEngineCredentials for a couple of years). The upgrade to the Java
21 runtime changed a system property that controls how Jetty logging
works, which also controls whether AppEngineCredentials is returned. Tested by
deploying to alpha.
2. Proxy base image upgraded to Java 21 (distroless still doesn't
support Java 21 and it looks like Temurin is the way to go
b/306728455). Tested by deploying to alpha.
3. Nomulus tool image upgrade to Temurin 21 as well. Tested locally.
4. Beam pipeline base image upgrade to Java 21. The JAVA21 flag is not
supported by gcloud yet, but specifying the image URL directly works
(and is supported). Tested by running in alpha.
5. Jetty base image upgraded to Java 21. Tested locally.
Make the necessary changes for the code base to compile with JDK 21.
Other changes:
1. Upgraded testcontainer version and the SQL image version (to be the
same as what we use in Cloud SQL). This led to some schema changes and
also changed the order of results in some test queries (for the
better I think, as the new order appears to be alphabetical).
2. Remove dependency on Truth8, which is deprecated.
3. Enable parallel Gradle task execution and greatly increased the
number of parallel tests in standardTest. Removed outcastTest.
This PR creates a unified RegistryServlet that will serve all
non-console traffic. It also creates a jetty subproject that allows one
to run Nomulus on top of a standard Jetty 12 runtime.
`./gradlew :jetty:stage` will create a jetty base folder at
`jetty/build/jetty-base` where one is able to spin up a local Nomulus server
by running the following command inside the folder:
```bash
java -jar ${JETTY_HOME}/start.jar
```
`JETTY_HOME` is a folder where the [Jetty runtime](https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-home/12.0.6/jetty-home-12.0.6.zip) is located.
This PR also adds a Gradle task to create a Nomulus image based on the
official Jetty image:
```bash
./gradlew :jetty:buildNomulusImage
```
This reverts commit 64f5971275.
The catch block is too broad, and most of the time the errors caught are
because `command.run()` failed, which has nothing to do with getting
the transaction manager. The `runCommand` method is already wrapped in a try
block that checks for `LoginRequiredException` and gives the appropriate
error message.
We need to re-assess the situation the next time we encounter a
login issue that did not trigger `LoginRequiredException`. A blanket try
catch block is not the solution and only makes the situation more
confusing.
* Add a check so newly saved createCostTransitions get recognized and saved to the database
* Fix equals check
* Rename equals method
* Add comment explaining need for createBillingCostTransitionEqualCheck
* Remove use of shouldPublishField from ReservedList
* Remove from tests
* Update test comment
* Fix indentation
* fix test comment
* Fix test
* fix test
* Make shouldPublish column nullable
We introduced Scrypt as the default password hashing algorithm in
November 2023 and have been auto-converting saved hashes whenever a
successful EPP login or registry lock/unlock request is processed.
We will send comms to registrars to inform them of the upcoming removal of
SHA256 support and urge them to log in at least once before the change.
Otherwise, they will need to contact support to reset the password out of
band after the change.
This PR will NOT be submitted until comms are out and the effective date
is immediate.
Co-authored-by: Weimin Yu <weiminyu@google.com>
This makes the callsites look neater, as the work to execute is often a
many-line lambda, whereas the transaction isolation level is no more than a
couple dozen characters.
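That is, a callsite reads roughly like this sketch (assumed signature shape;
illustrative, not the exact API), with the short isolation argument up front
instead of trailing the long lambda:

```java
import java.util.function.Supplier;

final class TransactDemo {
  enum TransactionIsolationLevel {
    TRANSACTION_REPEATABLE_READ,
    TRANSACTION_SERIALIZABLE
  }

  // Assumed shape: the short isolation argument comes first.
  static <T> T transact(TransactionIsolationLevel isolationLevel, Supplier<T> work) {
    // Stand-in: begin a transaction at isolationLevel, run work, commit.
    return work.get();
  }

  static void example() {
    transact(
        TransactionIsolationLevel.TRANSACTION_REPEATABLE_READ,
        () -> {
          // ... many lines of transactional work follow the isolation level
          // instead of burying it at the end of the call ...
          return null;
        });
  }
}
```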
The Distinct transform removes duplicates based on the serialized format
of the elements. By providing a deterministic coder, we can guarantee
that no duplicates exist.
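In Beam terms, that looks roughly like this sketch: fix the input's coder to a
deterministic one before applying `Distinct`, since `Distinct` keys on encoded
bytes.

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.PCollection;

final class Dedupe {
  /** Deduplicates a PCollection whose elements encode deterministically. */
  static PCollection<String> dedupe(PCollection<String> input) {
    // Distinct compares elements by their encoded form, so a deterministic
    // coder guarantees that equal elements collapse to one.
    return input.setCoder(StringUtf8Coder.of()).apply(Distinct.create());
  }
}
```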
This is the start of a long and low priority migration, but for now I wanted to do a couple of them just to see what it looks like.
This also demonstrates the pattern for use of an @AutoBuilder to replace an @AutoValue.Builder. See https://github.com/google/auto/blob/main/value/userguide/records.md#builders for full details on that.
This removes a dependency on the App Engine SDK. It also looks like
(from the logs, at least) that shutdown hooks registered the old way stopped
working after the runtime was upgraded to Java 17.
Also removed some random leftover dependencies on the App Engine SDK
that are not needed any more.
* Add java changes for createBillingCostTransitions
* Add negative cost test
* Remove default value
* remove unused variable
* Add check that create cost and transitions map are the same
* inject clock, only use key set when checking for missing fields
* Add test for removing map
* Add ALLOW_BSA allocation type
Add a new type to allow creation of domains blocked by BSA.
Except for the BSA semantics, the new type behaves exactly
like SINGLE_USE.
* Addressing reviews
* Addressing review
* Change tld-update to db-object-updater
* rename sync_tlds.sh to sync_db_objects.sh
* Change to configured command name
* Change environment to sandbox explicitly for testing on alpha
* Add remaining object steps and change cloudbuild-tld-sync to cloudbuild-sync-db-objects
* Add build_environment flag
* Change order of command and directory
* Uncomment out reserved list part
Query size is borderline too large for the replica.
At 50000, about 2 out of 30 took more than 30 seconds and were retried.
Lower to 40000 and we will keep monitoring the executions.
The fallback should only apply on creates, not on updates, otherwise it can
override an existing value for the email address when only the referral email
should be what's updated.
This fixes a bug introduced back in commit 0ead4f8d9d.
BUG= http://b/322026165
* Stop saving BSA empty refresh changes
We thought that, as a way to verify that the refresh job is running, browsing
the GCS bucket with empty files would be easier than querying the DB or going
to the GCP logging dashboard, but there are too many of them to be useful.
* Add some updates to UpdateReservedListCommand to facilitate internal config presubmits and syncing
Added a dry-run tag for presubmit tests
Added early exit behavior when there are no new changes to the list
Added a new --build_environment tag to be used to indicate command runs from build tools. This tag was also added to UpdatePremiumListCommand. Once this new tag is deployed, and break glass behavior is added, these commands will be modified to prevent runs on the command line in the production environment unless the --build_environment or --break_glass flag is used.
* Fix capitalization
* Added in commented out production environment check for buildEnv flag
The blocking issue is fixed in
https://github.com/google/nomulus/pull/2224.
Java 8 support is being deprecated on 2024-01-31 and no further deployment is
possible afterwards without exception:
https://cloud.google.com/appengine/docs/legacy/standard/java/deprecations
We have been using Java 17 on alpha/crash/qa for several months and have
not observed any other blocking issues other than possibly missing email
attachments, which is being mitigated by including a link to the
attachments saved in GCS.
* Fix BsaRefreshAction bugs
Added functional tests for BsaRefreshAction, which checks for changes in
domain registration and reservation and applies them to the Unblockable
domain list.
Fixed a few bugs exposed by the tests.
Also refactored a few other tests.
This also moves it back to the replica transaction manager now that it shouldn't be timing
out its queries.
And this adds a test as well (more to come!).
* Define the --end_breakglass and --build_environment flags
It is necessary to define these flags in a deployment before merging go/r3pr/2273 in order to prevent breaking the existing TLD syncing and entity presubmit testing that has already been enabled
* make break glass 2 words
* Change break_glass flag to take a Boolean and use false value to end break glass mode
* small fixes
* Fix spacing
* Add missing G
* Add clarifying comment
* Refactor a few BSA unit tests
Added a few helpers for managing reserved list in tests and updated the
tests to use them.
Also fixed a bug: when querying for newly created domains, the query
should be restricted to bsa-enrolled tlds.
* Remove create_tld and update_tld commands
These commands are no longer necessary now that configure_tld command is available. However, the configure_tld command should only be used for crash, QA, and alpha environments. TLDs in production and sandbox must be modified using modifications to their config files in Gerrit unless using the configure_tld command in breakglass mode. Check the "How to configure TLDs" procedure doc for more info.
* re-delete file
- Registered names are not affected.
- Reserved names are not affected.
- Names that are none of the above and match some BSA labels are
reported as blocked.
Also fixes the action taken in the case where zero unavailable domains are
found, and temporarily changes over to using the primary DB (because the replica
transaction was timing out at 30 seconds on large databases). I'll switch this
over to use batching and move it back to replica afterwards, but this should
unblock us temporarily.
This PR makes it possible to build the Nomulus code base using Java 17.
Building with Java 11 continues to be possible and the resulting bytecode is
still at the Java 8 level. Also upgraded Gradle to 8.5.
There are several necessary changes to make this happen:
1. Some Gradle plugins need to be upgraded to support Java 17, notably
errorprone. As a result, a lot more "errors" were caught and corrected.
2. All test code is now built and run at the Java 8 level. Previously it was left
undefined (which defaults to the version of the compiler) and had led to
situations where we inadvertently called Java 8+ features in production that
are not caught by tests. The change also made the java8compatibility subproject
obsolete, which is therefore removed.
3. Removed the docs subproject. Its main use is to generate flows.md, but it
relies heavily on Java internal APIs that have changed significantly with each
version. Upgrading to Java 11 required extensive refactoring of the code there,
and Java 17 again removed many APIs that were used. I don't think it is worth
the maintenance effort just to have a tool to generate flows.md which no one
actually reads.
4. Capped a few GCP dependencies because the latest version depends on
grpc-java >= 1.59.0, which includes a runtime incompatibility
(https://github.com/grpc/grpc-java/releases/tag/v1.59.0).
* Check BSA block status in CheckApi
Checks for and reports BSA block status if the name is not registered or
reserved.
Also moves CheckApiActionTest to standardTest. Whatever problem forced
it into another suite has apparently disappeared.
Supports the full blocklist download cycle (download, diffing, diff-apply, and order-status reporting) and the refreshing of unblockable domains.
Submitted due to tight deadline. We will conduct post-submit review and refactoring.
SCRYPT is computationally much heavier than SHA256 (by design), which
resulted in test run time doubling due to most tests initializing canned
data that uses hashing.
Since our tests are not verifying the correctness of a specific hashing
algorithm anyway, this PR makes it so that simple concatenation is used
in tests.
Also moved RegistryEnvironment to the util subproject so it can be called by
PasswordUtils, which makes sense as it is a utility class.
* Add BigInt conversion to TimedTransitionProperty<Money> deserializer to handle JPY currency
* Remove unnecessary lines in test
* Add eap schedule check
* Don't use raw LinkedHashMap type
* add timezone
From our investigation, the Monday night WHOIS storm does not cause any
strain to the backend system. The backend latency metrics are all well within
the limits. The latency measured from the proxy matches observed latency
by the prober, and we see that the "used" CPU is 1.5x of "requested" CPU
during the time when the latency is above the threshold.
Making this change hopefully removes the proxy as the bottleneck and
ameliorates the pages.
Add the BsaDomainRefresh class, which tracks the refresh actions.
The refresh action checks for changes in the set of registered and
reserved domains, which BSA calls unblockables.
Currently, a verify action is enqueued every time the upload method
succeeds. Because the upload job is wrapped in a transaction, the
same task will be enqueued again if the transaction retries.
We cannot move the upload method outside the transaction because the
read-upload-write logic needs to be atomic, and the upload part itself
is idempotent (therefore retriable). We can, however, move the
enqueuing part outside the transaction, as we only need to enqueue the
verify task once the transaction succeeds. This should fix the issue
where multiple verify jobs try to hit the same marksdb endpoints,
resulting in 429 (Too Many Requests) errors.
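A sketch of the reordering, with hypothetical names (the real code lives in the BSA upload flow); the key point is that the idempotent read-upload-write work stays inside the retriable transaction while the enqueue runs only after the commit:

```java
final class UploadScheduler {
  interface TransactionManager {
    void transact(Runnable work); // may run `work` multiple times on retry
  }

  private final TransactionManager tm;
  private final Runnable readUploadWrite;   // idempotent, safe to retry
  private final Runnable enqueueVerifyTask; // must not be duplicated

  UploadScheduler(
      TransactionManager tm, Runnable readUploadWrite, Runnable enqueueVerifyTask) {
    this.tm = tm;
    this.readUploadWrite = readUploadWrite;
    this.enqueueVerifyTask = enqueueVerifyTask;
  }

  void run() {
    tm.transact(readUploadWrite);
    // Only reached after a successful commit, so transaction retries can no
    // longer produce duplicate verify tasks.
    enqueueVerifyTask.run();
  }
}
```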
* Add a dryrun tag to UpdatePremiumListCommand and early exit command if no new changes to the list
* Change prompt string when no change to list to reflect that there is no actual prompted user input
* Add camelCase and correct flag name
This might be the cause of the SQL performance degradation that we are
observing during the recent launch. The change went in a month ago, but
there hadn't been enough of an increase in mutating traffic to make it
problematic until the launch.
Note that presubmits should run faster too with this change, which
serves as evidence that excessive logging is the culprit.
Add the BsaDomainRefresh table, which tracks the refresh actions.
The refresh action checks for changes in the set of registered and
reserved domains, which BSA calls unblockables.
This doesn't fix any issues with dead/livelocks when deleting or
updating allocation tokens, but it at least will significantly reduce
the time to load the tokens that we'll want to update/delete.
The code as previously written assumed that creation fees would be the
same as renewal fees -- this is not the case for anchor tenants, where
the renewal fee is always the standard cost for the TLD (instead of any
premium cost). This was already handled properly in the actual billing
implementation, but we didn't tell the user the right renewal cost in
domain checks.
This also removes some warning logs related to nested transactions.
We shouldn't have to parse through every single entry to see what
changed.
Note: we don't do this for premium lists because those can be HUGE and
we don't want/need to load and display every entry. This was an explicit
choice made in https://github.com/google/nomulus/pull/1482
Since the replica SQL instance is read-only, any transaction performed
on it should be explicitly read-only, which would allow PostgreSQL to
optimize away (some) use of predicate locks.
Also changed the EPP cache to read from the replica. The foreign key
cache already behaves this way.
See: https://www.postgresql.org/docs/current/transaction-iso.html
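A minimal JDBC sketch of the idea (the real code goes through the JPA transaction manager; the replica URL and query here are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class ReplicaReadExample {
  static void readFromReplica(String replicaJdbcUrl) throws SQLException {
    try (Connection conn = DriverManager.getConnection(replicaJdbcUrl)) {
      // On the PostgreSQL driver this is typically translated into read-only
      // transactions, letting the server skip predicate-lock bookkeeping.
      conn.setReadOnly(true);
      conn.setAutoCommit(false);
      try (Statement stmt = conn.createStatement();
          ResultSet rs =
              stmt.executeQuery("SELECT domain_name FROM \"Domain\" LIMIT 10")) {
        while (rs.next()) {
          System.out.println(rs.getString(1));
        }
      }
      conn.commit();
    }
  }
}
```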
For reasons unclear at this point, Java 17's servlet implementation on
GAE injects IP addresses (including unroutable private IPs) into the
standard X-Forwarded-For header, which we currently use to embed
registrar IP addresses to check against the allow list. This results in
the server not properly parsing the header and rejecting legitimate
connections.
This PR sets a custom header that should not be interfered with by any
JVM implementation to store the IP address, while maintaining the old
header as a fallback. The proxy will set both headers to allow the
server to gracefully migrate from Java 8 and Java 17 (and potentially
rollback).
Also removed some headers and logic that are not used.
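A sketch of the fallback logic (the custom header name is hypothetical):

```java
import javax.servlet.http.HttpServletRequest;

final class ClientIpExtractor {
  private static final String CUSTOM_IP_HEADER = "Proxied-Client-Ip"; // assumed name
  private static final String FALLBACK_IP_HEADER = "X-Forwarded-For";

  static String extractClientIp(HttpServletRequest request) {
    // Prefer the custom header, which no JVM/servlet implementation rewrites.
    String ip = request.getHeader(CUSTOM_IP_HEADER);
    if (ip == null || ip.isEmpty()) {
      // Fall back to X-Forwarded-For during the Java 8 -> 17 migration; take
      // the first entry, as proxies append to this comma-separated list.
      String forwardedFor = request.getHeader(FALLBACK_IP_HEADER);
      if (forwardedFor != null) {
        ip = forwardedFor.split(",")[0].trim();
      }
    }
    return ip;
  }
}
```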
This does not include any styling for now; I just wanted to make sure
we're all good with regard to the basic approach. I'm open to suggestions on
which columns to include.
Note: filter searching is not implemented yet because the backend does
not allow for it (yet)
Java 17 injects unexpected headers into X-Forwarded-For, which causes
issues with validating incoming IP addresses.
This is a partial reversion of #2201. We are still keeping Java 17 in other environments, but sandbox and production need to be able to parse the header to accept incoming EPP connections from registrars. Once we fix it we will re-enable Java 17 in these environments.
This allows us to switch the proxy to a different client ID without
disrupting the service. This is a temporary measure and will be removed
once the switch is complete.
Also adds a placeholder getter in the Tld class, so that it can be
mocked/spied in tests. This way more BSA related code can be submitted
before the schema is deployed to prod.
This also adds a targeted QPS as a parameter in case we need to manually bump it
up (or down) for some reason without having to make code changes and re-deploy.
This PR makes a few changes to make it possible to turn on
per-transaction isolation level with minimal disruption:
1) Changed the signatures of the transact() and reTransact() methods to allow
passing in lambdas that throw checked exceptions. Previously one always
had to wrap such lambdas in try-and-rethrow blocks, which wasn't a
big issue when one could liberally open nested transactions around small
lambdas and keep the "throwing" part outside the lambda. It becomes a
much bigger hassle when the goal is to eliminate nested transactions and
put as much code as possible within the top-level lambda. As a result,
the transactNoRetry() method now handles checked exceptions by rethrowing
them as runtime exceptions.
2) Changed the name and meaning of the config file field that used to
indicate whether the per-transaction isolation level is enabled. Now it
decides whether to throw or to log when transact() is called within a
transaction, regardless of whether the transaction could have
succeeded based on the isolation override (if provided). The
flag will initially be set to false and will help us identify all
instances of nested calls and either refactor them or use reTransact()
instead. Once we are fairly certain that no nested calls to transact()
exist, we flip the flag to true and start enforcing this logic.
Eventually the flag will go away and nested calls to transact() will
always throw.
3) The per-transaction isolation level will now always be applied if an
override is provided. Because currently there should be no actual
use of this feature (except for places where we explicitly use an
override and have ensured no nested transactions exist, like in
RefreshDnsForAllDomainsAction), we do not expect any issues with
conflicting isolation levels, which would result in failure.
4) transactNoRetry() is made package private and removed from the
exposed API of JpaTransactionManager. This saves a lot of redundant
methods that do not have a practical use. The only place this
method was called outside the package was in the reader of
RegistryJpaIO, which should have no problem with retrying.
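A minimal sketch of point 1 (names are illustrative, not the actual JpaTransactionManager API): a functional interface that permits checked exceptions, with the wrapper rethrowing them as unchecked.

```java
final class Transactor {
  @FunctionalInterface
  interface ThrowingSupplier<T> {
    T get() throws Exception;
  }

  <T> T transact(ThrowingSupplier<T> work) {
    try {
      // Real code would open the transaction, run the work, commit, and retry
      // on transient failures; all of that is elided here.
      return work.get();
    } catch (RuntimeException e) {
      throw e;
    } catch (Exception e) {
      // Checked exceptions surface as runtime exceptions, so callers no longer
      // need try-and-rethrow blocks around small lambdas.
      throw new RuntimeException(e);
    }
  }
}
```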
We have been using SHA256 to hash passwords (for both EPP and registry lock),
which is now considered too weak.
This PR switches to using Scrypt, a memory-hard slow hash function, with
recommended parameters per go/crypto-password-hash.
To ease the transition, when a password is being verified, both Scrypt
and SHA256 are tried. If SHA256 verification succeeds, we re-hash
the verified password with Scrypt and replace the stored SHA256 hash
with the new one. This way, as long as a user uses the password once
before the transition period ends (when Scrypt becomes the only valid
algorithm), no manual intervention is needed from them.
We will send out notifications to users reminding them of the transition
and urging them to use the password (which should not be a problem for
EPP, but is less certain for registry lock). After the transition, an
out-of-band reset for EPP passwords, or a remove-and-add on the console for
registry lock passwords, will be required for hashes that have not
been re-saved.
Note that the re-save logic is not present for console users' registry
lock passwords, as there is no production data for console users yet.
Only legacy GAE users' passwords require re-save.
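A sketch of the verify-then-rehash flow, using Bouncy Castle's SCrypt implementation; the parameters and the persist callback are illustrative, not the actual Nomulus values:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Consumer;
import org.bouncycastle.crypto.generators.SCrypt;

final class PasswordMigrator {
  private static final int N = 32768, R = 8, P = 1, DK_LEN = 32; // assumed params

  static boolean verifyAndMigrate(
      String password, byte[] salt, byte[] storedHash, Consumer<byte[]> persist)
      throws NoSuchAlgorithmException {
    byte[] pw = password.getBytes(StandardCharsets.UTF_8);
    byte[] scryptHash = SCrypt.generate(pw, salt, N, R, P, DK_LEN);
    if (MessageDigest.isEqual(scryptHash, storedHash)) {
      return true; // already migrated
    }
    // Legacy path: salted SHA-256.
    MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
    sha256.update(salt);
    byte[] legacyHash = sha256.digest(pw);
    if (MessageDigest.isEqual(legacyHash, storedHash)) {
      persist.accept(scryptHash); // transparently upgrade the stored hash
      return true;
    }
    return false;
  }
}
```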
The domains (and their associated billing recurrences) were already being loaded
to check restores, but they also now need to be loaded for renews and transfers
as well, as the billing renewal behavior on the recurrence could be modifying
the relevant renew price that should be shown. (The renew price is used for
transfers as well.)
See https://buganizer.corp.google.com/issues/306212810
Both actions have not been used for a while (the wipe out action
actually caused problems when it ran unintentionally and wiped out QA).
Keeping them around is a burden when refactoring efforts have to take
them into consideration.
It is always possible to resurrect them from git history should the need
arise.
We finally fixed Spinnaker (I hope) to deploy bundled services with Java
17 runtime. Note that the bytecodes are still targeting Java 8. The only
change this PR introduces is to switch the runtime environment to Java
17.
TESTED=deployed to crash.
This was already set to true in all environments except prod last week. Now that the release has gone out and we have not seen any issues, we should feel safe turning this on in production as well.
We previously needed to use URLFetch in some instances where TLS 1.3 is
required (mostly when connecting to ICANN servers), and the JDK-bundled SSL
engine that came with the App Engine runtime did not support TLS 1.3.
It appears now that the Java 8 runtime on App Engine supports TLS 1.3
out of the box, which allows us to get rid of URLFetch, which depends on
App Engine APIs.
Also removed some redundant retry and logging logic, now that we know
the HTTP client behaves correctly.
TESTED=modified the CannedScriptExecutionAction and deployed to alpha, used the
new HTTP client to connect to the three URL endpoints that were
problematic before and confirmed that TLS connections can be established. HTTP
sessions were rejected in some cases when authentication failed, but
that was expected.
* Add a cloudbuild-tld-sync job
This job checks the TLD config files in the internal repo and syncs them with the actual Tld objects in the database using the configure_tld nomulus command.
* Add the dockerfile and shell script
* Force the command
* Add comments
* add newline
* Create a separate copy of the job for each environment
* fix file name
* Fix indentation
* Lower the isolation level for RefreshDnsForAllDomainsAction
This lowers the isolation level to TRANSACTION_REPEATABLE_READ, which will hopefully allow the entire action to run without timing out on our larger TLDs.
* Unchange default config
Cache loads will likely always be inner transactions, if they are in a transaction at all. A cache load does not always open a transaction, since it only happens when the cache is not fresh at the time it is called. Since the cache itself needs to decide whether a DB transaction is necessary, it should use the reTransact method to safely indicate that the isolation level of the outer transaction (if any) is the one that should be used.
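A sketch of the pattern (names illustrative): the cache's loader wraps its query in reTransact(), which joins an already-open transaction as a no-op wrapper, or opens a fresh one when called outside any transaction.

```java
import java.util.function.Supplier;

final class TldCacheLoader {
  interface Tm {
    <T> T reTransact(Supplier<T> work);
  }

  private final Tm tm;

  TldCacheLoader(Tm tm) {
    this.tm = tm;
  }

  /** Invoked only on a cache miss; may or may not run inside an outer transaction. */
  String loadCanonicalName(String tldName) {
    return tm.reTransact(() -> queryCanonicalName(tldName));
  }

  private String queryCanonicalName(String tldName) {
    return tldName; // placeholder for the real database query
  }
}
```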
If the cert(s) are invalid or expired that's a problem, but that
shouldn't necessarily prevent us from changing other things. If we're
not changing the certs, leave them alone.
If we don't explicitly handle random unexpected exceptions, the error
that the front end receives includes a big ole stacktrace, which is
unhelpful for regular users and possibly bad to expose. Instead, we
should provide a vague "something went wrong" message.
Separately, we can create a default set of SnackBar options and use that (we
want it longer than 1.5 seconds because that's pretty short).
Also did some refactoring of various Auth-related classes to clean things up a bit and make the logic less convoluted:
1. In Auth, removed AUTH_API_PUBLIC, as it was only used by the WHOIS and EPP endpoints accessed by the proxy. Previously, the proxy relied on OAuth, and its service account was not given the admin role (in OAuth parlance), so we made those endpoints accessible to a public user, deferring authorization to the actions themselves. In practice, OAuth checks for allowlisted client IDs, and only the proxy client ID was allowlisted, which effectively limited access to the proxy anyway.
2. In AuthResult, expose the service account email if it is at APP level. RequestAuthenticator will print out the auth result and therefore log the email, making it easy to identify which account was used. This field is mutually exclusive with the user auth info field. As a result, the factory methods are refactored to explicitly create either an APP- or USER-level auth result.
3. Completely rewrote RequestAuthenticatorTest. Previously, the test mingled testing the functionality of the target class with testing how various authentication mechanisms work. Now they are cleanly decoupled, and each method in RequestAuthenticator is tested individually.
4. Removed nomulus-config-production-sample.yaml as it is vastly out of date.
Surround the dot in unsafe domain names with square brackets. This
is suggested by Gmail abuse detection and allows outgoing messages
to pass Gmail's check. This should also help with recipients' checks.
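The transformation itself is tiny; a sketch (class name hypothetical):

```java
final class EmailDefanger {
  /** Turns "bad-example.test" into "bad-example[.]test" for outgoing mail. */
  static String defang(String domainName) {
    return domainName.replace(".", "[.]");
  }
}
```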
Adds a delay between emails sent in a tight loop. This helps avoid
triggering Gmail abuse detections.
Also updated the recipient address for billing alerts.
* Add custom YAML serializer for Duration
This addresses b/301119144. It changes the YAML representation of a TLD to show Duration fields as a String representation using the Duration object's toString() format. This eliminates the previous ambiguity over the time unit being used for each duration.
* change standardSeconds to standardMinutes in test
* Add custom serializer to the entire mapper
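A sketch of such a serializer, assuming Joda-Time's Duration (which the standardSeconds/standardMinutes test change above suggests; java.time.Duration would serialize the same way). It emits the ISO-8601 toString() form, e.g. "PT1800S" for 30 minutes:

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;
import java.io.IOException;
import org.joda.time.Duration;

/** Writes Durations as their unambiguous ISO-8601 string form. */
class DurationToStringSerializer extends JsonSerializer<Duration> {
  @Override
  public void serialize(Duration value, JsonGenerator gen, SerializerProvider serializers)
      throws IOException {
    gen.writeString(value.toString());
  }

  /** Registers the serializer on a YAML-producing ObjectMapper. */
  static ObjectMapper yamlMapper() {
    return new ObjectMapper(new YAMLFactory())
        .registerModule(
            new SimpleModule().addSerializer(Duration.class, new DurationToStringSerializer()));
  }
}
```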
Unless the --oauth flag is used, the nomulus tool will only send the OIDC
header. The server still accepts both headers, and the user should use the
`create_user` command to create an admin User (with the --oauth flag on), which
will then allow one to use the nomulus tool without the --oauth flag.
The --oauth flag and the server's ability to support OAuth-based
authentication will be removed soon. Users are urged to create the User
object in time to avoid service interruption.
TESTED=verified on alpha.
We want to do this because it takes the user to an external site, which
could potentially lead to confusion if they tried to use the back button
without a new tab.
Add documentation that describes the current Cloud Build status notification
to Google Chat, as well as how to update the configuration and the
notifier service.
This isn't the prettiest thing, but it replicates the type of view /
edit functionality that we had in the original console.
Of note: this doesn't include input field validation, which would
probably be a good idea to add at some point.
It will call equalsImmutableObject(), which seems like the right thing to do.
We only care if the two Tld objects have the same fields, not if they
are the same object. ErrorProne complained about comparison by identity.
* Defend against deserialization-based attacks
Added the `SafeObjectInputStream` class that defends against attacks using
malformed serialized data, including remote code execution and
denial-of-service attacks.
Started using the new class to handle EPP resource VKeys and
PendingDeposits, which are passed across credential boundaries: between
TaskQueue and the AppEngine server, and between the AppEngine server and the RDE
pipeline on GCE. Note that the wire format of VKeys does not change,
so existing tasks sitting in the TaskQueue are not affected.
Also removed an unused class: JaxbFragment.
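The core of such a defense is refusing to resolve unexpected classes, which blocks gadget-chain remote-code-execution payloads. A sketch of an allowlisting stream (the real SafeObjectInputStream may differ):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;
import java.util.Set;

class AllowlistingObjectInputStream extends ObjectInputStream {
  private final Set<String> allowedClassNames;

  AllowlistingObjectInputStream(InputStream in, Set<String> allowedClassNames)
      throws IOException {
    super(in);
    this.allowedClassNames = allowedClassNames;
  }

  @Override
  protected Class<?> resolveClass(ObjectStreamClass desc)
      throws IOException, ClassNotFoundException {
    if (!allowedClassNames.contains(desc.getName())) {
      throw new InvalidClassException(desc.getName(), "Class not allowed for deserialization");
    }
    return super.resolveClass(desc);
  }
}
```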
This token is only ever used for logging. The GAE OAuth service will
parse the header directly when called to retrieve the current user and
user ID. Logging it in prod could be a security risk if the logs are
leaked.
Previously this didn't properly deal with nested routes, e.g.
"settings/whois". It tried to just pass "whois" as the next URL, which
doesn't work with the router because it's nested under the settings route.
Using all parts of the URL allows us to handle the nesting.
* Add a configureTld command that uses YAML
* Add more tests and edge case handling
* Add out of order test and fix wrong inject
* small changes
* Add check for ascii
* Add check for ROID suffix
Hibernate logs certain information at the ERROR level, which for the
purpose of troubleshooting is misleading, since most affected operations
succeed after retry. ERROR-level logging should only be added by Nomulus
code.
This PR does two things:
1. Disables all logging in two Hibernate classes: we cannot disable
logging at a finer granularity, and we cannot preserve lower-level
logging while disabling ERROR.
2. Adds a DatabaseException which captures all error details that may
escape the typical loggers' attention: SQLException instances can be
chained in a different way from Throwable's `getCause()` method.
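The detail in point 2 is that SQLExceptions chain through getNextException() in addition to getCause(), so a wrapper that only walks causes can miss the actual error. A sketch:

```java
import java.sql.SQLException;

class DatabaseExceptionExample {
  /** Collects messages from the whole getNextException() chain. */
  static String describe(SQLException e) {
    StringBuilder sb = new StringBuilder();
    for (SQLException next = e; next != null; next = next.getNextException()) {
      sb.append(next.getMessage()).append('\n');
    }
    return sb.toString();
  }
}
```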
The old semantics for TransactionalFlow meant "anything that needs to mutate the
database", but then FlowRunner was not creating transactions for
non-transactional flows even though nearly every flow needs a transaction (as
nearly every flow needs to hit the database for some purpose). So now
TransactionalFlow simply means "any flow that needs the database", and
MutatingFlow means "a flow that mutates the database". In the future we will
have FlowRunner use a read-only transaction for TransactionalFlow and then a
normal writes-allowed transaction for MutatingFlow. That is a TODO.
This also fixes up some transact() calls inside caches to be reTransact(), as we
rightly can't move the transaction outside them: at some callsites we
legitimately do not know whether a transaction will be needed at all (depending
on whether said data is already in memory). And it removes the replicaTm() calls
which weren't actually doing anything, as they were always nested inside normal
tm()s, thus causing confusion.
In the future, reTransact() will be the only way to initiate a transaction that
doesn't fail when called inside an outer wrapping transaction (when wrapped,
it's a no-op). It should be used sparingly, with a preference towards
refactoring the code to move transactions outwards (which this PR also
contains).
Note that this PR includes some potential efficiency gains from fixing existing
poor use of transactions. E.g. in RefreshDnsAction, the existing code
was using two separate transactions to refresh the DNS for domains and hosts
(one is hidden in loadAndVerifyExistence()), whereas as of this PR it has a
single wrapping transaction to do so.
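A sketch of the flow hierarchy described above; FlowRunner picks the transaction type from the marker interface (names follow the PR, method bodies are placeholders):

```java
interface Flow {}

/** Any flow that needs the database (i.e. nearly all of them). */
interface TransactionalFlow extends Flow {}

/** A flow that mutates the database. */
interface MutatingFlow extends TransactionalFlow {}

final class FlowRunnerSketch {
  void run(Flow flow, Runnable body) {
    if (flow instanceof MutatingFlow) {
      transact(body);           // normal writes-allowed transaction
    } else if (flow instanceof TransactionalFlow) {
      transactReadOnly(body);   // the future read-only transaction (the TODO)
    } else {
      body.run();               // no database access at all
    }
  }

  private void transact(Runnable r) { r.run(); }          // placeholder
  private void transactReadOnly(Runnable r) { r.run(); }  // placeholder
}
```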
It turns out that disallowing all nested transactions is impractical. So
in this PR we make it possible to run nested transactions (which are not
really nested as far as SQL is concerned, but rather lexically nested
calls to tm().transact(), which will NOT open new transactions when
called within a transaction) as long as there is no conflict between the
isolation levels specified for the enclosing and the enclosed
transactions.
Note that this will change the behavior of calling tm().transact() with
no isolation level override, or a null override, INSIDE a transaction.
The lack of an override will allow the nested transaction to run at
whatever level the enclosing transaction runs at, instead of at the
default level specified in the config file.
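A minimal sketch of the compatibility check (names and enum values illustrative): a nested call with no override joins the enclosing transaction at whatever level it runs; an explicit conflicting override throws.

```java
import javax.annotation.Nullable;

final class IsolationGuard {
  enum Isolation { READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE }

  static void checkNestedOverride(
      Isolation enclosingLevel, @Nullable Isolation nestedOverride) {
    if (nestedOverride != null && nestedOverride != enclosingLevel) {
      throw new IllegalStateException(
          "Nested transaction requested " + nestedOverride
              + " but enclosing transaction runs at " + enclosingLevel);
    }
  }
}
```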
* Fix Cloud Tasks failure to retry
Replace `SC_NOT_MODIFIED` (304) with `SC_SERVICE_UNAVAILABLE` (503) when
data is not available yet. Affected actions are invoice- and spec11-publishing.
It is confirmed that Cloud Tasks currently does not retry with code 304,
despite the public documentation stating so. We will use 503 for now,
pending the decision by Cloud Tasks whether to change behavior or
documentation.
The code `TOO_EARLY` (425) is another alternative. It is not meant for
our use case, but at least it sounds like it is. However, it is not in any
javax.servlet jar. We don't want to define our own constant, and we cannot upgrade
to jakarta.servlet yet.
Also revert previous mitigation.
This includes a bit of refactoring of the GSON creation. Some objects
(e.g. Address) have a JSON representation that is not equal to the
representation we store in the database. For these objects, when
deserializing, we should update the objects so that they reflect the
proper DB structure (indeed, this is already what we do for the XML
parsing of Address).
* Change PackagePromotion to BulkPricingPackage
* More name changes
* Fix some test names
* Change token type "BULK" to "BULK_PRICING"
* Fix missed token_type reference
* Add todo to remove package type
A config file field is added to control whether the per-transaction isolation
level is actually used. If set to true, nested transactions will throw
a runtime exception, as the enclosing transaction may run at a different
isolation level.
In this PR we only add the ability to specify the isolation level,
without enabling it in any environment (including unit tests), or
actually specifying one for any query. This should allow us to set up
the system without impacting anything currently working.
* Mitigate Cloud task retry problem
Increase PublishSpec11Action start delay to avoid the need to retry.
The only other use case is invoice, which typically does not retry:
delay is 10 minutes, pipeline finishes within 7 minutes.
This includes two changes:
1. Creating a base string-type adapter for use parsing to/from JSON
classes that are represented as simple strings
2. Changing the object-provider methods so that the POST bodies should
contain precisely the expected object(s) and nothing else. This way,
it's easier for the frontend and backend to agree that, for instance,
one POST endpoint might accept exactly a Registrar object, or a list
of Contact objects.
Co-Authored-By: gbrodman <gbrodman@google.com>
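A sketch of what such a base string-type adapter could look like in GSON (the generic parse/format plumbing is an assumption, not necessarily the PR's exact shape):

```java
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;
import java.util.function.Function;

/** Adapter for any class that round-trips through a simple string form. */
class StringBasedTypeAdapter<T> extends TypeAdapter<T> {
  private final Function<String, T> parser;
  private final Function<T, String> formatter;

  StringBasedTypeAdapter(Function<String, T> parser, Function<T, String> formatter) {
    this.parser = parser;
    this.formatter = formatter;
  }

  @Override
  public void write(JsonWriter out, T value) throws IOException {
    out.value(formatter.apply(value));
  }

  @Override
  public T read(JsonReader in) throws IOException {
    return parser.apply(in.nextString());
  }
}
```

A DateTime adapter, for instance, could then be registered as `new StringBasedTypeAdapter<>(DateTime::parse, DateTime::toString)` (hypothetical usage).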
2.25.0 contains a breaking change that made HttpStorageOptions not
serializable, which breaks RDE, as it needs to access GCS from Beam.
2.22.6 was the last version used before the Gradle upgrade.
Also had to downgrade google-cloud-nio to pass the tests.
For some inexplicable reason, I had to manually add
guava-listenablefuture as
testRuntimeClasspath/runtimeClasspath/deploy_jar dependencies to the
networking, docs and prober subprojects' lock files, as running
`gradle test --write-locks` would NOT add them and succeed; but without
`--write-locks`, running the corresponding tests would fail.
See: b/294378137.
Use a system property to specify whether this check should be executed.
We will update the presubmit test script to run this check only during
foss-pr.
Add placeholder configs for sending emails using Gmail in GSuite.
The names of the new configs are temporary. After migration they
will revert to the names currently in use by the AppEngine email API.
The instance ID used to be uniquely determined by the App Engine SDK. Since
we no longer call the SDK, we need a way to distinguish instances so
that their metrics do not stomp on each other and result in a time
inversion error (as we have seen frequently in the logs since the
removal of the App Engine SDK).
This includes removing (hopefully temporarily) the gradle-lint plugin as
it is incompatible with various Gradle versions (see
https://github.com/nebula-plugins/gradle-lint-plugin/issues/393). This
is somewhat unfortunate since the plugin is useful for removing unused
dependencies, though with the relatively small amount of Gradle code we
write hopefully it will not be missed much. If Nebula changes their
code to be compatible with Gradle 8+, we can re-add it easily.
This upgrade means we can remove the code added in 342051e1.
* Use Jackson to create and read Tld YAML files
* Add getObjectMapper to TldYamlUtils
* revert lockfiles
* Fix optionals
* Add more tests and javadocs
* small fixes
This is part of the spec in RFC 5730 that we hadn't implemented until now. Note
that this requires changing LoginFlow to be transactional, but I don't think
that should cause any issues.
It was only called in one place (in actual production code), and it was just
slightly obscuring the fact that re-saves can be scheduled for multiple points
in the future, in a way that wasn't amazingly helpful to understanding the
system logic at the callsite.
This includes two changes, the second necessary for testing the first.
1. We add the rdap-queries field as mandated by the amendment to the
registry agreement,
https://itp.cdn.icann.org/en/files/registry-agreement/proposed-global-amendment-base-gtld-registry-agreement-12-04-2023-en.pdf.
This is fairly similar to the whois-queries field where we just query
the logs, but instead of searching for "whois" we search for "rdap".
2. BigQuery doesn't use MAX to refer to the bigger of two fields; MAX
accepts an array as an argument. In order to do what we want (and to
have the BigQuery statements succeed), we need to use GREATEST.
Tested both versions in alpha and production BigQuery instances.
This is part of b/247839944 as a followup to the large bug from
September 2022. As a result of that, there are domains whose
BillingRecurrence objects were closed but the domain wasn't deleted. In
order to avoid having these domains stick around forever without being
billed, we want to restart billing on them whenever their next billing
cycle would have been.
See b/290228682; there are edge cases in which the net_renew would be negative when
a domain is cancelled by superusers during the renew grace period. The correct thing
to do is attribute the cancellation to the owning registrar, but that would require
changing the owning registrar of the corresponding cancellation DomainHistory,
which has cascading effects that we don't want to deal with. As such, we simply
floor the number at zero to prevent any negative value from appearing, which
should have negligible impact, as the edge case happens very rarely: specifically,
when a cancellation happens during the grace period by a registrar other than the
owning one. All the numbers here should be positive to pass ICANN validation.
See b/248035435 for more details / reasoning, but basically this will
make it easier if we ever need to restore user actions in the future (or
figure out which user actions went wrong)
It was used by cron jobs and task queues, which now use OIDC-based auth.
Also renamed and consolidated the auth enums to make them easier to
understand. Ultimately we should get rid of the AuthMethod part, as OIDC
will be the only auth method used.
Based on the updated routing map:
Backend and tools: the only change is that INTERNAL is removed from the allowed
auth methods. Should be a no-op.
Pubapi: INTERNAL is removed from allowed auth. For endpoints that only
allowed INTERNAL before, API and LEGACY become the allowed methods.
However this should not affect anything because regardless of which auth
method is ultimately used, the required auth level is always NONE for
pubapi endpoints. Therefore any auth result is discarded anyway.
Frontend: INTERNAL is removed. RegistryLockVerifyAction has lowered
its required auth level to NONE because it extends HtmlAction, which can
redirect the user to login if necessary. All other endpoints extending
HtmlAction require NONE, so it's better to keep things consistent.
Instead of using REDACTED FOR PRIVACY everywhere, we should just include
the empty string (this is what the spec says, what other gTLD registrars
do, and what the RDAP conformance tool at
https://github.com/icann/rdap-conformance-tool says to do).
In the contact VCards, we omit redacted fields entirely unless the spec
requires that they exist (the version number and an empty 'fn' field).
This also applies to the "handle" field.
Eventually we will probably want to add the redaction extension but
that's not RFCed yet and isn't required for the August RDAP conformance
deadline.
It was previously calling toString() on an Optional<PremiumList>, which was
unnecessarily verbose. The existing premium list is required to be present
anyway.
This is a follow-on to comments in PR #2037. It makes the main loop cleaner and
also removes ambiguities around database handling when the first query is run
with the cursor still empty because no results have been found yet.
This PR changes the two flavors of OIDC authentication mechanisms to
verify the same audience. This allows the same token to pass both
mechanisms. Previously the regular OIDC flavor used the project ID as
its required audience, which does not work for local user credentials
(such as the ones used by the nomulus tool), which require a valid OAuth
client ID as the audience when minting the token (a project ID is NOT a valid
OAuth client ID).
I considered allowing multiple audiences, but the result is not as clean
as just using the same one everywhere, because the fall-through logic would
have generated a lot of noise for failed attempts.
This PR also changes the client side to solely use OIDC token whenever
possible, including the proxy, cloud scheduler and cloud tasks. The nomulus
tool still uses OAuth access token by default because it requires USER level
authentication, which in turn requires us to fill the User table with objects
corresponding to the email address of everyone needing access to the tool.
TESTED=verified each client is able to make authenticated calls on QA with or
without IAP.
* Add an includeDeleted option to RefreshDnsForAllDomainsAction
* Add batching to the query
* Some refactoring
* Make batch size configurable
* Set status to ok
* Combine into one transaction
* Remove smear minutes parameter
* Only pass in lastInPreviousBatch
Add a --canary option (defaulting to false) to the CurlCommand that allows
connection to the canary endpoints.
During canary analysis, only the DEFAULT-canary receives traffic. This
new flag allows us to test other canary services manually using the
curl command.
* Add Gmail Client and set up tests
Add a Gmail client and manually triggered email tests in
CannedScriptExecutionAction.
We will test Gmail with Google Workspace in Sandbox, since Alpha and
Crash are not properly set up for Google Workspace, and we have not
figured out why.
* Remove nested transaction from requestDnsRefresh
* Add a bulk version
* Remove transaction time as field
* Only add delay once
* have PublishDnsUpdatesAction use bulk refresh
The Java code will be added in a followup PR.
Also fixed tests failing due to org.json upgrade: decimal whole numbers
no longer have their fractional parts removed, so currency value strings
must end with ".00" instead of ".0".
When RdeReportAction is invoked without a prefix parameter (as in the
case where it is kicked off by cron jobs for potential catch-ups), we
need to use the same heuristic that's employed in RdeUploadAction to
find the most recent prefix for the given watermark; otherwise the job
will not find any deposits to upload.
Also renamed RdeUtil to RdeUtils, to be consistent with our naming
conventions.
IAP and regular OIDC auth mechanisms are unified under a base class that
produces either APP or USER level AuthResult based on the principal email
found in the OIDC token.
Also moved some enum classes to better organize code structure.
This encompasses most of the basic information that is viewable in the
existing console, basically, just viewing the base info of the Registrar
object.
Use the ApplicationDefaultCredential annotation instead.
The new annotation has been verified in sandbox and production using the
'executeCannedScript' endpoint. The verification code is removed in this
PR too.
This adds a possible configuration point "defaultServiceAccount" (which
in GAE will be the standard GAE service account). If this is configured,
CloudTasksUtils can create tasks with standard HTTP requests with an
OIDC token corresponding to that service account, as opposed to using
the AppEngine-specific request methods.
This also works with IAP, in that if IAP is on and we specify the IAP
client ID in the config, CloudTasksUtils will use the IAP client ID as
the token audience and the request will successfully be passed through
the IAP layer.
Tested in QA.
This column is used by the billing team to create invoices. Registrars
have asked that a single invoice be created for a given registrar,
instead of one per registrar-tld pair. This should have no other effect
on the billing pipeline as the invoice grouping key has a description
field that also contains the TLD, so the granularity as a whole does not
change.
We have been using it as a poor man's timed flag that triggers a system
behavior change after a certain time. We have no foreseeable future use
for it now that the DNS pull queue related code is deleted. If in the
future a need for such a flag arises, we are better off implementing a
proper flag system than hijacking this class this way.
* Prepare switch of credential annotation
Prepare the switch from DefaultCredential to ApplicationCredential.
In nomulus tools, start using the new annotation. This is tested by
successfully using the nomulus curl command, which actually needs a
valid credential to work.
For the remaining use cases of the old annotation in the Nomulus server, add
some code that relies on the new credential to work. Once this code
is tested in sandbox and production, we will switch the annotations.
If the user passes, e.g., `--allowed_nameservers=` (or contact IDs), that
shouldn't mean a list consisting solely of the empty string.
Using this parameter / converter allows us to ensure that lists of
strings look reasonable.
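A sketch of the sanity check (names illustrative): an empty argument value yields an empty list rather than a list containing one empty string.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class StringListConverter {
  static List<String> convert(String value) {
    if (value == null || value.isEmpty()) {
      return Collections.emptyList(); // not a list of one empty string
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }
}
```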
This includes renaming the billing classes to match the SQL table names,
as well as splitting them out into their own separate top-level classes.
The rest of the changes are mostly renaming variables and comments etc.
We now use `BillingBase` as the name of the common billing superclass,
because one-time events are called BillingEvents.
* Allow rotation when updating registrar cert
When updating a registrar's primary cert, add a flag to activate
rotation of previous primary cert to failover.
This functionality is part of the prober ssl cert renewal automation.
The only method that is called from this class is setNumInstances. However, we
don't currently use `nomulus set_num_instances` anywhere. If we need to change
the number of instances, it is either done by updating appengine-web.xml, which
is deployed by Spinnaker, or done manually as a break-glass fix via gcloud
or on Pantheon.
Because we need to check if a contact history is the most recent for its
underlying contact resource, the query-wipeout-repeat loop no longer works
well due to the overhead added by the query.
Instead, we refactor the logic into a Beam pipeline where the query only
needs to be performed once and history entries eligible for wipeout are
handled individually in their own transforms. Because history entries
are otherwise immutable, we can run the pipeline at the relatively relaxed
repeatable read isolation level. We also do not worry about batching for
performance, as we do not anticipate this operation putting a lot of
strain on the particular table.
This also adds README files to explain the two different IDN table locations
(which have different purposes). See http://b/278565478 for more information.
This includes changes to make sure that we use the proper per-TLD IDN
tables as well as setting/updating/removing them via the Create/Update
TLD commands.
There can be situations (anchor tenants, test tokens, other ways of
getting a domain to cost $0) where we may want to delete a domain during
the add grace period but the credit applied is 0. We should not fail on
those cases.
See b/277115241 for an example.
This new table has just been approved by ICANN. It is the same as our existing
Extended Latin table, except with the removal of some lesser-used characters
with diacritic marks that are confusable variants.
The filenames for the IDN tables are made explicit to improve code readability.
And this reverses the removal of G with stroke from the existing Extended Latin
table (see PR #1938), so that that table continues to accurately reflect the
state of our previously launched TLDs.
This is the full list of removed characters:
U+00E1 # LATIN SMALL LETTER A WITH ACUTE
U+0101 # LATIN SMALL LETTER A WITH MACRON
U+01CE # LATIN SMALL LETTER A WITH CARON
U+010B # LATIN SMALL LETTER C WITH DOT ABOVE
U+01E7 # LATIN SMALL LETTER G WITH CARON
U+0123 # LATIN SMALL LETTER G WITH CEDILLA
U+01E5 # LATIN SMALL LETTER G WITH STROKE
U+0131 # LATIN SMALL LETTER DOTLESS I
U+00ED # LATIN SMALL LETTER I WITH ACUTE
U+00EF # LATIN SMALL LETTER I WITH DIAERESIS
U+01D0 # LATIN SMALL LETTER I WITH CARON
U+0144 # LATIN SMALL LETTER N WITH ACUTE
U+014B # LATIN SMALL LETTER ENG
U+00F3 # LATIN SMALL LETTER O WITH ACUTE
U+014D # LATIN SMALL LETTER O WITH MACRON
U+01D2 # LATIN SMALL LETTER O WITH CARON
U+0157 # LATIN SMALL LETTER R WITH CEDILLA
U+0163 # LATIN SMALL LETTER T WITH CEDILLA
U+00FA # LATIN SMALL LETTER U WITH ACUTE
U+00FC # LATIN SMALL LETTER U WITH DIAERESIS
U+01D4 # LATIN SMALL LETTER U WITH CARON
U+1E83 # LATIN SMALL LETTER W WITH ACUTE
U+1E81 # LATIN SMALL LETTER W WITH GRAVE
U+1E85 # LATIN SMALL LETTER W WITH DIAERESIS
U+1EF3 # LATIN SMALL LETTER Y WITH GRAVE
U+017C # LATIN SMALL LETTER Z WITH DOT ABOVE
Makes the next run happen on the first Monday of December, which should give us
plenty of time to fix the issue with it wiping out PII in the most recent
contact history.
* Check allowedEppActions when validating tokens
* Reflect failed tokens in the fee check
* Add tests for domainCheckFlow
* Add hyphens to fee class name
* Add clarifying comment to catch block
* Add specific exception types
* Implement default tokens for the fee extension in domain check
* Add test for expired token
* Add test for alloc token and default token
* Fix formatting
* Always check for default tokens
* Change transaction time to passed in DateTime
Also adds a DnsUtils class to deal with adding, polling, and removing
DNS refresh requests (only adding is implemented for now). The class
also takes care of choosing which mechanism to use (pull queue vs. SQL)
based on the current time and the database migration schedule map.
When submitting tasks to Cloud Tasks, we will use the built-in OIDC
authentication which runs under the default service account (not the
cloud scheduler service account). We want either to work for app-level
auth.
This does nothing for now, but in the future this will allow us to refer
to the RegistryConfig and/or Service objects from the core project. This
will be necessary when changing CloudTasksUtils to not use the AppEngine
built-in connection (it will need to use a standard HTTP request
instead).
It seems like we're hitting App Engine capacity issues resulting in actual pages
now (for whatever reason, but likely one customer), and we obviously don't want
that.
The value of the column is set to START_OF_TIME for new entries.
Every time a row is read, the value is updated to the current time. This
allows concurrent readers to avoid repeatedly reading the same entry with the
earliest request time, because they only look for rows whose process
time is before the current time minus some padding time.
This basically fulfills the same function that LEASE_PADDING gives us
when using a pull queue, where a task would be leased for a certain
time, during which it would not be leased by anyone else.
See: https://cs.opensource.google/nomulus/nomulus/+/master:core/src/main/java/google/registry/dns/ReadDnsQueueAction.java;l=99?q=readdnsqueue&ss=nomulus%2Fnomulus
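A sketch of the padded read; the table/column names and the padding value are illustrative. Rows read recently (processTime within the padding window) are skipped, and claimed rows get their processTime bumped to "now" so concurrent readers will skip them too.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

final class DnsRefreshReader {
  private static final Duration PADDING = Duration.ofMinutes(30); // assumed value

  static void readBatch(Connection conn, int batchSize) throws SQLException {
    Instant cutoff = Instant.now().minus(PADDING);
    try (PreparedStatement select =
            conn.prepareStatement(
                "SELECT id FROM \"DnsRefreshRequest\" WHERE process_time < ? "
                    + "ORDER BY request_time LIMIT ?");
        PreparedStatement claim =
            conn.prepareStatement(
                "UPDATE \"DnsRefreshRequest\" SET process_time = now() WHERE id = ?")) {
      select.setTimestamp(1, Timestamp.from(cutoff));
      select.setInt(2, batchSize);
      try (ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
          claim.setLong(1, rs.getLong(1));
          claim.executeUpdate(); // bump processTime: the "lease"
        }
      }
    }
  }
}
```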
This PR adds an alternative method to upload Lordn to the Nordn server without
using the App Engine pull queue. A new database migration stage is added to control
whether a new task is scheduled with the old or new method. The
NordnUploadAction is configured to process both kinds of tasks. Once the tasks
scheduled with the old method are all processed, we can start using the
new method exclusively.
There is no point comparing the old CRL to the new one when the old one
is invalid. This can happen when the CA cert rotates, after which the
old CRL stops being valid, as it fails signature verification against the
new cert.
This change will allow us to keep updating the CRL after a CA rotation without
having to manually delete the old CRL from the database.
See b/270983553.
ICANN doesn't like this character because it's confusable with a normal G (the
stroke tends to get lost in the visual clutter of the descender), and .com's
Extended Latin table doesn't use it either. Best to get rid of it.
For some reason `./gradlew clean build` on master is failing for me on
multiple machines due to a new org.json:json version triggering license
violations, even though the lock files are not changing.
Note that the old versions are still present because if I remove
"The JSON license", which the old versions use, the check also fails...
We might (likely will) modify some of the fiddly bits around this (maybe
the GSON serialization, where we do the actual authorization, etc) but
this should be a decent basic shell structure for endpoints that the new
registrar console can call to retrieve JSON results.
This is only generated and used if "iapClientId" is set in the proxy
config. If so, we use code similar to
https://cloud.google.com/iap/docs/authentication-howto#obtaining_an_oidc_token_for_the_default_service_account
to generate an ID token that is valid for IAP. We set the token on the
Proxy-Authorization header so that we can keep using the pre-existing
access token as well -- IAP allows for us to use either the
Authorization header or the Proxy-Authorization header.
We were under the mistaken impression before that there was a reliable
way to, out-of-band, get a GAIA ID for a particular email address.
Unfortunately, that isn't the case (at least, not in a scalable way or
one that support agents could use). As a result, we have to allow null
GAIA IDs in the database.
When we (or the support team) create new users, we will only specify the
email address and not the GAIA ID. Then, when the user logs in for the
first time, we will have the GAIA ID from the provided ID token, and we
can populate it then.
See b/260945047.
Also refactored the corresponding tests, which should make future updates easier.
This change should be deployed at or around 2023-02-15T16:00:00Z.
* Modify DomainCreateFlow to check for an applicable defaultPromoToken
* Add handling for deleted tokens
* Change cache to allocation token cache
* Abstract away cache methods
* Use AllocationToken.getAll in create flow
* Filter out empty tokens
This is an 'easy' upgrade that requires a minor change in
common/build.gradle and the removal of an unnecessary import in buildSrc.
Gradle 7.4 and above has breaking changes that break the latest nebula lint plugin. We may have to wait a while.
Due to the way the Beam pipeline is designed, it will expand a
recurring billing event when its event time is in scope for expansion,
instead of its billing time. This means that the OneTime will be generated
45 days earlier. This negates the need to check whether the expansion is
finished when generating monthly invoices.
We will need to backfill the past 45 days of OneTimes before the new
code is deployed. As an illustration, with the old code, a cursor time
of 2023-01-17 means that all auto-renewals whose billing time is before
2023-01-17 were created, which corresponds to an effective cursor time
of 2022-12-03 (45 days before 2023-01-17) for event time. This cursor
will need to be brought to 2023-01-17 to ensure that there is no gap in
generated event times when switching to the new code.
Currently we synthesize an instance ID, which requires the use of the App
Engine Module API. Given that we only have one version of code running
at a time, and HTTP is stateless, there is no point tracking exactly
which GAE "instance" served a request. We do lose information on which service
(default, backend, etc.) is writing the metric, but that does not seem very
important.
Using a constant fake instance ID allows us to get rid of another GAE
dependency.
The temporary queue.xml file was not deleted in the afterEach() method,
which likely caused some of the flaky tests we saw, due to the file being
overwritten by concurrent tests.
Most of its usage can be replaced by JpaIntegrationTestExtension. In
places where specific GAE APIs are still needed, namely when the pull queue
or the User service is used, two simplified extensions are used instead, which
makes those usages much easier to identify when the APIs are no longer used.
We've been using it in production for three weeks now. Everything seems
to be working fine. Removing the code related to checking the migration
state and using the override.
This will replace the ExpandRecurringBillingEventsAction, which has a
couple of issues:
1) The action starts with too many Recurrings that are later filtered out
because their expanded OneTimes are not actually in scope. This is due
to the Recurrings not recording their latest expanded event time, so
many Recurrings that are not yet due for renewal get included
in the initial query.
2) The action works in sequence, which exacerbates the issue in 1) and
makes it very slow to run if the window of operation is wider than
one day, which in turn makes it impossible to run any catch-up
expansions with any significant gap to fill.
3) The action only expands the recurrence when the billing time becomes
due, but most of its logic works on event time, which is 45 days
before billing time, making the code hard to reason about and
error-prone. This has led to b/258822640, where a premature
optimization intended to fix 1) caused some autorenewals to not be
expanded correctly when subsequent manual renews within the autorenew
grace period closed the original recurrence.
As a result, the new pipeline addresses the above issues in the
following way:
1) Update the recurrenceLastExpansion field on the Recurring when a new
expansion occurs, and narrow down the Recurrings in scope for
expansion by only looking for the ones that have not been expanded for
more than a year (see the query sketch after this list).
2) Make it a Beam pipeline so expansions can happen in parallel. The
Recurrings are grouped into batches in order to not overwhelm the
database with writes for each expansion.
3) Create new expansions when the event time, as opposed to the billing
time, is within the operation window. This streamlines the logic and
makes it clearer and easier to reason about. It also aligns with
how other (cancellable) operations with accompanying grace periods
are handled, where the corresponding data is always
speculatively created at event time. Lastly, doing this negates the
need to check whether the expansion has finished running before generating
the monthly invoices, because the billing events are now created not
just-in-time, but 45 days in advance.
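A rough sketch of the narrowing query from point 1 (entity and field names are approximations, not the pipeline's actual code):

```java
import org.joda.time.DateTime;

final class RecurrenceExpansionQuery {
  /** Only recurrences last expanded over a year ago, and not yet ended, are candidates. */
  static String jpql() {
    return "FROM BillingRecurrence WHERE recurrenceLastExpansion < :oneYearAgo "
        + "AND recurrenceEndTime > :executionStartTime";
  }

  static DateTime oneYearAgo(DateTime executionStartTime) {
    return executionStartTime.minusYears(1);
  }
}
```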
Note that this PR only adds the pipeline. It does not switch the default
behavior to using the pipeline, which is still done by
ExpandRecurringBillingEventsAction. We will first use this pipeline to
generate missing billing events and domain histories caused by
b/258822640. This also allows us to test it in production, as it
backfills data that will not affect ongoing invoice generation. If
anything goes wrong, we can always delete the generated billing events
and domain histories, based on the unique "reason" in them.
This pipeline can only run after we switch to use SQL sequence based ID
allocation, introduced in #1831.
We can use the saved refresh token associated with the nomulus tool to
request an ID token with an audience of the IAP client in order to
satisfy IAP with the Nomulus tool.
Note: this requires that the user of the Nomulus tool, e.g.
"gbrodman@google.com", has a User object stored in SQL.
Tested on QA.
This is similar to where we store the SQL files for beam pipelines, and
frankly makes more sense. Also streamlined the use of the API to read
SQL files from a jar.
The AppEngine thread factory is only useful if we can't create our own
(this is no longer the case) or if we need access to AppEngine APIs
(this is no longer the case).
The Concurrent class is only used by the DNS writer and the
CreateGroupsAction.
This parameter is misleading and does not do what it purports to do.
Namely, it does not impact the level of parallelism. Given the input n for this
parameter, and m for the batch size, the elements are divided (keyed) into n
groups, each of which is then spread evenly across all threads, and
eventually batched into batches of size m:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L227
This is also evident in the implementation itself, where the ShardedKey
is determined by the unique number for a worker/thread combo and the
original key:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L268
Using a more concrete example: suppose we have 100 elements and 10
worker threads, with a target batch size of 5. If the "shard" number is set to
1, we first spread the 100 elements across 10 threads, resulting in 10
elements per thread; each thread then batches the elements into 2
batches of size 5.
If the "shard" number is set to 2, the 100 elements are first divided into 2
"shards" of 50 each. Each "shard" is then distributed within the 10
threads, resulting in 5 elements per "shard" per thread. They then get
turned into 1 batch per "shard" per thread. In the end, each thread
still processes 2 batches, even though they are from 2 different "shards".
Therefore this "shard" number does not perform the horizontal partitioning
that one normally associates with sharding, and provides no
performance benefit but rather confuses the user.
It is also suggested that using withShardedKey() alone is already
sufficient to achieve auto-sharding within the keyed group. There is no
need to manually divide the input by keying them differently based on
the "shard" number specified:
https://youtu.be/jses0W4Zalc?t=967
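For reference, a minimal Beam sketch of the suggested approach, relying on withShardedKey() alone (types follow recent Beam releases; the key/value types and batch size are arbitrary):

```java
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.util.ShardedKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

final class BatchingSketch {
  private static final int BATCH_SIZE = 5;

  // Group into batches of BATCH_SIZE, letting the runner shard hot keys
  // automatically; no manual "shard number" keying is needed.
  static PCollection<KV<ShardedKey<String>, Iterable<Long>>> batch(
      PCollection<KV<String, Long>> keyed) {
    return keyed.apply(
        "BatchElements",
        GroupIntoBatches.<String, Long>ofSize(BATCH_SIZE).withShardedKey());
  }
}
```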
Some retriers are no longer needed because transactions are
automatically retried by the JPA transaction manager when there's a
transient exception.
We no longer use App Engine Remote API as of #1858.
The pipeline endpoint is only for GAE MapReduce, which we stopped using
a while ago.
TESTED=deployed to alpha and used nomulus tool built from master to
connect to the tools service.
* Add defaultPromoTokens to Registry
* Remove flyway files from this PR
* Fix merge conflicts
* Add back flyway file
* Add more info to error messages
* Change to a list
* Fix javadoc
* Change error message
* Add note to field declaration
These were always properly reflected in the actual creations, but when
running a check flow it would still show a non-zero cost even when using
an ANCHOR_TENANT allocation token. This changes it so that we accurately
show the $0.00 cost.
This is only used for contacting Datastore. With the removal of:
1. All standard usages of Datastore
2. Usage of Datastore for allocation of object IDs
3. Usage of Datastore for GAE user IDs
we can remove the remote API without affecting functionality.
This also allows us to just use SQL for every command since it's lazily
supplied. This simplifies the SQL setup and means that we remove a
possible situation where we forget the SQL setup.
So long, farewell, adios, ciao, sayonara, 再见!
TESTED=deployed to alpha and used `nomulus list_tlds` to confirm that the web app can receive and serve requests.
With newer versions of Java 11, javadoc fails to build due to unknown
tags in package-info.java files. These files are not important so we
exclude them.
Switch to using the login email address instead of the GAE user ID to
identify console users. The primary use cases are:
1) When the user logs in to the registrar console, we need to figure out
which registrars they have access to (in
AuthenticatedRegistrarAccessor).
2) When a user tries to apply a registry lock, we need to know if they
can (in RegistryLockGetAction).
Both cases are tested in alpha with a personal email address, to ensure
the permission is not granted merely due to being a GAE admin account.
Also verified that the soy templates include the hidden login email
form field instead of the GAE user ID when registrars are displayed on the
console; consequently, when a contact update is posted to the server,
the login email is part of the JSON payload, even though it does not
look like it is used in any way by RegistrarSettingsAction, which
receives the POST request. Like the GAE user ID, the field is hidden, so
it cannot be changed by the user from the console; it is also not used to
identify the RegistrarPoc entity, whose composite keys are the contact
email and the registrar ID associated with it.
The login email address is backfilled for all RegistrarPocs that have a
non-null GAE user ID. The backfilled addresses convert to the same IDs
as those stored in the database.
It doesn't actually use any App Engine libraries or code -- it's just a
generic connection with authentication to a service. This also involves
changing that block of config to be "gcpProject" instead of "appEngine"
since it's more generic.
Note: this will require an internal PR as well to change the
corresponding private config block
* Remove package token on manual transfer approval
* remove extra variables
* Add back white space
* Don't overwrite existingDomain
* Format fixes, use available helper variables
* Use PACKAGE allocation tokens in tests
* Refactor
* Fix merge conflicts
* Dont overwrite existingRecurring
* Remove names from packages on automatic transfers
* Add more tests
* Remove unnecessary local variable
* Eliminate unnecessary api call
* Reformat if blocks
* Don't overwrite existingRecurring
This PR fixes the issue where the onetime billing event for an autorenew
is not correctly created if the recurrence of the autorenew is closed
during the autorenew grace period, as is the case when a manual renew
happens during the same grace period.
The detailed analysis of the issue is captured in b/258822640. Note that
this is a quick and dirty fix to make ongoing billing event expansion work
correctly in the future. It does not fix the missing events in the past,
nor can it be used to reconstruct the missing ones (by providing a
different cursor time), due to timeouts when triggering the action from
nomulus curl.
Per Weimin, the recurrences that fit the new condition alone, based on
the current cursor, would increase from 382k to 430k, a 12% increase. I
checked last night's cron job run, which starts at 22:00 EST and seemed
to finish at 22:15 EST (when the last log for this request was
recorded), so it should definitely still finish in time for the nightly
runs with the new condition.
This is not a complete removal of ofy, as we still have a dependency on it
(GaeUserIdConverter). But this PR removes it from a lot of places where
it's no longer needed.
* Fix Gradle dependency version pinning
In Gradle 7, version labels require '!!' at the end to be free from
any forced upgrade.
Hibernate min version needs to be advanced past 5.6.12, which is buggy.
Upgraded most dependencies to the latest version.
Therefore this PR also removed several classes and related tests that
support the setup and verification of Ofy entities.
In addition, support for creating a VKey from a string is limited to
VKey<? extends EppResource> only because it is the only use case (to
pass a key to an EPP resource in a web safe way to facilitate resave),
and we do not want to keep an extra simple name to class mapping, in
addition to what persistence.xml contains. I looked into using
PersistenceXmlUtility to obtain the mapping, but the xml file contains
classes with the same simple name (namely OneTime from both PollMessage
and BillingEvent). It doesn't seem like a worthwhile investment to write
more code to deal with that, when the fact is that we only need to
consider EppResource.
* Remove Cloud KMS from Nomulus Server
Removed Cloud KMS from the Nomulus server (:core) since it is no longer used.
Renamed remaining classes to reflect their use of the SecretManager.
Updated the config instructions to use a new codename for the keyring:
KMS to CSM. This PR works with both codenames. Will drop 'KMS' after
the internal repo is updated.
Add an implementation of a delegated credential that does not use a downloaded private key.
This is a stop-gap implementation while waiting for a solution from the Java auth library.
Also added a verifier action to test the new credential in production. Testing is helpful because:
- Configuration is per-environment; therefore, success in alpha does not fully validate prod.
- The relevant use case is triggered by low-frequency activities, so problems may not surface for hours or longer.
This PR removes all Ofy related cruft around `HistoryEntry` and its three subclasses in order to support dual-write to datastore and SQL. The class structure was refactored to take advantage of inheritance to reduce code duplication and improve clarity.
Note that for the embedded EPP resources, either their columns are all empty (for pre-3.0 entities imported into SQL), including their unique foreign key (domain name, host name, contact id) and the update timestamp; or they are filled as expected (for entities that were written since dual writing was implemented).
Therefore the check for foreign key column nullness in the various `@PostLoad` methods in the original code is a no-op, as the EPP resource would have been loaded as null. In other words, there is no case where the update timestamp is null but the other columns are not.
See the following query for the most recent entries in each table where the foreign key column or the update timestamp are null -- they are the same.
```
[I]postgres=> select MAX(history_modification_time) from "DomainHistory" where update_timestamp is null;
max
----------------------------
2021-09-27 15:56:52.502+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "DomainHistory" where domain_name is null;
max
----------------------------
2021-09-27 15:56:52.502+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "ContactHistory" where update_timestamp is null;
max
----------------------------
2021-09-27 15:56:04.311+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "ContactHistory" where contact_id is null;
max
----------------------------
2021-09-27 15:56:04.311+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "HostHistory" where update_timestamp is null;
max
----------------------------
2021-09-27 15:52:16.517+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "HostHistory" where host_name is null;
max
----------------------------
2021-09-27 15:52:16.517+00
(1 row)
```
These alternative ORMs were introduced as a way to make querying large numbers of
domains and domain histories more efficient through bulk loading from several
to-be-joined tables separately, then in-memory re-assembly of the final entity,
bypassing the need to query multiple tables each time an entity is queried.
Their primary use case was loading these entities for comparison between
Datastore and SQL during the migration, which has been completed. The
code remains unused as of now, and its existence makes refactoring and
general maintenance more complicated than necessary due to the need to keep
it up to date.
Therefore we remove the related code.
Also removed the ability to disable update timestamp auto update as it
was only needed during the migration.
Lastly, rectified the use of raw Coder in RegistryJpaIO.
Passing in an already-existing instance is an antipattern because it can
lead to race conditions where something else modified the object in
between when the pipeline loaded it and when you're saving it. The Write
action should only be writing new entities.
We cannot check IDs for the objects (some IDs are not autogenerated so
they might exist already). We also cannot call `insert` on the objects
because the underlying JPA `persist` call adds the input object to the
persistence context, meaning that any modifications (e.g.
updateTimestamp) are reflected in the input object. Beam doesn't allow
modification of input objects.
* Does not self-allocate IDs in Beam by default.
Per b/250948425, it is dangerous to implicitly allow all Beam pipelines
to create buildables by self allocating the IDs. This change makes it so
that one has to explicitly request self allocation in Beam.
A boolean is added to the pipeline option so that it can be passed to
the beam worker initializer that controls the behavior of the JVM on
each worker. Note that we did not add the option in the metadata.json file
because we did not want people to use the override at run time when launching
a pipeline, due to the risk. As shown in RdePipeline.java, we instead
explicitly hard-code the option in the pipeline. There is nothing that
stops one from supplying that option when launching the pipeline, but it's
not advised.
Tested=deployed the pipeline to alpha and ran it.
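As a rough sketch (the interface and option names here are hypothetical, not the project's actual pipeline options class), an opt-in flag like this can be declared as a standard Beam pipeline option that defaults to off:

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

/** Sketch of a pipeline option gating ID self-allocation on Beam workers. */
public interface IdAllocationOptions extends PipelineOptions {

  @Description(
      "Whether Beam workers may self-allocate entity IDs. Deliberately left out of"
          + " metadata.json so it cannot be casually supplied at launch time.")
  @Default.Boolean(false)
  boolean getAllowSelfAllocatedIds();

  void setAllowSelfAllocatedIds(boolean allowSelfAllocatedIds);
}
```

A pipeline that genuinely needs self-allocation would then call `options.setAllowSelfAllocatedIds(true)` in its own construction code, mirroring the hard-coding approach described above.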
* Properly create and use default credential
This PR consists of the following changes:
- Stopped adding scopes to the default credential when using it to access other
non-workspace GCP APIs. Scopes are not needed here.
- Started applying scopes to the default credential when using to access
Drive and Sheets APIs.
- Upgraded Drive access from the deprecated credential lib to the
up-to-date one
- Switched Sheet access from the exported json credential to the
scoped default credential.
This PR requires that the affected files be writable to the default
service account (project-name@appspot.gserviceaccount.com) of the
project.
- This is already the case for exported files (premium terms, reserved
terms, and domain list).
- The registrar sync sheets in alpha, sandbox, and production have been
updated with the new permissions.
All impacted operations have been tested in alpha.
* Properly create and use default credential
This PR consists of the following changes:
- Added a new method to generate scope-less default credential when using it to
access other non-workspace GCP APIs. Scopes are not needed here.
- Started to use the new credential in the SecreteManager.
- Will migrate other usages to this new credential gradually.
- Marked the old DefaultCredential as deprecated.
- Started applying scopes to the default credential when using to access Drive
and Sheets APIs.
- Upgraded Drive access from the deprecated credentials lib
- Switched Sheet access from the exported json credential to the scoped
default credential.
This PR requires that the affected files be writable to the default service
account (project-name@appspot.gserviceaccount.com) of the project.
- This is already the case for exported files (premium terms, reserved terms,
and domain list).
- The registrar sync sheets in alpha, sandbox, and production have been
updated with the new permissions.
All impacted operations have been tested in alpha.
This reverts commit 1ab077d267.
Apparently the new version of Spinnaker that is compatible with this doesn't
work for our release, so we need to roll this back for now. (Again!)
We started getting failures because some of the tests used October. In
general we should freeze the clock for testing as much as possible.
Same thing with the Get*Commands
Basically, any time we're loading a bunch of linked objects that might
change, we want to have REPEATABLE_READ so that another transaction
doesn't come along and smush whatever we think we're loading.
The following instances of READ_COMMITTED haven't changed:
- RdePipeline (it only loads immutable objects like histories)
- Invoicing pipeline (only immutable objects like BillingEvents)
- Spec11 (doesn't use any linked info from Domain)
This also changes the PersistenceModule to use REPEATABLE_READ by
default on the replica JPA TM, for the standard reasoning.
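For illustration, here is a minimal sketch (not the actual PersistenceModule wiring) of how a JPA transaction can be pinned to REPEATABLE_READ by setting the isolation level on the underlying JDBC connection via Hibernate's `Session.doWork()`:

```java
import java.sql.Connection;
import javax.persistence.EntityManager;
import org.hibernate.Session;

// Sketch: set REPEATABLE_READ on the JDBC connection before starting the
// transaction, so concurrent writers can't change linked rows mid-load.
static void transactAtRepeatableRead(EntityManager em, Runnable work) {
  em.unwrap(Session.class)
      .doWork(conn -> conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ));
  em.getTransaction().begin();
  try {
    work.run();
    em.getTransaction().commit();
  } catch (RuntimeException e) {
    em.getTransaction().rollback();
    throw e;
  }
}
```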
This runs over all domains that weren't deleted as of September 5. This
will fix most of b/248112997, which is itself caused by b/245940594 --
creating synthetic history objects means that the RDE pipeline will look
at those instead of the potentially-no-longer-valid data in the old
history objects.
Also fixed a bug introduced in #1785 where identity checks were performed instead of equality checks. This resulted in two sets containing the same elements not being regarded as equal, and subsequent DNS updates being unnecessarily enqueued.
The old class was modeled after Datastore, with some logic jammed in for it to work with SQL as well. As of #1777 the ofy-related logic is deleted, but the general structure of the class remained Datastore-oriented.
This PR refactors the existing class into a ForeignKeyUtils helper class that does away with the index subclasses and provides static helper methods to do the same, in a SQL-idiomatic fashion.
Some minor changes are made to the EPP resource classes to make it possible to create them in a SQL only environment in tests.
`TransferData` is currently a generic class with a complicated type parameter that designates the `Builder` class of its concrete subclass, in order to facilitate returning said `Builder` from an instance loosely typed to the superclass (`TransferData`) itself.
While this works, in almost all places where a `TransferData` is used, the raw, un-generic type is declared, resulting in a lot of warnings, not to mention the fact that type safety is not actually checked when the raw type is used.
In this PR, we make it so that the concrete `Builder` is returned through a protected abstract method that is implemented by the subclasses. The type information therefore no longer needs to be embedded in the superclass type signature, and reflection is not necessary to create the `Builder` either. Overall, it makes `TransferData` a much cleaner class without the messiness of generics.
This uses the GoogleIdTokenVerifier to verify ID tokens passed in
(presumably from a front end) via cookies. This isn't used anywhere yet
but it will be used for front-end API calls for the new console.
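Roughly, the verification looks like this (the audience client ID and the cookie plumbing are placeholders, not the actual console configuration):

```java
import com.google.api.client.googleapis.auth.openidconnect.GoogleIdToken;
import com.google.api.client.googleapis.auth.openidconnect.GoogleIdTokenVerifier;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import java.util.Collections;
import java.util.Optional;

// Sketch: verify the ID token string pulled from the cookie and, if valid,
// extract the login email from its payload.
static Optional<String> verifiedEmail(String idTokenFromCookie, String oauthClientId)
    throws Exception {
  GoogleIdTokenVerifier verifier =
      new GoogleIdTokenVerifier.Builder(new NetHttpTransport(), new GsonFactory())
          .setAudience(Collections.singleton(oauthClientId))
          .build();
  GoogleIdToken idToken = verifier.verify(idTokenFromCookie); // null if invalid/expired
  return Optional.ofNullable(idToken).map(t -> t.getPayload().getEmail());
}
```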
First, we create a sequence of User IDs in Postgres and assign it to the
User ID field, meaning that Hibernate can autogenerate IDs.
Next, add an update timestamp.
Next, add a constraint that we can't have multiple Users with the same
email address.
Finally, create a DAO since we'll usually want to query by that email
address (at least for now).
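In JPA terms the shape is roughly the following (field, sequence, and constraint names are illustrative, not the actual entity):

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;
import javax.persistence.Table;
import javax.persistence.UniqueConstraint;

// Sketch: a Long ID generated from a Postgres sequence so Hibernate can
// autogenerate it, plus a uniqueness constraint on the email address.
@Entity
@Table(uniqueConstraints = @UniqueConstraint(columnNames = "emailAddress"))
public class User {

  @Id
  @SequenceGenerator(name = "user_id_gen", sequenceName = "user_id_sequence", allocationSize = 1)
  @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "user_id_gen")
  Long id;

  @Column(nullable = false)
  String emailAddress;
}
```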
The ForeignKeyIndex (FKI) used to be persisted in Datastore to help speed up loading by foreign key.
Now it is just a helper class that does the same thing in SQL, because
indexing is natively supported in SQL.
- Create a sequence to generate IDs for the user (this allows us to have
Long ID types so that Hibernate can autogenerate IDs)
- Add an update timestamp column so we can extend BackupGroupRoot
- Add a restriction that there can't be multiple users with the same
email address
* Rename ContactResource -> Contact
This is a follow-up to PR #1725 and #1733. Now all EPP resource entity class
names have been rationalized to match with their SQL table names.
This means that LegacyAuthenticationMechanism (or a to-be-created
OAuth2AuthenticationMechanism) can return a UserAuthInfo object that
contains either the GAE User or the console User as appropriate. The
goal is that the non-auth flows shouldn't have to know about which user
type it is. Note: the registry lock flow (for now) needs to know about
the separate types of auth because it is a separate level of auth from
the standard AuthenticatedRegistrarAccessor.
The AuthenticatedRegistrarAccessor code is a bit odd because the new
role system doesn't quite fit neatly into the old registrar ->
OWNER,ADMIN system but this is a fine approximation. Basically, any
new registrar role will map to the old OWNER role.
* Add the PackagePromotion table
* Add long id
* Add NOT NULL
* fix formatting
* make package price non null
* Add not nulls to java file
* Fix broken tests from merge conflicts
Also deletes the autorenew poll message history revision id field in
Domain, which is only needed to recreate the ofy key for the poll
message. The column already contains null values in it, making it
impossible to depend on it. The column itself will be deleted from the
schema after this PR is deployed.
The logic to update the autorenew recurrence end time is changed
accordingly: when a poll message already exists, we simply update the
end time, but when it no longer exists, i.e. when it was deleted
speculatively after a transfer request, we recreate one using the
history entry id that resulted in its creation (e.g. a cancelled or rejected
transfer).
This should fix b/240984498. Though the exact reason for that bug is
still unclear to me. Namely, it throws an NPE at this line during an
explicit domain transfer approval:
https://cs.opensource.google/nomulus/nomulus/+/master:core/src/main/java/google/registry/flows/domain/DomainFlowUtils.java;l=603;bpv=1;bpt=0;drc=ede919d7dcdb7f209b074563b3d449ebee19118a
The domain in question has a null autorenewPollMessageHistoryId, but
that in itself should not have caused an NPE because we are not
operating on the null pointer. On that line the only possible way to
throw an NPE is for the domain itself to be null, but if that were the
case, the NPE would have been thrown at line 599 where we called a
method on the domain object.
Regardless of the cause, with this PR we are using an explicitly
provided history id and checking for its nullness before using it. If a
similar issue arises again, we should have a better idea why.
Lastly, the way the poll message id is constructed is largely simplified in
PollMessageExternalKeyConverter as a result of the removal of ofy parent
keys in PollMessage. This does present a possibility of failure when,
immediately before deployment, a registrar requests a poll message and
receives the old id, but by the time the registrar acks the id, the new
version is deployed and therefore does not recognize the old key. The
likelihood of this happening should be slim, and we could have prevented
it by letting the converter recognize both the old and the new key.
However, we would like to eventually phase out the old key, and in
theory a registrar could ack a poll message at any time after it was
requested. So, there is no safe time by which all the old ids will have
been acked, unless we develop some elaborate scheme to keep track of which
messages were sent with an old id when requested and which of those old
ids have been acked. Only then could we be truly safe in phasing out the old id.
The benefit does not seem to warrant the effort. If a registrar does
encounter a situation like this, they could open a support bug to have
us manually ack the poll message for them.
Basically, what's happening here is that some flow tests are adding
things to the claims list cache which is stored statically, meaning that
some other tests can pick those up when they shouldn't. By adding the
extension in RFTC, it'll clear out the caches after each test.
* Allow anchor tenant creation via allocation token behavior
This also enforces that non-superusers cannot create registrations on
trademarked names prior to the sunrise period, even if they have an
allocation token with ANCHOR_TENANT behavior.
* Create a registry lock permission and corresponding account manager role
This allows us to distinguish between standard account managers and
users that might have the registry lock permission. This will make the
registry lock password-setting flow easier (user can reset their
password iff they have the REGISTRY_LOCK permission, instead of having a
separate boolean) and allows us to easily determine whether or not a
user should have access to registry lock views in the UI.
We cache the ClaimsList Java object for six hours (we don't expect it to
change frequently, and the cron job to update it only runs every twelve
hours). Subsequent calls to ClaimsListDao::get will return the cached
value.
Within the ClaimsList Java object itself, we cache any labels that we
have retrieved. While we already have a form of a cache here in the
"labelsToKeys" map, that only handles situations where we've loaded the
entire map from the database. We want to have a non-guaranteed cache in
order to make repeated calls to getClaimKey fast.
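A compressed sketch of the two cache levels described above (the class and helper names here are made up; the real code lives in ClaimsListDao and the ClaimsList entity):

```java
import com.google.common.base.Supplier;
import com.google.common.base.Suppliers;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

class CachedClaimsList {

  // Level 1: the whole ClaimsList object, reloaded at most every six hours.
  private static final Supplier<CachedClaimsList> CACHE =
      Suppliers.memoizeWithExpiration(CachedClaimsList::loadFromDatabase, 6, TimeUnit.HOURS);

  // Level 2: per-label claim keys fetched lazily, so repeated getClaimKey()
  // calls for the same label hit memory instead of the database.
  private final ConcurrentHashMap<String, Optional<String>> labelCache = new ConcurrentHashMap<>();

  static CachedClaimsList get() {
    return CACHE.get();
  }

  Optional<String> getClaimKey(String label) {
    return labelCache.computeIfAbsent(label, CachedClaimsList::lookupLabelInDatabase);
  }

  private static CachedClaimsList loadFromDatabase() {
    return new CachedClaimsList(); // stand-in for the actual DB query
  }

  private static Optional<String> lookupLabelInDatabase(String label) {
    return Optional.empty(); // stand-in for a single-label DB query
  }
}
```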
* Use the new renewal price logic in transfer flow
* Fix build
* Add renewal handling on all transfer flows
* Merge branch 'master' into transfer-retain-renewal-price
* Merge branch 'master' into transfer-retain-renewal-price
* Add more tests
* Add base object classes for new user/role permissioning model
- Adds the permissions themselves
- Adds the six roles that a user may have -- three global, three
per-registrar
- Adds the mapping from role -> set of permissions
- Adds a UserRoles object to encapsulate the answer to the question of
"does this user have this permission?"
- Adds a User class as a base to show how we will use the new UserRoles
object
core.jar is no longer included in the generated WAR, even though the deploy_jar
configuration is specified as a dependency.
I could not figure out a way to tweak the configuration dependency to
have core.jar pulled into the .war, so I decided to just explicitly pick
it from its known location.
TESTED=deployed to alpha and verified that the instances can start.
* Rename DomainContent -> DomainBase
This is a follow-up to PR #1725 which renamed DomainBase to Domain. Now, the
class naming hierarchy has the same structure as ContactBase/HostBase.
* Rename DomainBase -> Domain
This was a long time coming, but we couldn't do it until we left Datastore, as
the Java class name has to match the Datastore entity name.
Subsequent PRs will rename ContactResource to Contact and HostResource to Host,
so that everything matches the SQL table names (and is shorter!).
* Merge branch 'master' into rename-domainbase
The fix is based on b/240627423. I tested locally and was able to build
with the -PenableCrossReferencing=true flag successfully.
TESTED=ran the Kythe GCB pipeline locally.
This PR turns out to be more massive than I would have liked, but it
deals with all billing event related stuff, which is more or less all
intertwined:
* Remove all billing events as Ofy entities.
* Add a temporary annotation to allow BillingEvent's ID to be
auto-allocated by ofy while not lacking the Ofy @Id annotation.
* Remove Modification, which is only used in ofy.
* Remove BillingVKey, as we do not need to store the ofy key parent
information anymore. The VKey for a billing event now just contains
its primary key, and can be converted by VKeyConverter.
* Remove BigQuery related code in the billing pipeline.
Note that after BillingVKey is removed, several columns in
BillingCancellation are no longer needed. The change to database schema
will be handled in https://github.com/google/nomulus/pull/1721 after
this PR is deployed to production.
* Revert "Upgrade App Engine Standard to Java 17 w/ bundled APIs (#1714)"
This partially reverts commit d8e77e2ab2 (it keeps
intact unrelated version upgrades).
We need to temporarily revert this because Spinnaker isn't quite yet playing
nice with the new <app-engine-apis> configuration option in appengine-web.xml
(it seems like this was added recently and Spinnaker is still stuck on App
Engine SDK version 1.9.82 which predates it). Hopefully we can get that
dependency updated in Spinnaker soon and then we can re-upgrade to Java 17.
* Flyway files for adding allocationToken column to Domain table
* Rename column to current_package_token
* Update er diagram
* Add foreign key to DomainHistory
This shouldn't matter for billing or anything like that, because the
actual actions performed that month are still correct; but before this
PR we were including all domains ever created in the total_domains number,
including deleted domains.
* Upgrade App Engine Standard to Java 17 w/ bundled APIs
Note that this doesn't yet upgrade our actual Gradle scripts to use a more
recent version of Java (that will happen separately); this solely affects the GAE
instances.
I followed the instructions here:
https://cloud.google.com/appengine/docs/standard/java-gen2/services/access
And note that I removed threadsafe=true from the appengine-web.xml config because
that doesn't do anything anymore and was just throwing errors (the new
instances handle multiple requests in parallel by default, no configuration
necessary).
* Convert to gradle 7.
* More fixes, regenerated lockfiles.
* Update lockfiles for dependency update.
* Fix show_upgrade_diff for new lockfile format
* Add property for allowInsecureProtocol
Allow us to override the restriction against use of plain HTTP for
communication to dependency repositories. We need this to be able to use a
local proxy for dependency gathering.
* Checking in missing gradle.lockfile
* Split failing dns update batches and kill after 10 retries
* format fixes
* Add another test
* Switch to CloudTasks
* Change back to app engine header
* Change to immutableList and other changes
* Change to optional header
* Add bug ID to todo
* Switch to constructor injection
* Remove old queue
* Set response status
* Change to Optional<Integer>
* Rename action status
* Switched to use CloudTasksHelper
* Remove spy in test
This is, as of now, unused but we can use it for b/237683906 and
b/237800445 in the future to allow for special behavior dictated by
allocation tokens rather than having to reserve specific domains.
Note that we enforce a tied domain for ANCHOR_TENANT tokens (because
they should be matched to a domain) but not for BYPASS_TLD_STATE tokens.
This deletes LastSqlTransaction and ReplayGap in Datastore, as well as
SqlReplayCheckpoint and TransactionEntity in SQL. These are all related
to replay and are no longer used.
* Remove columns for unused ofy key reconstitution
Remove the columns in Domain, DomainHistory, GracePeriod and
GracePeriodHistory that were only used for Ofy key reconstitution.
All uses of these were removed in #1660.
* Add forgotten flyway file.
* Re-add database migration state commands
These were removed in PR #1661, but we do still need them for the time being
until we complete the ID migration as well.
This uses the Apache Commons CSV parser instead of rolling our own.
Annoyingly, the results that we're given aren't exactly proper CSVs,
since they have a non-standard line of data at the top, and the header
is actually the second line.
Fortunately this no longer requires a log-in, we can just send a GET
request and receive a CSV result in return.
This also adds the apache-commons CSV parser to the dependencies
See https://b.corp.google.com/issues/237784559 for more details
* Resolve conflict
* Fix setup for existing test cases in info and check flow
* Revise info flow test cases
* Fix lint
* Merge branch 'master' into handlefeerequest-renew
* Address code review comments myself
* Merge branch 'master' into handlefeerequest-renew
* Get test passing
* Add check flow tests
* Format, consolidate test helpers
* Don't unnecessarily specify XML name
Now that ofy keys aren't necessarily being restored, it seemed prudent to
audit existing uses to verify that we aren't relying on any keys that may now
be null.
This fixes one case that appeared to be potentially problematic (in
ResourceFlowUtils), removes a few methods that call getOfyKey() but are no
longer used, and adds comments to one use of the key that appears to be safe
after visual inspection.
This includes:
- deletion of helper DB methods in tests
- deletion of various old Datastore-only classes and removal of any
endpoints
- removal of the dual-database test concept
- removal of 'ofy' from the AppEngineExtension
This removes the code that converts between ofy fields and SQL fields in DomainContent and a number of related core classes (basically anything that also needed modification to support the removal from DomainContent).
Cursor was originally envisioned to support arbitrary ImmutableObject
scopes. However, in practice only the Registry scope is used. The SQL
representation of Cursor assumes that, and the schema uses a composite ID
with a string column for the primary key of the scope object. Without a
schema migration to persist the VKey of the scope, we cannot support any
ImmutableObject other than those with a primitive string primary key.
Given the complexity involved and the limited use case, the scope is now
explicitly limited to Registry only.
Also removed mapreduces that depend on Ofy keys of Cursors, and made
some code quality improvements based on IntelliJ suggestions on the modified
files.
These tests use Ofy exclusively and should not run anymore, as any class
they test also uses Ofy and should be deleted.
More importantly, running tests in Ofy mode makes it hard to remove Ofy
from entities, especially Registrar and RegistrarContact, as some of
them are created as canned data when tests are initiated, and the
creation would fail if they were not registered as Ofy entities.
It is therefore a prerequisite to disable these tests before we can
remove Ofy from those entities. We could have deleted them, but I think
that should be done when the corresponding classes tested by them are
deleted.
GPG1 is deprecated and stuck at v1.4 from 2018; GPG2 is recommended. We
only use the GPG binary in tests, and when the host system has both
versions it causes problems, because we hardcode the GPG import command
in GpgSystemCommandExtension to use the binary named "gpg", which could
be linked to either GPG1 or GPG2, causing the test to fail when
the version of GPG that runs in tests is incompatible with the version of GPG
that imports the keys.
With this PR we only support GPG2 from now on.
Added methods that exist in AppEngineExtension that preload some canned
data. This data is loaded by default, and a lot of tests rely on it. As
we migrate away from App Engine, it is helpful to have these methods in the JPA
test extension, which will replace the App Engine extension that is
used to set up the database in dual-database tests.
I'm not 100% sure that this is strictly necessary, but for now we can
replicate the ability to generate zonefiles for any point in time in the
recent past.
* Migrate ReadDnsQueueAction to use CloudTasksUtils
Also marked TaskQueueUtils as deprecated and fixed a few linter errors.
Note that DNS pull queue still requires the use of the GAE Task Queue API.
* Fix a test failure
* Remove TaskQueueUtils from VKeyTest
* Remove the @error exception that was inadvertently pulled in
One of the more significant changes introduced in this PR is that we use
SQL as the backing database in all tests unless otherwise specified,
e.g. by using the TmOverrideExtension. We change various ofy-related
tests to use this.
This includes various changes:
- Deletion of SqlEntity/DatastoreEntity and related classes. Includes
any necessary changes because of that (e.g. getting a nice SQL key on
error in RegistryJpaIO).
- Deletion of classes that used libraries from the init-sql code
(RefreshDnsOnHostRenameAction)
- Removal of the JpaTransactionManager's backup implementation
- Modification of RegistryJpaWriteTest to not use init-sql code
- Removal of the Transaction class and related classes, however it does
not remove the TransactionEntity class as that would require DB
changes
- Removal of anything related to the actual usage of the database
migration schedule or read-only phases
- Various test changes and fixes to account for the differences in SQL
(like how foreign keys need to exist)
This deliberately doesn't do anything to alter the objects actually
stored in the DB yet, just how we use them
This is the only user of the ofy code that will stick around at least
until we move to the new registrar console. By removing references to
the transaction manager, we will be able to delete all the tm code
without interfering with this.
* Remove version pin for java-diff-utils dependency
Latest version of the lib introduces a small behavior change/bug fix.
It no longer ignores empty lines. This actually makes sense.
Update the test data to reflect this change.
The main purpose of this PR is to help debug b/234189023, where a
registrar reported that in sandbox they observed seemingly successful EPP
update responses to delete NS records, which were not actually deleted after
the commands executed.
To actually load the persisted domain resource after an update would
require us to execute another transaction immediately after the update
transaction, and that can only be achieved outside the flow (i.e. in
FlowRunner or EppController); we would also need to test for the type of flow
before logging, which seems unnecessarily complex.
For now we are just adding logs inside the update transaction itself to
validate that:
1. The NS records to delete are as expected.
2. The current NS records are as expected.
3. The new NS records to persist are as expected.
The EPP success reply is the default reply when no errors are thrown in
a transaction. If we see a success reply (which means that the
transaction finished successfully) and the expected logs from the transaction, the
only explanation could be that somewhere in the ORM layer the Java
representation of the entity is different from what is being
presented to the database. I think that would signal a much bigger and more
fundamental problem, which is quite unlikely given how isolated the
issue under consideration is.
In any case we would like to add the logging functionality in sandbox and ask
the registrar to report again when they see similar issues.
Also made some typo and linting fixes.
* Fix some small transactional issues in SQL mode
These weren't caught until I switched the default database type in tests
to be SQL (separate PR). Fortunately these don't seem to be catastrophic
This includes:
- removing the actions that do the replay
- removing the tests for the replay
- removing the ReplayExtension and adjusting the various tests that used
it appropriately
- removing functionality relating to "things that happen during replay",
e.g. beforeSqlSaveOnReplay
This does not include:
- removing the InitSqlPipeline or similar tasks
- removing e.g. SqlEntity (it's used in other places)
- removing Transforms/RegistryJpaIO and other SQL-pipeline-creation code
* Remove brackets around var name in CloudBuild script
Due to a Spinnaker restriction: it cannot handle variable references where the var name has brackets around it.
Added the Spinnaker error message to the comments.
This included removing ofy-specific code from various tests. Also, some
of the other tests (e.g. RdapDomainActionTest) had to be configured to
use only SQL -- otherwise, as it currently stands, they were trying to
use ofy.
We also delete the CreateSyntheticHistoryEntriesAction and pipeline
because they're no longer relevant, and impossible to test (the goal of
the actions was to create objects in ofy, which doesn't happen any
more).
We're running into issues pulling 2.1.3 from maven, possibly due to
vulnerabilities in dependencies, so this updates it to the most recent
version of 2.2.6.
* Add a new recurrenceLastExpansion column to the BillingRecurrence table
This will be used to determine which recurrences are in scope for
expansion. Anything that has already been expanded within the past year is
automatically out of scope.
Note that the default value is set to just over a year ago, so that, initially,
everything is in scope for expansion, and then will gradually be winnowed down
over time so that most recurrences at any given point are out of scope. Newly
created recurrings (after the subsequent code change goes in) will have their
last expansion time set to the same as the event time of when the recurring is
written, such that they'll first be considered in-scope precisely one year
later.
* Summarize schema related tests
Document existing schema-related tests including presubmit tests and
the schema-verify predeployment test newly added to Spinnaker.
* Inject a DomainPricingLogic into ExpandRecurringBillingEventsAction
This will be used in other PRs to set the renewal price correctly based on the
renewal price behavior of the BillingRecurrence event.
Note that, in order for this to work, a not-null constraint has been lifted on
the EPP flow state when the DomainPricingCustomLogic is being constructed, as
the pricing here will occur in a backend action outside the context of any EPP
flow.
* Slightly improve performance of ExpandRecurringBillingEventsAction
We don't need to log every single no-op batch of 50 Recurrences that are
processed (considering we have 1.5M total in our system), and we also don't need
to process Recurrences that already ended prior to the Cursor time (this gets us
down to 420k from 1.5M).
* Disable Ofy tests.
This change just turns off the Ofy tests at the root, by removing processing
for dual tests and disassociating the TestOfyOnly annotation from test
annotations.
This is far less comprehensive than #1631, but it's probably worth entering as
a stopgap solution just because it should speed up our test runs and unblock a
lot of other cleanup work.
* Fix DualDatabaseTestInvocationContextProviderTest
* Optimize RDAP entity event query
For each EPP entity, directly load the latest HistoryEntry per event type
instead of loading all events through the HistoryEntryDao.
Although most entities have a small number of history entries, there are
a few entities with many entries, enough to cause an OutOfMemory error.
We have backend max-instances set to 100, which apparently exceeds the default
quota for GAE. Add info on updating the quota or changing this parameter to
the configuration doc.
* Add batching to ExpandRecurringBillingEventsAction
It's OOMing on trying to load every single BillingRecurrence that needs to be
expanded simultaneously (which is to be expected). So this processes them in
transactional batches of 50.
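The batching pattern is essentially this (names hypothetical): stream the recurrence IDs, then expand them in transactions of 50 rather than loading everything at once.

```java
import com.google.common.collect.Iterables;
import java.util.List;

// Sketch: Iterables.partition() turns the full ID stream into chunks of 50,
// each of which is expanded inside its own transaction.
for (List<Long> batch : Iterables.partition(recurrenceIdsToExpand, 50)) {
  transactionManager.transact(() -> batch.forEach(id -> expandOneRecurrence(id)));
}
```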
* Cleanup gpg-agent instances and home directories
The GpgSystemCommandExtension leaks home directories, but more importantly it
leaks gpg-agent instances. This can cause problems with inotify limits, since
the agent seems to make use of inotify. Do a proper cleanup in afterEach().
* Don't fail if we can't kill the agent
We'll delete the associated code soon enough too, but it's safer to delete the
cron jobs first and run in that state for a week, so we can still run them
manually if need be.
* Reduce the number of manually scaled instances for default/pubapi
This is in the spirit of "not always running significantly over-provisioned",
which helps to save costs and also expose potential scaling issues when they are
still small rather than all at once when they're a big problem.
This can always be reverted if necessary, and can be instantaneously adjusted by
running the `nomulus set_num_instances` command.
* Don't enforce billing account map check on TEST TLDs
This was affecting monitoring (i.e. prober TLDs). Note that test TLDs are
already excluded from the billing account map check in the Registrar builder()
method (see PR #1601), but we forgot to make that same test TLD exclusion in the
EPP flows check (see PR #1605).
* Add test for Java 8 Compatibility
Add a test to check for Java 8 compatibility of jars deployed to
AppEngine.
It is not enough to run existing tests with a Java 8 VM, since many API
jars are not exercised by tests, for example those for GCP services
like the SecretManager.
We take the conservative approach and verify that every class in every
jar is compiled for Java 8.
We would like to re-use the build cache when building RCs for different
environments. There's not much practical use in doing a "clean" for
every build when Gradle should be able to figure out which artifacts
need to be rebuilt. It also does not make sense to build each
environment in a separate step, which also introduces redundancy because
not all artifacts are cached across steps. The build cache is enabled by
default.
Lastly, the cache needs to be inside the /workspace folder, which is the
default persisted storage location.
TESTED=tried to build the RCs on alpha and saved about 10 min.
* Add missing transaction for whois lookups
Nameserver whois lookups are failing under SQL for hosts with superordinate
domains because the query in this case is not done in a transaction. We
missed this during testing because a) we didn't have a test for lookups of
hosts with superordinate domains and b) we missed converting
NameserverWhoisResponseTest to a DualDatabaseTest.
This PR fixes the problem and adds the requisite testing.
* Use a single transaction to get host registrars
* Replace streaming with Maps.toMap()
* Downgrade dependencies that no longer support Java8
Downgrade two dependencies whose latest versions no longer support
Java 8.
A follow-up PR will add Java 8 compatibility checks to presubmit tests.
We have recently started to routinely breach the 1h timeout. Increasing
this value to 2h. We should also look into reusing the artifacts when
building RCs for different environments.
* Use Gradle dependency dynamic versioning
Use dynamic versioning for Gradle dependencies when possible.
Please refer to go/dr-dependency-upgrade for more information about the
automation plan.
This PR calls out all dependencies that must be pinned to specific
versions for various reasons. The remaining ones are converted to
open-ended version ranges ("[version_str,)").
* Check PAK on domain create
* Add unit test
* update docs
* Remove unnecessary setup
* Fix blank line
* Add check and test to all relevant flows
* Change error message
* Ignore version UIDs during txn deserialization
When deserializing transactions for replay to datastore, ignore class version
UIDs that don't match those of the local classes and just use the local class
descriptors instead. This is a simple solution for the problem of persisted
VKeys containing references to classes where the class has been updated and
the serial version UID has changed.
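The usual shape of this fix (a sketch, not necessarily the exact code in this PR) is an ObjectInputStream subclass that swaps the stream's class descriptor for the local one:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

// Sketch: tolerate serialVersionUID drift by deserializing with the local
// class descriptor whenever the named class can be resolved locally.
class VersionTolerantObjectInputStream extends ObjectInputStream {

  VersionTolerantObjectInputStream(InputStream in) throws IOException {
    super(in);
  }

  @Override
  protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
    ObjectStreamClass streamDescriptor = super.readClassDescriptor();
    try {
      ObjectStreamClass localDescriptor =
          ObjectStreamClass.lookup(Class.forName(streamDescriptor.getName()));
      // Use the local descriptor so a mismatched serialVersionUID is ignored.
      return (localDescriptor != null) ? localDescriptor : streamDescriptor;
    } catch (ClassNotFoundException e) {
      // Fall back to the stream's descriptor for classes we can't resolve locally.
      return streamDescriptor;
    }
  }
}
```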
Also add a "replay_txns" command that replays the transactions from a given
start point so we can verify all transactions are deserializable.
TESTED:
Ran replay_txns against all transactions on sandbox beginning with
transaction id 1828385, which includes Recurring billing events containing
both the old and the new serial version UIDs.
* Change check for root directory during rollback
`rollback_tool` tries to infer the root of the nomulus tree by checking for a
directory named "nomulus". This is potentially problematic (and, indeed, was
for me) since there is no guarantee what that directory will be named.
There are a number of features that characterize the root directory. Check
for the presence of the `rollback_tool` wrapper script, as this is both at
root level and tightly coupled to the python code, so hopefully we won't
move it without testing that the script still works.
This will require edits to a substantial number of registrars on sandbox (nearly
all of them) because almost all of them have access to at least one TLD, but
almost none of them have any billing accounts set. Until this is set, any updates
to the existing registrars that aren't adding the billing accounts will cause
failures.
Unfortunately, there wasn't any less invasive foolproof way to implement this
change, and we already had one attempt to implement it on create registrar
command that wasn't working (because allowed TLDs tend not to be added on
initial registrar creation, but rather, afterwards as an update).
* Downgrade Caffeine to 2.9.3
Apparently Caffeine >=3.* requires Java 11, and we're still stuck on Java 8
because of App Engine Standard. Fortunately this doesn't affect the exposed
interface we're using, so we can simply go back to the newest Caffeine version
once Registry 3.0 Phase 3 (GKE migration) is completed.
Now that SQL is the default, we do not need this side job to run
alongside the main one. Its purpose was to validate the BEAM pipeline
while Datastore was primary.
* Create a Dataflow pipeline to resave EPP resources
This has two modes.
If `fast` is false, then we will just load all EPP resources, project them to the current time, and save them.
If `fast` is true, we will attempt to intelligently load and save only resources that we expect to have changes applied when we project them to the current time. This means resources with pending transfers that have expired, domains with expired grace periods, and non-deleted domains that have expired (we expect that they autorenewed).
For some inexplicable reason, the RDE beam pipeline in both sandbox and
production has been broken for the past week or so. Our investigations
revealed that during the CoGroupByKey stage, some repo ID -> revision ID
pairs were duplicated. This may be a problem with the Dataflow runtime,
which somehow introduced the duplicates during reshuffling.
This PR attempts to fix the symptom only by deduping the revision IDs. We
will do some more investigation and possibly follow up with the Dataflow
team if we determine it is an upstream issue.
TESTED=deployed the pipeline and successfully ran sandbox RDE with it.
* Begin migration from Guava Cache to Caffeine
Caffeine is apparently strictly superior to the older Guava Cache (and is even
recommended in lieu of Guava Cache on Guava Cache's own documentation).
This adds the relevant dependencies and switches over just a single call site to
use the new Caffeine cache. It also implements a new pattern: asynchronously
refreshing the cache value starting at half of our configured expiry time. For
frequently accessed entities this will allow us to NEVER block on a load, as the
value will be asynchronously refreshed in the background long before it ever expires
synchronously during a read operation.
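The refresh-ahead pattern looks roughly like this (the key/value types, `Registrar`, `loadRegistrarFromDatabase`, and the `cacheTtl` duration are all illustrative):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

Duration cacheTtl = Duration.ofMinutes(10); // stand-in for the configured expiry

// Sketch: entries are refreshed asynchronously at half the TTL, so hot keys
// are served from memory and never block on a synchronous reload.
LoadingCache<String, Registrar> registrarCache =
    Caffeine.newBuilder()
        .refreshAfterWrite(cacheTtl.dividedBy(2)) // async background reload
        .expireAfterWrite(cacheTtl) // synchronous reload only for cold keys
        .build(key -> loadRegistrarFromDatabase(key));
```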
* Add new columns to BillingEvent.java
* Improve PR and modifyJodaMoneyType to handle null currency in override
* Add test cases for edge cases of nullSafeGet in JodaMoneyType
* Improve assertions
* Remove dos.xml from the configs
We don't have dos config right now, and applying dos from "gcloud app
deploy" is deprecated and has started causing problems.
If we add dos configs, it should be using "gcloud app firewall-rules".
* Build Java8-compatible release
Use the new options.release Gradle property to make sure builds are
compatible with Java 8, which is the runtime on App Engine.
This new property replaces sourceCompatibility, targetCompatibility, and
bootclasspath (wasn't previously set, which is the reason why we
couldn't detect Java9 api usage when building).
* Ignore read-only when saving commit logs
Ignore read-only when saving commit logs and commit log mutations so that we
can safely replicate in read-only mode. This should be safe, as we only ever
get to the point of saving commit logs and mutations when something has
already actually been modified in a transaction, meaning that we should have hit
the "read only" sentinel already.
This also introduces the ability to set the Clock in the
TransactionManagerFactory so that we can test this functionality.
* Changes per review
* Fix issues affecting tests
- Restore clobbered async phase in testNoInMigrationState_doesNothing
- Restore system clock to TransactionManagerFactory to avoid affecting other
tests.
* Change billingIdentifier to BillingAccountMap in invoicing pipeline
* Add a default for billing account map
* Throw error on missing PAK
* Add unit test
Check for a PSQLException referencing a failed connection to "google:5433",
which likely indicates that there is another nomulus tool instance running.
It's worth giving this hint because in cases like this it's not at all obvious
that the other instance of nomulus is problematic.
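The check boils down to walking the exception's cause chain (the matching logic here is a sketch, not the exact implementation):

```java
import org.postgresql.util.PSQLException;

// Sketch: if any cause in the chain is a PSQLException mentioning
// "google:5433", print the hint about a likely second nomulus instance.
static void maybeHintAboutOtherInstance(Throwable t) {
  for (Throwable cause = t; cause != null; cause = cause.getCause()) {
    if (cause instanceof PSQLException
        && String.valueOf(cause.getMessage()).contains("google:5433")) {
      System.err.println(
          "Hint: could not connect to google:5433; another instance of the nomulus"
              + " tool may already be running.");
      return;
    }
  }
}
```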
* Add a no-async actions DB migration phase
This needs to be set several hours prior to entering the READONLY stage. This is
not a read-only stage; all synchronous actions under Datastore (such as domain
creates) will continue to succeed. The only things that will fail are host
deletes, host renames, and contact deletes, as these three actions require a
mapreduce to run before they are complete, and we don't want mapreduces hanging
around and executing during what is supposed to be a short READONLY
period.
* Use UrlFetch for RDE and default TLS (1.2) for other URL connections
This removes the TLS 1.3-settings in the module providers and,
essentially, reverts the changes in #1535 only to the RdeReporter and
RdeReportActionTest
We have a cron job that runs the RDE upload action every 4 hours for all
TLDs. Normally this should be a no-op, because an RDE upload is scheduled
after RDE staging is completed, and when it fails with a non-2XX status it
will retry. However, if for some reason it fails with a 2XX status (like
waiting for the SFTP cursor), it will not retry but will rely on the cron job to
catch up.
With the BEAM RDE pipeline, every staging job saves all its deposits in a
uniquely named folder to avoid the need for a lock, which is not
practical in BEAM. However, the cron job has no way of knowing what the
prefixes are for each TLD, so it will fail in SQL mode.
In this PR we implemented logic to guess what the prefix should be and
use it, if we are in SQL mode and a prefix is not provided.
* Set the initial worker count for the RDE beam pipeline at 24
This likely will speed up the pipeline by skipping the initially slow
process of spinning up instances.
* Fix sporadic SQL Snapshot failure
The PostgreSQL set-snapshot statement (called in the
JpaTransactionManager.setDatabaseSnapshot() method) must be the first
statement in the SQL transaction.
Currently the JpaTransactionManager.transact() method may insert a query for
DatabaseMigrationStateSchedule before the user query when the cache is
empty or the cached value expires.
This PR proactively preloads the cache in RegistryJpaIO to prevent cache
loading inside the transaction.
This PR also changes some DatabaseSnapshotTest tests to be retrying, in
case they run just after the cache expires. (This has happened before in
CI).
The format check script currently outputs "true" if there were files that need
reformatting and "false" if not, which is useful for gradle but less so for
other applications (notably commit hooks). Terminate with an exit code of 1
if the format check fails.
TESTED: Tried this from both a pre-commit hook and from the gradle build.
* Add a "list_txns" to dump Transaction table
Add the list_txns command which can dump the entire contents of the
Transaction table, either in csv format or as human readable transactions.
The CSV format is useful for storing the transaction table at a specific point
in time for later reference without requiring us to repeatedly hit the
replica.
Creating this without tests because this command has a very short shelf-life
and is really only intended to be run by developers. Tested all features
locally.
* Reformatted
* Ignore trivial differences when comparing DBs
Some data differences are due to entity model differences and are also
harmless. We should ignore them when comparing Datastore and SQL.
This PR ignores the following diffs:
- null vs empty collection
- the empty string in the Address.street field, which is a list
* Bump flogger and beam dependency versions
Beam 2.34.0 -> 2.37.0
Flogger 0.7.3 -> 0.7.4
IntelliJ keeps getting confused about which version of Flogger we're
bringing in. Even though we had previously locked Flogger to 0.7.3, for
some reason it was still bringing in the Beam transitive dependency of
0.6.0, which was causing a bunch of class initialization errors.
Bumping Beam to 2.37.0 bumps the transitive dependency to 0.7.4, so we
can always use that.
1. testRun_withPrefix() in RdeUploadActionTest calls a mock lock
handler and does not actually try to read from the fake GCS
implementation, so there's no point in setting it up.
2. Remove an unused field in UploadDatastoreBackupActionTest.
* Remove static methods in back up actions
* Remove BigqueryPollJob helper class
* Add schedule time in task comparison
* Change payload type from byte[] to ByteString
* Fix a subtle issue in BRDA copy caused by Cloud Tasks
After the Cloud Tasks migration and #1508, the BRDA copy job now
routinely fails on the first try because the revision update is not
committed by the time the Cloud Tasks job enqueued in the same
transaction runs for the first time. This is because the enqueueing is
a side effect and not part of the transaction. The job eventually
succeeds because of retries.
This PR attempts to mitigate the initial failure by adding a delay to
the enqueued job, and by checking the cursor in the job itself to prevent
it from running before the transaction is committed.
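The mitigation amounts to something like this when building the task (the one-minute delay is illustrative; the cursor re-check happens in the handler itself):

```java
import com.google.cloud.tasks.v2.AppEngineHttpRequest;
import com.google.cloud.tasks.v2.Task;
import com.google.protobuf.util.Timestamps;
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Sketch: schedule the enqueued task for slightly after the enclosing
// transaction should have committed, instead of letting it run immediately.
static Task makeDelayedBrdaCopyTask(AppEngineHttpRequest request, Clock clock) {
  Instant scheduleTime = clock.instant().plus(Duration.ofMinutes(1));
  return Task.newBuilder()
      .setAppEngineHttpRequest(request)
      .setScheduleTime(Timestamps.fromMillis(scheduleTime.toEpochMilli()))
      .build();
}
```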
* Fix issues with saving and deleting gap records
Datastore limits us to mutating up to 25 records per transaction. We
sometimes exceed that when deleting expired gap records. In addition, it is
theoretically possible for us to accumulate enough continuous gap records to
exceed this count while replaying the original transaction.
Deal with deletion by breaking up the gap records to be deleted into a batch
size that is small enough to be deleted transactionally (in practice, we don't
much care about the transactionality but it doesn't seem like we can delete
batches without it).
Deal with the possibility of too many additions by always breaking out gap
record storage and last transaction number updates into their own
transaction(s) (separate from the replay of the original SQL transaction).
These are useful for the purposes of filtering by one-time/multi-use tokens, and
for determining which one-time tokens have been used (and if so, for which
domain).
* Track and replay Transaction table gaps
Id gaps in the Transaction table can be the result of transactions committed
out of order. To deal with this, keep track of gaps for up to five minutes
and check to see if they've been back-filled prior to applying the next batch
of transactions during replay.
* Changes for review
* Calculate gap expiration time before gap queries
* Reformat.
* Add 3 more SQL indexes to the Host table
These indexes on creationTime, deletionTime, and currentSponsorRegistrarId are
present on the other two EPP resource tables (Domain and Contact), and are
useful for a wide variety of operations/analytics queries.
* Improve cache loading in Registries.java
The loader for the TLD cache in Registries.java unnecessarily reads from
another cache when running with SQL, potentially triggering additional
database access. This code runs in the whois query path, and contributes
to the high latency in sandbox.
The query analyzer identified this as a missing index on the BillingEvent table,
and I added it for recurrences and cancellations as well, as it's likely to be a
problem for them too. "Give me all the billing events associated with a given
domain by its repo ID" seems like a pretty common use case for the DB (and does
appear to be used by our invoicing pipeline).
This is a follow-up to PR #1545.
These indexes were identified as missing by PostgreSQL's query analyzer in our
sandbox environment (where we get enough realistic EPP traffic to identify these
deficiencies).
Note that a lot of the new indexes being named have to use the DB representation
of the column name because they are either embedded or subclassed entities,
whereas most of the existing ones are able to simply refer to Java field names.
This is the Java schema follow-up PR to PR #1541, which is what added the
actual DB changes through Flyway scripts.
* Reorganize new schema changes
Reorganized new schema changes and make each flyway script update a
single table.
Each flyway script is executed in a single database transaction so that
the script can be rolled back in one shot. It acquires a shared lock on
all tables touched by the script. This is deadlock-prone because in a
busy database, there may be user queries that attempt to lock the same
set of tables, but in different order. By limiting each script to one
table, we avoid the problem.
We should have a presubmit check to enforce this rule.
All changes have been deployed to Sandbox out-of-band. When doing so,
we changed all CREATE INDEX statements to CREATE INDEX IF NOT EXISTS.
Future deployments should be able to proceed normally.
* Revise Host index on inet_addresses
The index on the 'inet_addresses' array column should be of gin or gist
type, which indexes individual array elements. We use gin for now, since
host updates are infrequent and gin has better accuracy.
Since flyway script V108__... has not been deployed, we edit the file
in place instead of adding a new script.
This will be followed up with a modified query that can take advantage
of the gin index. Until then we don't expect to see a performance
improvement.
The suspected bottleneck query in the whois path is:
select * from "Host" where 'non-ip-string' = any(inet_address) and
deletion_time < now();
It needs to be revised into:
select * from "Host" where array['non-ip-string'] <@ inet_address and
deletion_time < now();
The combined change reduces the query time from 90ms to 30ms in Sandbox,
and from 150ms to 40ms in production.
It is unclear if this solves all problems with whois latency.
* Add deletionTime/inetAddresses indexes to Host table to support WHOIS
Weimin identified these as missing, and being the cause of slowdowns in
NameserverLookupByIpCommand that we're seeing in sandbox.
This is the first of two PRs, adding just the Flyway/schema changes. The
second PR adding the Java object model changes is #1547.
* Add domainRepoId indexes to billing events #1544
The query analyzer identified this as a missing index on the BillingEvent table,
and I added it for recurrences and cancellations as well, as it's likely to be a
problem for them too. "Give me all the billing events associated with a given
domain by its repo ID" seems like a pretty common use case for the DB (and does
appear to be used by our invoicing pipeline).
This is the first of two PRs that makes just the DB changes. The second PR
(#1544) will add the Java code changes, and will be committed after this one is
deployed.
* Add 9 more indexes to SQL schema
These indexes were identified as missing by PostgreSQL's query analyzer in our
sandbox environment (where we get enough realistic EPP traffic to identify these
deficiencies).
This is the first of two PRs -- the second PR (#1540) will be merged only after
this one is live in production. Note that this is the PR that actually modifies
the database though, so once this one is deployed we will already have the
benefit of the new indexes.
- Use the standard HttpsURLConnection to write/read data
- Rewrite RdeReporter, Nordn*Action, and Marksdb classes and related
tests to conform to the new format
- Remove FakeURLFetchService and ForwardingUrlFetchService as they weren't used
- Refactor UrlFetchException to UrlConnectionException
- Refactor UrlFetchUtils to UrlConnectionUtils
I will need to test this on Alpha. Fortunately the connections that
don't require auth (e.g. TMDB downloading) should be testable.
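For reference, the write/read round trip with the standard HttpsURLConnection
looks roughly like this (a minimal sketch; the content type and payload are
placeholders):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

// Minimal sketch of a POST round trip with HttpsURLConnection.
static byte[] postBytes(URL url, byte[] payload) throws Exception {
  // Safe cast for https:// URLs.
  HttpsURLConnection connection = (HttpsURLConnection) url.openConnection();
  connection.setRequestMethod("POST");
  connection.setDoOutput(true);
  connection.setRequestProperty("Content-Type", "application/octet-stream");
  try (OutputStream out = connection.getOutputStream()) {
    out.write(payload);
  }
  int code = connection.getResponseCode();
  if (code != HttpsURLConnection.HTTP_OK) {
    throw new RuntimeException("Unexpected response code: " + code);
  }
  try (InputStream in = connection.getInputStream()) {
    return in.readAllBytes();
  }
}
```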
* Allow replicateToDatastore to skip gaps
As it turns out, gaps in the transaction id sequence number are expected
because rollbacks do not roll back sequence numbers.
To deal with this, stop checking for gaps.
This change is not adequate in and of itself, as it is possible for a gap to
be introduced if two transactions are committed out of order of their sequence
number. We are currently discussing several strategies to mitigate this.
* Remove println, add a record verification
* Fix hanging test
Tests using the TestServerExtension may hang forever if an underlying
component (e.g., the testcontainer for psql) fails. This may be the cause
of some Kokoro runs that timed out after three hours.
* Add a tools command to launch SQL validation job
Stop using Pipeline.run().waitUntilFinish in
ValidateDatastorePipeline. Flex templates do not support a blocking
wait in the main thread.
This PR adds a new ValidateSqlCommand that launches the pipeline and
maintains the SQL snapshot while the pipeline is running.
This PR also added more parameters to both ValidateSqlCommand and
ValidateDatastoreCommand:
- The -c option to supply an optional incremental comparison start time
- The -r option to supply an optional release tag that is not 'live',
e.g., nomulus-DDDDYYMM-RC00
If the manual launch option (-m) is enabled, the commands will print the
gcloud command that can launch the pipeline.
Tested with sandbox, qa and the dev project.
* Don't reset the update time for TLD updates
It turns out that the reason the Registrar update timestamp isn't updated
for some of the tests is that the record is rewritten unchanged. We can
avoid this problem by not trying to update the registrar to the same value.
So in this case, if the registrar already contains the TLD we're adding, don't
try to add it.
* Fix entity delete replication, compare db @ replay
Replay tests currently only verify that the contents of a transaction
can be successfully replicated to the other database. They do not verify that
the contents of both databases are equivalent. As a result, we miss any
changes omitted from the transaction (as was the case with entity deletions).
This change adds a final database comparison to ReplayExtension so we can
safely say that the databases are in the same state.
This comparison is introduced in part as a unit test for the one-line fix for
replication of an "entity delete" operation (where we delete using an entity
object instead of the object's key) which so far has only affected PollMessage
deletion. The fix is also included in this commit within
JpaTransactionManagerImpl.
* Exclude tests and entities with failing comparisons
* Get all tests to pass and fix more timestamps
Fix most of the unit tests that were broken by this change.
- Fix timestamp updates after grace period changes in DomainContent and for
TLD changes in Registrar.
- Reenable full database comparison for most DomainCreateFlowTest's.
- Make some test entities NonReplicated so they don't break when used with
jpaTm().delete()
- Disable checking of a few more entity types that are failing comparisons.
- Add some formatting fixes.
* Remove unnecessary "NoDatabaseCompare"
It turns out that after other fixes/elisions we no longer need these for
any tests in DomainCreateFlowTest.
* Changes for review
* Remove old "compare" flag.
* Reformatted.
* Make a few quality-of-life improvements in CloudTasksUtils
1. Update the method names. There are too many overloaded methods and it
is hard to figure out which one does which without checking the
javadoc.
2. Add a method to the task matcher to specify the delay time as a
DateTime, so the caller does not need to convert it to a Timestamp (see
the sketch below).
3. Remove the explicit dependency on a clock when enqueueing a task with
a delay; the clock is now injected directly into the util instance
itself.
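The DateTime overload presumably reduces to a conversion like this (a sketch;
assumes Cloud Tasks' protobuf Timestamp for the schedule time):

```java
import com.google.protobuf.Timestamp;
import org.joda.time.DateTime;

// Sketch: convert a Joda DateTime into the protobuf Timestamp that the
// Cloud Tasks API expects for a task's schedule time. Assumes post-epoch
// times.
static Timestamp toScheduleTime(DateTime when) {
  long millis = when.getMillis();
  return Timestamp.newBuilder()
      .setSeconds(millis / 1000)
      .setNanos((int) (millis % 1000) * 1_000_000)
      .build();
}
```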
* Disable prober data deletion cron job in prod & sandbox
This is going to unnecessarily make the database migration more complex, and we
don't need them that badly. We'll re-enable these cron jobs once we've written
the new version of this action that handles Cloud SQL correctly (the current
version only does Datastore anyway).
* Ignore prober data when comparing databases
Completely ignore prober data when comparing Datastore and SQL.
Prober data deletions are not propagated from Datastore to SQL. It is
difficult to distinguish soft-deletes from normal updates, therefore
difficult to avoid false positives when looking for differences.
This is necessary because the Cloud Tasks API does not participate in the
transaction, so it's possible that multiple tasks might end up being enqueued.
We need to be able to handle them.
* Fix update timestamps for DomainContent types
We expect update timestamps to be updated whenever a containing entity is
modified and persisted, but unfortunately Hibernate doesn't seem to do this --
instead it appears to regard such an entity as unchanged.
To work around this, we explicitly reset the update timestamp whenever a
nested collection is modified in the Builder.
Note that this change only solves the problem for DomainContent. All other
entities containing UpdateAutoTimestamp will need to be audited and
instrumented with a similar change.
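In builder form, the workaround looks something like this (a sketch; the
field names and the UpdateAutoTimestamp.create(null) convention for
"regenerate at the next persist" are assumptions):

```java
import com.google.common.collect.ImmutableSet;

// Sketch (names assumed): a builder setter that explicitly resets the update
// timestamp when a nested collection changes, since Hibernate otherwise
// treats the owning entity as unchanged and never re-stamps it.
public Builder setGracePeriods(ImmutableSet<GracePeriod> gracePeriods) {
  getInstance().gracePeriods = gracePeriods;
  // Null is assumed to mean "populate from the current time at persist".
  getInstance().updateTimestamp = UpdateAutoTimestamp.create(null);
  return this;
}
```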
* Fix a handful of tests broken by this change
* Reformatted.
* Use CloudTaskUtils to enqueue
* Add CloudTasksUtilsModule to FrontendComponent
* Fix Uri query issue
* Remove header and check service in matcher
* Use a ThreadLocal boolean in TestServer to determine enqueueing
* Extract enqueuing and email sending from tm().transact()
* Add action to DB comparison pipeline
Add a backend Action in the Nomulus server that launches the pipeline for
comparing Datastore (secondary) with Cloud SQL (primary).
* Save progress
* Revert test changes
* Add pipeline launching
* Fix create/update timestamp replay problems
When CreateAutoTimestamp and UpdateAutoTimestamp are inserted into a
Transaction, their values are not populated in the same way as when they are
stored in the course of an SQL commit. This results in different timestamp
values between SQL and datastore during the SQL -> DS replay.
Fix this by providing these values from the JPA transaction time when we're
doing transaction serialization.
This change also removes the initialization of the Ofy clock in
ExpandRecurringBillingEventsActionTest. It's not necessary, as the
ReplayExtension already takes care of this, and doing it after the
ReplayExtension (as we were) breaks a test now that the update timestamps
are correct.
* Remove ReportingUtils and use CloudTaskUtil to enqueue
* Use schedule time helper to enqueue and update schedule time comparison
* Fix comment, indentation in gradle file and improve time comparison
* Change from TaskQueueUtils to CloudTasksUtils in LoadTestAction
* Put X_CSRF_TOKEN in task headers
* Fix schedule time and gradle issue
* Remove TaskQueue constant dependency
* Double run seconds
* Add comment for X_CSRF_TOKEN
* Fix flaky RdeStagingActionDatastoreTest
Fixed the most common cause that makes one method flaky (Clock and
timestamp problem). Added a TODO to rethink test case.
Also added notes on tasks potentially enqueued multiple times.
The gcloud command does some weird stuff with sorting when a custom format
is used. Here we instead rely on the Linux sort and head commands to sort the
versions list.
* Add an index on Host.host_name column
This field is queried during host creation and needs an index to speed
up the query.
Since Hibernate does not explicitly refer to indexes, we can change the
code and schema in one PR.
* Use the built-in replicaJpaTm() in RDAP
This includes a test for the replica-simulating transaction manager and
removal of any replica-specific code in RDAP tests, because it's
unnecessary due to the existing tests.
* Add DS validation to match Cloud DNS
* Add checks to flows
* Add some flow tests
* Add tests for DomainCreateFlow
* Add tests for UpdateDomainCommand
* Fix docs test
* Small fixes
* Remove builder from tests
Standard mode will determine the watermarks based on the cursors and
kick off subsequent uploading steps. In order to run both the Beam and
the Mapreduce pipeline in parallel, we need to allow setting the beam
parameter when in standard mode. This change should have been part of
https://github.com/google/nomulus/pull/1500.
The cached methods are only used in situations where we don't really
care about being 100% synchronously up to date (e.g. whois), and they're
not used frequently anyway, so it's safe to use the replica in these
locations.
* Compare migration data with SQL as primary DB
Add a BEAM pipeline that compares the secondary Datastore against SQL.
This is a dumb pipeline to be launched by a driver (in a followup PR).
Manually tested pipeline in sandbox.
Also updated the ValidateSqlPipeline and the snapshot finder class so
that an appropriate Datastore export is found (one that ends before the
replay checkpoint value).
We can use it in more places later, but this can serve as a template. We
should inject the connection to the read-only replica (only created
once) into the constructor of the action, then use that instead of the
regular transaction manager.
We add a transaction manager that simulates the read-only-replica
behavior for testing purposes as well.
In addition, we set the transaction isolation level to READ COMMITTED
for this transaction manager (this is fine since we're never writing to
it). Postgres requires this for replica SQL access (it fails if we try
to use SERIALIZABLE transactions). We didn't see this with the pipelines
before since those already had transaction isolation level overrides.
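Usage might look roughly like the following sketch (the action name,
qualifier binding, and transaction-manager methods are assumptions for
illustration):

```java
import javax.inject.Inject;

// Sketch (names assumed): the action receives the replica transaction
// manager once, via its constructor, instead of calling a global tm().
public final class ExampleWhoisAction {

  private final JpaTransactionManager replicaTm;

  @Inject
  ExampleWhoisAction(JpaTransactionManager replicaTm) {
    // Assumed to be bound to the read-only replica by a Dagger qualifier.
    this.replicaTm = replicaTm;
  }

  String lookupRegistrarName(String clientId) {
    // Runs at READ COMMITTED on the replica; never writes.
    return replicaTm.transact(() -> loadRegistrarName(clientId));
  }

  private String loadRegistrarName(String clientId) {
    // ... actual query elided ...
    return clientId;
  }
}
```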
* CommitLog handling code should call ofyTm
The tm() call will use the JPA transaction manager after the switch-over to
SQL. These calls would lose their transaction semantics.
Both actions are to be invoked after the switchover in case we have to
switch back to Datastore as primary.
* Only compare recent changes in Datastore and SQL
When comparing Datastore and SQL, ignore older History and EPP resource
objects. This cuts the run time in half compared with a full comparison.
The intention is to run a full comparison before the switch-over from
Datastore to SQL, and run this incremental comparison during the down
time.
The incremental comparison takes about 25 minutes in production.
Performance can be improved further by filtering out older billing
events (OneTime and Cancellation). However, we don't think further
optimization is worth the effort (considering that Recurring events
cannot be filtered since they are mutable but without lastUpdateTime).
Verified in Sandbox and prod with and without time filter.
We already have ValidateEscrowDepositCommand to check for internal
reference consistency of two deposits, i.e., making sure that all
contacts and hosts referenced by domains exist in the same deposit.
Therefore to compare whether two deposits are equal we only need to make
sure that they contain the same domains and registrars, assuming they
both pass the validation. We don't compare their contents directly
because the MapReduce deposit contains all contacts and domains whereas
the Beam deposit only contains referenced ones, making a direct
comparison impossible.
* Speed up updating of premium lists
There are two parts to this:
1. Don't load the premium entries in the command prompt (this isn't
necessary and we didn't display that information anyway).
2. Set a proper batch size (rather than just 1) when saving all the
premium entries. This means that we generate only one INSERT statement
rather than N statements.
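The batching boils down to flushing and clearing the persistence context
every N entities, letting Hibernate group the inserts (a sketch; assumes
hibernate.jdbc.batch_size is configured to match):

```java
import java.util.List;
import javax.persistence.EntityManager;

// Sketch: persist premium entries in batches so Hibernate can group the
// inserts, instead of issuing one single-row INSERT per entry.
static void saveEntries(EntityManager em, List<?> entries) {
  final int batchSize = 500; // illustrative; match hibernate.jdbc.batch_size
  int i = 0;
  for (Object entry : entries) {
    em.persist(entry);
    if (++i % batchSize == 0) {
      em.flush(); // push the current batch to the DB
      em.clear(); // detach to keep the session small
    }
  }
  em.flush(); // push the final partial batch
}
```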
* Allow usage of a read-only Postgres replica
This adds the Dagger provider code for both the regular and the BEAM
environments, which are similar but not quite the same.
In addition, this demonstrates usage of the replica DB in the
RdePipeline. I tested this on alpha with a modified version of the
RdePipeline that attempts to write some dummy values to the database and
it failed with the expected message that one cannot write to a replica.
* Resolve ResaveEntityAction related conflicts
* Replace string with existing constants
* Remove solved TODOs related to ofy string to new vkey string
* Add a TODO for clean up
* Fix missing annotation
* Make not logged in errors take precedence over extension errors
This is the right order to do the checks in, because if the registrar isn't
logged in (or their login failed) then they will have an empty set of declared
extensions, so any attempt to use an extension will throw a "Service
extension(s) must be declared at login" error. This is potentially misleading
because the actual error in this situation is that the registrar isn't logged
in at all.
This also fixes some flows that weren't declared final (but should be), and
removes final from methods declared on final classes, where it is superfluous.
* Don't throw errors when existing premium list is empty
This state is possible to get into when things go wrong and it shouldn't prevent
saving new revisions of the list. Note that it will continue to throw errors if
you attempt to save a new revision that is blank (which is usually a mistake).
See http://b/211774375
* Make premium list saving run as a single transaction
This fixes the bug where the new revision is saved, but then execution gets
halted for some reason (e.g. request timeout) before the entries finish saving,
which leaves the DB in a bad state with a new top revision containing zero
entries, thus making everything standard.
* Add pending action extension to server update poll messages
This is necessary for the poll messages to contain the necessary context
explaining what domain name the relevant statuses were being added/removed
to/from.
* Always use JPA TM on Beam
Beam does not have access to Datastore. Using Ofy on Beam always results
in an error. Normally we should use the database migration state schedule to
determine which TM to use, but on Beam there's no point in doing so. By
hard-coding the TM on Beam to be SQL, we can start testing features before
we migrate to SQL mode, for example the new RDE pipeline.
Also made a change to where the manual deposits are stored. It made more
sense to store them under manual/[directory]/[jobname]/ instead of
[jobname]/manual/[directory]/.
TESTED=deployed the pipeline on production and ran a job.
* Small fixes to show_upgrade_diffs
- fix fetch for an existing directory (we can't fetch to local "master"
branch, use "origin/master" instead).
- add a newline after "removed" entries.
This version of Beam does not have an explicit dependency on log4j.
There are a couple of other things that need to change due to the
upgrade.
1) The new version pulls in a dependency that is not on Maven Central
but on packages.confluent.io, so we need to explicitly add this repo.
2) The new version has a dependency on flogger 0.6 and above, which removed
the LoggerConfig class (see google/flogger#142).
We therefore backported the class. In the long term we should do what
was suggested in the issue and use the normal JDK Logger config
directly.
3) The InitSqlPipeline dependency graph also needs to be updated.
* Ignore Prober related entities when comparing db
Deletions of prober entities are not propagated to SQL, resulting in two
types of mismatches: an entity only exists in SQL, or copies of an entity
differ in deleteTime. Neither case should count as an error.
* Make ImmutableObject.toString deterministic
Remove the identity hash from the output. There is no use case
(including debugging) for it.
Removing it allows us to also remove some overriding implementations in
subclasses, and may also simplify tests.
* Improve logging/comments for commit log forks
It looks like the diff file lister is doing a second constructDiffSequence()
when a commit diff file is missing from the final sequence for purely
informational purposes. However, this purpose wasn't clear when investigating
an actual case of this.
This PR adds another warning to hopefully make the log output a bit more
useful, and also promotes the "gap" log message to a warning and adds a
comment indicating the purpose of the second constructDiffSequence().
This adds two new options:
1) An option to run RDE in lenient mode.
2) An option to run RDE with the new Beam pipeline regardless of the datastore setting.
* Copy into PersistentSets in Domains if applicable
This is similar to https://github.com/google/nomulus/pull/1456
It is possible that in some cases we could get an exception:
Caused by: org.hibernate.HibernateException: A collection with cascade="all-delete-orphan" was no longer referenced by the owning entity instance: [parent]
The main cause of this, according to research (StackOverflow :P) is that
when Hibernate is calling the setters for these sets of children it's
losing the connection to the previously-managed child entity (which it
needs, in order to know how to delete orphans). Thus, the solution is to
maintain the same instance of the persistent set and just add/remove
to/from it as necessary.
This is complicated by the fact that sometimes the setter is given the
persistent set (the one we want to keep) and sometimes (?) it isn't.
In replay (and possibly in other cases) we're getting an exception:
Caused by: org.hibernate.HibernateException: A collection with cascade="all-delete-orphan" was no longer referenced by the owning entity instance: google.registry.model.domain.DomainHistory.internalDomainTransactionRecords
The main cause of this, according to research (StackOverflow :P) is that
when Hibernate is calling the setters for these sets of children it's
losing the connection to the previously-managed child entity (which it
needs, in order to know how to delete orphans). Thus, the solution is to
maintain the same instance of the persistent set and just add/remove
to/from it as necessary.
This is complicated by the fact that sometimes the setter is given the
persistent set (the one we want to keep) and sometimes (?) it isn't. We
will need to try this out to be sure.
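In sketch form (the field and type names are taken from the exception above;
the rest is assumed), the in-place approach is:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch (names assumed): mutate Hibernate's managed collection in place
// rather than replacing it, so the all-delete-orphan tracking survives.
private Set<DomainTransactionRecord> internalDomainTransactionRecords;

void setTransactionRecords(Set<DomainTransactionRecord> newRecords) {
  if (internalDomainTransactionRecords == null) {
    // No managed set yet; safe to install a fresh one.
    internalDomainTransactionRecords = new HashSet<>(newRecords);
  } else {
    // Keep the managed instance; make its contents equal to newRecords.
    internalDomainTransactionRecords.retainAll(newRecords);
    internalDomainTransactionRecords.addAll(newRecords);
  }
}
```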
See https://b.corp.google.com/issues/208629747 for details; this brings
in an old Gradle version as a transitive dependency. Version 2.x of the
task-tree plugin uses Gradle 6.8 (or higher).
The cardinality for the paths is unbounded, and could generate a huge
amount of metrics if someone is scanning our web WHOIS endpoint.
See b/209488119 for an example of such a sudden increase in metric volume.
* Add replicateToDatastore to non-prod cron files
This shouldn't do anything yet (since ReplicateToDatastoreAction checks the
migration state before doing anything) but we'll want to have this in
place.
* Filter out empty dsData objects, not just null ones
Hibernate/SQL will get mad if the digest is null or empty, and
previously we only checked for null. We should filter out empty digests as
well.
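Something along these lines (a sketch; the entity type name and a byte-array
digest are assumptions):

```java
import static com.google.common.collect.ImmutableSet.toImmutableSet;

import com.google.common.collect.ImmutableSet;
import java.util.Set;

// Sketch: drop dsData entries whose digest is null *or* empty before
// handing them to Hibernate.
static ImmutableSet<DomainDsData> filterEmptyDsData(Set<DomainDsData> dsData) {
  return dsData.stream()
      .filter(ds -> ds.getDigest() != null && ds.getDigest().length > 0)
      .collect(toImmutableSet());
}
```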
* Properly handle Joda Money in JPA
Joda Money has BigDecimal as amount, which is mapped to a numeric(19,2)
column in the database. As a result, a Money amount loaded from the DB has
scale 2. This becomes a problem with currencies such as JPY, which
require the scale to be 0. To properly load a currency, we must adjust the
scale post-load.
The current approach, which uses Hibernate component mapping, puts the
burden of post-load cleanup on each entity type that uses Money. It is
easy to forget this, as we just discovered.
This PR uses a CompositeUserType to map Money. It adjusts the scale
properly when loading Money instances. Although CompositeUserType appears
to be deprecated in Hibernate 6, it is the only proper solution right
now for mapping non-owned classes.
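The post-load fix itself is small; a sketch of what the user type has to do
when reading an amount back:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import org.joda.money.CurrencyUnit;
import org.joda.money.Money;

// Sketch: a numeric(19,2) column always loads with scale 2, so rescale to
// the currency's own scale (e.g. 0 for JPY) before building the Money.
static Money load(CurrencyUnit currency, BigDecimal rawAmount) {
  BigDecimal rescaled =
      rawAmount.setScale(currency.getDecimalPlaces(), RoundingMode.UNNECESSARY);
  return Money.of(currency, rescaled);
}
```

Without the rescale, Money.of throws for JPY because the loaded amount's
scale (2) exceeds the currency's (0).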
This is what's causing https://b.corp.google.com/issues/208274109, where
there are DTR rows with null foreign key values.
We should probably wait to make the columns officially non-null until we
get this in and verify that we can do so.
* Write commit logs during SQL->DS replay
Previously, we had no way to ignore read-only mode while still writing
commit log backups. Now, we added this so we can write commit logs in
the SQL->DS replay.
Note:
- When moving to either of the DATASTORE_PRIMARY stages, one must
manually set the SqlReplayCheckpoint first. We don't write to SQL with
backup in this stage because we already wrote the transaction in
question to Datastore. The fact that we manually set the replay
checkpoint means that we'll ignore the extra commit logs that might
otherwise cause problems if we switched back and forth from
DATASTORE_PRIMARY to SQL_PRIMARY.
- The commit logs written during the SQL_PRIMARY phase will, ideally, be
unused. We write them here only so that in the event of a rollback to
Datastore, we will have them for RDE purposes.
This is a result of bad data (we should never allow a null digest), and
we'll need to fix that separately, but this allows us to not fail on
it during replay.
* Add NotLoggedInException tests to flows and flow docs
This wasn't included in flows.md before because the test existed in
ResourceFlowTestCase. So even though the exception could be thrown and
even though this was tested, it wasn't picked up in the documentation
because the documentation is picked up from the corresponding concrete
test class.
* Validate SQL with Datastore being primary
Validates the data asynchronously replicated from Datastore to SQL.
This is a short term tool optimized for the current production database.
Tested in production.
We want to keep the read-only-mode exception as an unchecked exception,
so we introduce a temporary check in the EppController that provides a
specific error message for this situation (rather than letting it fall
through to the generic "command failed" messaging).
* Replace with stringify() and VKey.create(string)
* Convert implicit cases of VKey.fromWebsafeKey(string)
* Convert from Key to VKey to use stringify()
* Modify existing code to show correct string representation of a key
* Use VKey.create(websafeKey) to get ofy key in ResaveEntitiesCommand
* Add TODO note in CommitLogMutation and determine if key string should be modified
* Revert from stringify() to getOfyKey().getString()
* Add bug ids to TODOs
* Ignore read-only mode in SQL->DS replication process
We need to be able to save indices and save data about the replication
even when we're in read-only mode.
We can handle it the same way that we handle UpdateAutoTimestamp, where
we simply populate it in SQL if it doesn't exist. This has the following
benefits:
1. The converter is unnecessary code
2. We get non-null column definitions for free (overridden in
EppResource to allow null creation times so that legacy *History objects
can contain null in that field).
3. More importantly, this allows for proper SQL->DS replay. If the
field is filled out using a converter (as before this PR) then the field
is only actually filled out on transaction commit (rather than when the
write occurs within the transaction). This means that when we serialize
the Transaction object during the transaction (the data that gets
replayed to Datastore), we are crucially missing the creation time.
If the creation time is written on commit, we have to start a new
transaction to write the Transaction object, and it's an absolute
necessity that the record of the transaction be included in the
transaction itself so as to avoid situations where the transaction
succeeds but the record fails.
If the field is filled out in a @PrePersist method, crucially that
occurs on the object write itself (before transaction commit).
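A sketch of the @PrePersist form (the names here are assumed, not the actual
implementation):

```java
import javax.persistence.PrePersist;

// Sketch (names assumed): stamp the creation time when the write occurs,
// inside the transaction, so that serializing the Transaction object sees
// the value, instead of relying on a converter that only runs at commit.
@PrePersist
void stampCreationTime() {
  if (creationTime.getTimestamp() == null) {
    // jpaTm().getTransactionTime() is assumed to return the JPA
    // transaction time used elsewhere in this change.
    creationTime = CreateAutoTimestamp.create(jpaTm().getTransactionTime());
  }
}
```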
The original RDE pipeline was a direct translation of the App Engine
MapReduce logic. It turned out to be too slow (taking more than a day to
run) due to the way it finds the most recent history entry.
This PR overhauled the pipeline by using embedded EPP resource entities
inside history entries (only available in SQL) and finding the most
recent entries using the SQL engine. It cuts the run time down to ~2h.
Note that there are quota limits on the CPU cores and external IP
addresses for a given GCP region inside a project, which will need to
accommodate the resource requirements for the pipeline. More details are
provided in comments.
Also merged the update cursor stage and enqueue next action stage in
RdeIO so that they can be done within a transaction, same as how
MapReduce handles them.
* Change TaskOptions to Task in CommitLogFanoutAction
* Add a createTask method that takes clock and jitterSeconds
* Change CreateTask parameter type and improve test cases
* Improve comments and test cases
* Improve test cases that handle jitterSeconds
* Grandfather in old data for one-time billing event requirement
We have data from 2018 and earlier where we didn't consistently set periodYears
for OneTime BillingEvents with certain reasons. This grandfathers in that old
data so that we can successfully move it over to Cloud SQL for now, then we can
later run a query that will backfill it, after which we can then tighten up the
requirement again. Note that the requirement is still being enforced for all
billing events from 2019 onwards.
This also improves the handling of validation, by adding a private field to the
Reason enum rather than creating a throwaway inline ImmutableSet in the
Builder.
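The enum-field approach reads roughly like this (the reasons and values are
illustrative, not the actual set):

```java
// Sketch (values illustrative): each Reason knows whether it requires
// periodYears, replacing a throwaway inline ImmutableSet in the Builder.
enum Reason {
  CREATE(true),
  RENEW(true),
  TRANSFER(true),
  SERVER_STATUS(false);

  private final boolean requiresPeriodYears;

  Reason(boolean requiresPeriodYears) {
    this.requiresPeriodYears = requiresPeriodYears;
  }

  boolean requiresPeriodYears() {
    return requiresPeriodYears;
  }
}
```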
BSD sed requires a parameter to -i to indicate the backup suffix. By
adding a blank suffix the sed command works on both Linux and macOS.
* Make TaskMatcher default to POST methods
TaskOptions.Builder.withUrl() defaults to POST methods. Therefore, it seems
reasonable to verify that task queue methods are using the POST method,
especially given that the method must now be identified explicitly when using
CloudTaskUtils. This check would have guarded against the bug fixed by #1413.
* Elaborate on comment
* Further improved the comment
* Remove the ineffective SQL injection check
Remove the ineffective SQL-injection attack check in go/r3pr/954. It is
quite restrictive, causing a long exempt list. It also doesn't protect
queries made through helpers such as QueryComposer etc.
We will start from scratch for a new solution.
* Add the Cloud SQL queries for transaction reports
* Add the remaining queries
* Some query fixes
* Fix comments
* Fix indentation in total_nameservers
* Fix indentation on other Case condition
* Fix InitSqlPipeline regarding synthesized history
There are a few bad domains in Datastore that we hardcoded to ignore
during SQL population. They didn't have history so we didn't try to
filter when writing history.
Recently we created synthesized history for domains, including the bad
domains. Now we need to filter History entries.
* Support shared database snapshot
Allow multiple workers to share a CONSISTENT database snapshot. The
motivating use case is SQL database snapshot loading, where it is too
slow to depend on one worker to load everything.
This currently is postgresql-specific, but will be improved to be
vendor-independent.
Also made sure AppEngineEnvironment.java clears the cached environment
in all cases when tearing down.
* Update terraform files and instructions
Update proxy terraform files based on current best practices and allow
exclusion of forwarding rules for HTTP endpoints. Specifically:
- Add a "public_web_whois" input to allow disabling the public HTTP
whois forwarding.
- Add "description" fields to all variables.
- Move outputs of the top-level module into "outputs.tf".
- Auto-reformat using hclfmt.
* Make entities serializable for DB validation
Make entities that are asynchronously replicated between Datastore and
Cloud SQL serializable so that they may be used in BEAM pipeline based
comparison tool.
Introduced an UnsafeSerializable interface (extending Serializable) and
added to relevant classes. Implementing classes are allowed some
shortcuts as explained in the interface's Javadoc. Post migration we
will decide whether to revert this change or properly implement
serialization.
Verified with production data.
This is used for the replay locks so that Beam pipelines (which will be
used for database comparison) can acquire / release locks as necessary
to avoid database contention. If we're comparing contents of Datastore
and SQL databases, we shouldn't have replay actively running during the
comparison, so the pipeline will grab the locks.
Beam doesn't always play nicely with loading from / saving to Datastore,
so we need to make sure that we store the replay locks in SQL at all
times, even when Datastore is the primary DB.
* Re-enable replay tests for most environments
This enables the replay tests except in environments where
the NOMULUS_DISABLE_REPLAY_TESTS environment variable is set to "true".
* Add a check for null
* Alt entity model for fast JPA bulk query
Defined an alternative JPA entity model that allows fast bulk loading of
multi-level entities, DomainBase and DomainHistory. The idea is to bulk-load
the base table as well as the child tables separately, and assemble them
into the target entity in memory in a pipeline.
For DomainBase:
- Defined a DomainBaseLite class that models the "Domain" table only.
- Defined a DomainHost class that models the "DomainHost" table
(nsHosts field).
- Exposed ID fields in GracePeriod so that they can be mapped to domains
after being loaded into memory.
For DomainHistory:
- Defined a DomainHistoryLite class that models the "DomainHistory"
table only.
- Defined a DomainHistoryHost class that models its namesake table.
- Exposed ID fields in GracePeriodHistory and DomainDsDataHistory
classes so that they can be mapped to DomainHistory after being
loaded into memory.
In PersistenceModule, provisioned a JpaTransactionManager that uses
the alternative entity model.
Also added a pipeline option that specifies which JpaTransactionManager
to use in a pipeline.
I observed an instance in which a couple queries from this action were,
for whatever reason, hanging around as idle for >30 minutes. Assuming
the behavior that we saw before where "an open idle serializable
transaction means all pg read-locks stick around forever" still holds,
that's the reason why the amount of read-locks in use spirals out of
control.
I'm not sure why those queries aren't timing out, but that's a separate
issue.
* Fix problems with the format tasks
The format check is using python2, and if "python" doesn't exist on the path
(or isn't python 2, or there is any other error in the python code or in the
shell script...) the format check just succeeds.
This change:
- Refactors out the gradle code that finds a python3 executable and uses it
to get the python executable to be used for the format check.
- Upgrades google-java-format-diff.py to python3 and removes the #! line.
- Fixes shell script to ensure that failures are propagated.
- Suppresses error output when checking for python commands.
Tested:
- verified that python errors cause the build to fail
- verified that introducing a bad format diff causes check to fail
- verified that javaIncrementalFormatDryRun shows the diffs that would be
introduced.
- verified that javaIncrementalFormatApply reformats a file.
- verified that well formatted code passes the format check.
- verified that an invalid or missing PYTHON env var causes
google-java-format-git-diff.sh to fail with the appropriate error.
* Fix presubmit issues
Omit the format presubmit when not in a git repo and remove unused "string"
import.
* Add a beam pipeline to create synthetic history entries in SQL
The logic is mostly lifted from CreateSyntheticHistoryEntriesAction. We
do not need to test for the existence of an embedded EPP resource in the
history entry before creating a synthetic one, because after
InitSqlPipeline runs it is guaranteed that no embedded resource exists.
# Learn more about CodeQL language support at https://aka.ms/codeql-docs/language-support
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set Java version
        uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'
      # Initializes the CodeQL tools for scanning.
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: ${{ matrix.language }}
          # If you wish to specify custom queries, you can do so here or in a config file.
          # By default, queries listed here will override any specified in a config file.
          # Prefix the list here with "+" to use these queries and those in the config file.
          # For details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs
|[](https://storage.googleapis.com/domain-registry-kokoro/internal/index.html)|[](https://storage.googleapis.com/domain-registry-kokoro/foss/index.html)|[](https://lgtm.com/projects/g/google/nomulus/alerts/)|[](https://github.com/google/nomulus/blob/master/LICENSE)|[](https://cs.opensource.google/nomulus/nomulus)|
|[](https://storage.googleapis.com/domain-registry-kokoro/internal/index.html)|[](https://storage.googleapis.com/domain-registry-kokoro/foss/index.html)|[](https://github.com/google/nomulus/blob/master/LICENSE)|[](https://cs.opensource.google/nomulus/nomulus)|

Nomulus is an open source, scalable, cloud-based service for operating
[top-level domains](https://en.wikipedia.org/wiki/Top-level_domain) (TLDs). It
is the authoritative source for the TLDs that it runs, meaning that it is
responsible for tracking domain name ownership and handling registrations,
renewals, availability checks, and WHOIS requests. End-user registrants (i.e.,
people or companies that want to register a domain name) use an intermediate
domain name registrar acting on their behalf to interact with the registry.

Nomulus runs on [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine)
and is written primarily in Java. It is the software that
[Google Registry](https://www.registry.google/) uses to operate TLDs such as .google,
.app, .how, .soy, and .みんな. It can run any number of TLDs in a single shared registry
system using horizontal scaling. Its source code is publicly available in this
repository under the [Apache 2.0 free and open source license](https://www.apache.org/licenses/LICENSE-2.0).