The primary annoyance with this is that it means we need to (or at least
should) split all tests that use the fee extension into two separate
tests -- one that simulates non-prod environments, and one that
simulates prod environments. This duplicates many tests, but that's
acceptable since the split is theoretically temporary.
* Configure Cloud Scheduler to trigger MoSAPI SLA status export to Cloud Monitoring in production
- We have configured this job to trigger every 3 minutes so that we get near-real-time updates.
- This will not emit metrics for now, as the metrics-triggering logic has not been written yet.
- Logging has been added.
* Change Trigger scheduling from 3 minutes to 5 minutes
This primarily affects the EPP greeting. We were already erroring out when any
contact flows were attempted; this should just prevent registrars from even
trying them at all.
This PR is designed to be minimally invasive, and does not remove any of the
contact flows or Jakarta XML/XJC objects/files themselves. That can be done
later as a follow-up.
Also note that the contact namespace urn:ietf:params:xml:ns:contact-1.0 is still
present for now in RDE exports, but I'll remove that subsequently as well.
BUG= http://b/475506288
This means that attempting to add a status that is already present will now
fail, and attempting to remove a status that is not present will also now fail.
This also refactors the existing checks into a single verify method, rather than
having to call three separate methods from every callsite.
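For illustration, the consolidated check might look something like this; the method and exception names here are hypothetical, not the actual Nomulus API:
```java
import com.google.common.collect.ImmutableSet;

// Hypothetical consolidated verification for status adds/removes.
private static void verifyStatusChanges(
    ImmutableSet<StatusValue> current,
    ImmutableSet<StatusValue> toAdd,
    ImmutableSet<StatusValue> toRemove)
    throws EppException {
  for (StatusValue status : toAdd) {
    if (current.contains(status)) {
      // Adding an already-present status now fails rather than no-opping.
      throw new StatusAlreadyPresentException(status);
    }
  }
  for (StatusValue status : toRemove) {
    if (!current.contains(status)) {
      // Removing an absent status now fails rather than no-opping.
      throw new StatusNotPresentException(status);
    }
  }
}
```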
BUG= http://b/474645068
* add cloud profiler to dockerfile and start script
* add apt-get update
* change in cb machine type for nomulus
* fix typo
* add max worker limit to gradle tests
* Switch to root before doing apt-get
* correct dockerfile
* jetty/Dockerfile
* profiler service conditional to kubernetes container name
Many of the actual fee extension changes are based off Weimin's PR
https://github.com/google/nomulus/pull/2912, though this makes some
additional changes based on the XML schema and description from RFC 8748.
This adds tests for the DomainCheckFlow which is the most complex and
thorough user of the fee extension, but we'll want to add further tests
to the other domain flows to make sure they're handled correctly.
It just makes it possible to delete allocation tokens; otherwise we need
to do a linear search over the entire Domain and DomainHistory tables if
we ever want to delete something.
The RST testing expects us to fail if they try to remove an IP from a
host that doesn't already have that IP, or to add one that already
exists (ditto on both for a domain's nameservers). I don't really see an
issue with our previous no-op implementation, but we need to do this to
pass the tests.
We need to have this enabled in sandbox, but we wish to wait to enable
it for production to make sure that the implementation is correct and
that clients can use it.
Soon we'll want to do something similar (but the opposite) with the old
fee extensions, where we **only** serve them in production (or maybe
unit test as well). That will allow us to pass the RST tests that depend
on only having the fee extension 1.0.
This is / will be required in https://datatracker.ietf.org/doc/rfc8748/.
I split this out from the rest of the fee-extension testing so that it
can be easily visible.
This commit introduces a new backend endpoint at `/_dr/task/triggerMosApiServiceState` that initiates the process of fetching the latest service states for all TLDs from the MoSAPI endpoint and exporting them as metrics to Cloud Monitoring.
The key changes include:
- A new `TriggerServiceStateAction` class that handles the GET request to the new endpoint.
- Logic within `MosApiStateService` to concurrently fetch states for all configured TLDs.
- A new `MosApiMetrics` class (currently a placeholder) responsible for sending the collected states to the monitoring service.
- Unit tests for the new action and the updated service logic.
This endpoint will be called periodically to ensure that the MosApi service health metrics are kept up-to-date.
This comes from the schema at https://datatracker.ietf.org/doc/rfc8748/
section 6.1. The class that we use for "premium" notes moved from the
command to the object itself.
I don't know where in the spec these are explicitly disallowed, but it
seems like good practice and we'll fail the RST tests if we don't
disallow them.
* Add GetServiceState action for MoSAPI service monitoring
Implements the `/api/mosapi/getServiceState` endpoint to retrieve service health summaries for TLDs from the MoSAPI system.
- Introduces `GetServiceStateAction` to fetch TLD service status.
- Implements `MosApiStateService` to transform raw MoSAPI responses into a curated `ServiceStateSummary`.
- Uses concurrent processing with a fixed thread pool to fetch states for all configured TLDs efficiently while respecting MoSAPI rate limits (see the sketch below).
JUnit tests added.
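A rough sketch of that concurrent fetch, assuming a hypothetical fetchServiceState helper (the pool size shown is illustrative):
```java
import com.google.common.collect.ImmutableList;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

ImmutableList<ServiceStateSummary> fetchAllStates(ImmutableList<String> tlds)
    throws Exception {
  // Bound concurrency with a fixed pool so we stay within MoSAPI rate limits.
  ExecutorService executor = Executors.newFixedThreadPool(4);
  try {
    List<Future<ServiceStateSummary>> futures = new ArrayList<>();
    for (String tld : tlds) {
      futures.add(executor.submit(() -> fetchServiceState(tld)));
    }
    ImmutableList.Builder<ServiceStateSummary> summaries = ImmutableList.builder();
    for (Future<ServiceStateSummary> future : futures) {
      summaries.add(future.get()); // surface per-TLD failures to the caller
    }
    return summaries.build();
  } finally {
    executor.shutdown();
  }
}
```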
* Refactor MoSAPI models to records and address review nits
- Convert model classes to Java records for conciseness and immutability.
- Update unit tests to use Java text blocks for improved JSON readability.
- Simplify service and action layers by removing redundant logic and logging.
- Fix configuration nits regarding primitive types and comment formatting.
* Consolidate MoSAPI models and enhance null-safety
- Moves model records into a single MosApiModels.java file.
- Switches to ImmutableList/ImmutableMap with non-null defaults in constructors.
- Removes redundant pass-through methods in MosApiStateService.
- Updates tests to use Java Text Blocks and non-null collection assertions.
* Improve MoSAPI client error handling and clean up data models
Refactors the MoSAPI monitoring client to be more robust against
infrastructure failures
* Refactor: use nullToEmptyImmutableCopy() for MoSAPI models
Standardize null-handling in model classes by using the Nomulus
`nullToEmptyImmutableCopy()` utility. This ensures consistent API
responses with empty lists instead of omitted fields.
* Add RST support in Sandbox
Added RST test label files as resources.
Added a RstTmchUtils class that loads appropriate labels according to
TLD pattern.
Temporarily changed label fetching in production to include the TLD
string, so that the new class may know which set of labels to use.
* Addressing comments
* Addressing comments
We do most of these on host create already so we should also do them on
host checks. The only added change is the character validation (our
existing hostnames all match these).
2306 signifies something that is syntactically valid but semantically
invalid (like if someone tried to register a .com domain). These errors
are for domain syntax that could never be valid, thus we should throw a
syntax exception instead of a policy exception.
Now that we have transitioned to the minimum dataset, we no longer
support any actions on contacts (and by the time this is merged /
deployed, all contacts will be deleted). We should just throw an
appropriate exception on all contact-related flows. We don't delete the
flows themselves, so that we can have an appropriate error message.
We also keep all the flows and XML templates around individually for now because we may be
required to continue to differentiate the requests in ICANN activity
reporting (e.g. srs-cont-create vs srs-cont-delete)
This is a follow-up to PR #2908, which relaxed this restriction on bare TLDs
only, but now we also allow it systemwide on domains and hostnames as well. The
rules against hyphens in these positions are still enforced on all parts of the
domain name except the last one. Correct handling of multi-part TLDs in this
regard is out of scope in this PR; a multi-part TLD that looked something like
".zz--foobar.foobar" would still fail validation. (But of course you cannot a
priori know just from looking at a 3-part string whether it might be a hostname
on a normal TLD, or a domain name on a 2-part TLD.)
This also has some annoying interactions with a trailing dot (indicating the
root), which need to be preserved, but otherwise don't affect how TLD validation
is handled.
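For illustration only (not the actual Nomulus validation code), the positional hyphen rule amounts to roughly this, with valid punycode labels exempted:
```java
import com.google.common.base.Splitter;
import java.util.List;

static void checkHyphenPlacement(String domainName) {
  List<String> labels = Splitter.on('.').splitToList(domainName);
  // The last label (the TLD) is exempt from the positional hyphen rule.
  for (int i = 0; i < labels.size() - 1; i++) {
    String label = labels.get(i);
    if (label.startsWith("xn--")) {
      continue; // valid IDN A-labels legitimately use these positions
    }
    if (label.length() >= 4 && label.startsWith("--", 2)) {
      throw new IllegalArgumentException(
          "Hyphens are not allowed in the third and fourth positions: " + label);
    }
  }
}
```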
BUG= http://b/471013082
The canonicalizeHostname() helper method is only suitable for use with domain
names or host names. It does not work on bare TLDs, because a bare TLD can
have hyphens in the third and fourth position without necessarily being an IDN.
Note that the configure TLD command already correctly allows TLDs with such
names to be created.
Note that we are still enforcing that the TLDs to be added exist, so they have
to pass all TLD naming requirements that are enforced on creating TLDs, and we
are still lowercasing the TLD names passed as arguments here (though we're no
longer punycoding them, although arguably that's not super useful on
command-line params anyway).
BUG= http://b/471013082
We used to publish test artifacts to a Maven repo on GCS, for use by
schema tests. For this to work with Kokoro, the GCS bucket must be
accessible to all users.
To comply with the no-public-user requirement, we store the necessary
jars at a well-known bucket and map them into Kokoro. This strategy
cannot be used on the Maven repo because only a small number of files
with fixed names may be mapped. With the Maven repo, there are too many
files to map.
This is a decently simple wrapper around the previously-created
BulkDomainTransferAction that batches a provided list of domains up and
sends them along to be transferred.
This PR finds instances where we previously checked if the feature flag
for contacts-prohibited was set and removes those checks, making the
contacts-prohibited behavior the only behavior. Because the tests didn't
have that feature flag set, this means we need to change a ton of tests
to remove contact references.
* Add mosapi client to interact with ICANN's monitoring system
This change introduces a comprehensive client to interact with ICANN's Monitoring System API (MoSAPI). This provides direct, automated access to critical registry health and compliance data, moving Nomulus towards a more proactive monitoring posture.
A core, stateless MosApiClient that manages communication and authentication with the MoSAPI service using TLS client certificates.
* Resolve review feedback & upgrade to OkHttp3 client
This commit addresses and resolves all outstanding review comments, primarily encompassing a shift to OkHttp3, security configuration cleanup, and general code improvements.
* **Review:** Addressed and resolved all pending review comments.
* **Feature:** Switched the underlying HTTP client implementation to [OkHttp3](https://square.github.io/okhttp/).
* **Configuration:** Consolidated TLS Certificates-related configuration into the dedicated configuration area.
* **Cleanup:** Removed unused components (`HttpUtils` and `HttpModule`) and performed general code cleanup.
* **Quality:** Improved exception handling logic for better robustness.
* Refactor and fix Mosapi exception handling
Addresses code review feedback and resulting test failures.
- Flattens package structure by moving MosApiException and its test.
- Corrects exception handling to ensure MosApiAuthorizationException
propagates correctly, before the general exception handler.
- Adds a default case to the MosApiException factory for robustness.
- Uses lowercase for placeholder TLDs in default-config.yaml.
* Refactor and improve Mosapi client implementation
Simplifying URL validation with Guava
Preconditions and refining exception handling to use `Throwables`.
* Refactor precondition checks using project specific utility
This will be necessary if we wish to do larger BTAPPA transfers (or
other types of transfers, I suppose). The nomulus command-line tool is
not fast enough to quickly transfer thousands of domains within a
reasonable timeframe.
After each deployment in sandbox or production, move the artifacts from
the corresponding release to a well-known location so that they can be
mapped to Kokoro in presubmit tests. The Kokoro-mapping does not need
public access to the GCS bucket.
The artifacts include the postgresql schema jar, the nomulus release
jar, and the uber jar of the nomulus schema integration test classes.
Every jar name consists of a fixed prefix and the environment. Each jar
of a new deployment overrides the previous copy.
These are old/pointless now that we've migrated to GKE. Note that this
doesn't update anything in the docs/ folder, as that's a much larger
project that should be done on its own.
Add a flag indicating that a Sandbox TLD should use the production
servers.
No additional TLD name pattern checks. Cloud DNS has an allowlist for
names that may use production servers.
Also updated default descriptive name generation: dropping the trailing
'.', and replacing remaining dots with '_'.
This contains the same codepoints from the
core/src/main/java/google/registry/idn/Latin-IDN.xml file, just in the old .txt
IDN format that Nomulus actually ingests.
We don't need to support the mix of GAE and GKE any more so we can get
rid of the GaeService bits and unify everything under one constant
service. This also allows us to reduce the number of services down to
four (FE, BE, PUBAPI, console) which is nice.
We've previously been using Scrypt since PR #2191 which, while being a
memory-hard slow function, isn't the optimal solution according to the
OWASP recommendations. While we could get away with increasing the
parallelization parameter to 3, it's better to just switch to the
most-recommended solution if we're switching things up anyway.
For the transition, we do something similar to PR #2191 where if the
previous algorithm's hash verification succeeds, we re-hash with Argon2id and
store that version. By doing this, we should not need any intervention
for registrars who log in at any point during the transition period.
Much of this PR, especially the parts where we re-hash the passwords in
Argon2 instead of Scrypt upon login, is based on the code that was
eventually removed in #2310.
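A minimal sketch of that login-time transition, with hypothetical helper names (the real hashing and persistence wiring lives in the registrar authentication code):
```java
boolean verifyPassword(Registrar registrar, String password) {
  String storedHash = registrar.getPasswordHash();
  if (argon2idVerify(storedHash, password)) {
    return true; // already migrated to Argon2id
  }
  if (scryptVerify(storedHash, password)) {
    // Legacy Scrypt hash matched: transparently re-hash with Argon2id so
    // the registrar is migrated with no manual intervention.
    tm().transact(
        () -> tm().put(registrar.asBuilder().setPassword(password).build()));
    return true;
  }
  return false;
}
```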
Reformat the current schema file for the RFC 8748 final version. This was
adapted from v0.12 and is not fully consistent with the final schema.
This helps highlight the differences we missed in PR 2855 when we check
in the official schema.
The servlets, at this point now that we're off GAE, are only used for
the test server (and, indirectly, in one BSA test). Instead of having
them all remain separate, we can unify them in one test servlet that
lives in the test/ folder.
This removes one avenue of potential confusion w/r/t how request routing
actually works and where we would want to add new routing.
We need to stop using maven repo on GCS to store artifacts for the
schema compatibility tests. After public access is removed from GCS
buckets, Kokoro won't be able to access it: normal access will be
denied, and the repo is too large to map (copy) to Kokoro VM as a
resource.
This PR uploads the relevant jars to each release's folder. See
go/dr-gcs-public-access-prevention for details.
With GKE, we don't need the individual servlets because the services
aren't partitioned out the same way they were in GAE.
We keep FrontendServlet and BackendServlet around for now as they serve
as the backbone for the local RegistryTestServer (for testing things
like the console).
I did some cursory tests on alpha and things seem to be unaffected -- I
was able to curl RDAP (pubapi) and create domains.
It's been over half a year now since we last used any of these and we definitely
no longer have any intentions of ever using App Engine again.
BUG= http://b/457471639
Still part of b/454947209, removing references to WHOIS where we can. We
keep the registrar type and the column names (at least for now) because
changing those is much more complicated.
We no longer need this now that no contacts can be applied to any domains at all.
A follow-up PR in subsequent weeks will delete the column from the DB schema.
BUG= http://b/448619572
Previously, we would have separate database calls for mapping from
foreign key to repo ID and then from repo ID to object. This PR modifies
those calls to load the resource directly (the old system was an
artifact of the Datastore key-value storage system).
In this PR, we merge the load-resource-by-foreign-key calls into a
single database load, as well as adding a separate cache object for
(foreign key) -> (resource). Now we cache, and have separate cleaner
code paths, for fk -> resource, fk -> repo ID, and repo ID -> resource.
Also removes the unused RdeFragmenter class
* Support Fee Extension standard in rfc 8748
Adding support for the final version of RFC 8748.
Compared with draft-0.12, the only meaningful change is in the namespace.
The rest is either schema-tightening that reflects actual usage, or
optional server-side features that we do not support.
We reuse draft-0.12 tests, only changing namespace uris in the input and
output files for the new version.
* Addressing reviews
We haven't been serving these for a while, so let's finally get rid of them.
We keep some Soy rules around in the presubmits file because we use some
Soy files as XML templates for EPP actions.
This doesn't change any underlying implementation details, and is mainly useful
to reduce the number of diffs in PR #2852 (which does change implementation
details), thus making that easier to review.
This allows us to specify a getter delegation to bypass Hibernate's limitations
on field types for the purposes of, e.g., using a sorted set in toString()
output rather than the base Hibernate unsorted HashSet type.
BUG=http://b/448631639
This affects FKs pointing to both Contact and ContactHistory. This is in
preparation to us deleting all rows in those two tables, and then subsequently
removing all application logic having to do with contacts entirely.
This is steps one and two of b/454947209
We already haven't been serving WHOIS for a while, so there's no point
in keeping the old code around. This can simplify some code paths in the
future (like, certain foreign-key-loads that are only used in WHOIS
queries).
Basically, what happened is that the cache's expireAfterWrite was being
called some number of milliseconds (say, 50-100) after the transaction
was started. That method used the transaction time instead of the
current time, so as a result the entries were sticking around 50-100ms
longer in the cache than they should have been.
This fix contains two parts, each of which I believe would be sufficient
on their own to fix the issue:
1. Use the currentTime passed in to Expiry::expireAfterCreate
2. Use the transaction time in the cache's Ticker. This keeps everything
on the same schedule.
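For illustration, here is roughly how those two parts map onto Caffeine's API; the cache wiring and the transactionClock/TTL names are hypothetical:
```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;

Cache<String, String> cache =
    Caffeine.newBuilder()
        // Part 2: tick the cache with the transaction clock so that entry
        // creation and expiry checks run on the same schedule.
        .ticker(() -> transactionClock.nowNanos())
        .expireAfter(
            new Expiry<String, String>() {
              @Override
              public long expireAfterCreate(String key, String value, long currentTime) {
                // Part 1: the returned lifetime is relative to the currentTime
                // Caffeine passes in, not to the transaction start time.
                return CACHE_TTL_NANOS;
              }

              @Override
              public long expireAfterUpdate(
                  String key, String value, long currentTime, long currentDuration) {
                return CACHE_TTL_NANOS;
              }

              @Override
              public long expireAfterRead(
                  String key, String value, long currentTime, long currentDuration) {
                return currentDuration; // reads do not extend the lifetime
              }
            })
        .build();
```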
During the release process, we are seeing the message "Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)" which seemingly can be caused by OOMs
I analyzed SQL statements run during the following flows and EXPLAIN
ANALYZEd each of them to figure out if there are any additional hash
indexes we could add that could be particularly helpful. Note: it's not
worth adding a hash index on the host_repo_id field in DomainHost
because so many rows (domains) use the same host.
- domain create
- domain delete
- domain info
- domain renew
- domain update
- host create
- host delete
- host update
I skipped the ones that use the read-only replica, as well as contact
flows (we're getting rid of them), and domain transfer/restore-related
flows as those are extremely infrequent.
Updates (AKA merges) run an extra SELECT statement to figure out if the
resource exists so that it can merge the entity into the existing object
in Hibernate's schema. When we're inserting new rows (such as new poll
messages or resource creates), we know that we don't need to do that
merge. Doing this should save us some SELECT statements (this has been
borne out in alpha).
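In JPA terms, the distinction looks roughly like this (a hypothetical wrapper; the actual change lives in the transaction manager's insert path):
```java
import javax.persistence.EntityManager;

void save(EntityManager em, Object entity, boolean knownToBeNew) {
  if (knownToBeNew) {
    // persist() schedules a plain INSERT; Hibernate doesn't need to SELECT
    // first because the row cannot already exist.
    em.persist(entity);
  } else {
    // merge() first SELECTs the existing row so the detached entity can be
    // reconciled with the managed one before the UPDATE.
    em.merge(entity);
  }
}
```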
This should help in instances of popular domains dropping, since we
won't need to do an additional two database loads every time (assuming
the deletion time is in the future).
When running the action in sandbox on 1.5M domains, it failed a few times
updating individual domains (requiring a manual restart of the entire action).
It's better to just log the individual failures for manual inspection and then
otherwise continue running the action to process the vast majority of other
updates that won't fail.
BUG = http://b/439636188
Add a flag to the CreateCdnsTld command to bypass the dns name format
check in Sandbox (limiting names to `*.test.`). With this flag, we
can create TLDs for RST testing in Sandbox.
Note that if the new flag is wrongly set for a disallowed name, the
request to the Cloud DNS API will fail. The format check in the command
just provides a user-friendly error message.
I went through all the SQL statements generated by some sample
DomainCreateFlow and DomainDeleteFlow cases to find situations where we
were either SELECTing from, or UPDATEing, tables with a direct "field =
value" format. These are the situations that I found where we can add
hash indexes. This does two things:
1. Makes these queries slightly faster, since these are usually queries on
columns that are either unique or very close to unique, and O(1) is
faster than O(log(n))
2. Spreads around the optimistic predicate locks on the previously-used
btree indexes. Many of our serialization errors came from the fact
that we were autogenerating incrementing ID values for various
tables, meaning that SELECTs, INSERTs, and UPDATEs would all try to
take predicate locks out on the same page of the btree index. Using a
hash index means that the page locks will be spread out to various
index pages, rather than conflicting with each other.
Running load tests on alpha I see significant improvements in speed and
error rates. Speed is hard to quantify due to the nature of the way the
load tests distribute tasks among the queues but it could be more than
50% improvement, and serialization errors in the logs drop by more than
90%.
This implements the first part of Minimum Data Set phase 3, wherein we delete
all contact data. This action is necessary to leave a permanent record on the
domain (in the form of a domain history entry) documenting when the contacts
were removed by the administrative user.
Then, after this has finished removing all contact associations, we can simply
empty out or drop the Contact/ContactHistory tables and associated join tables.
* Skip user loading for proxy service account
Reduces database load by skipping the User entity lookup for the proxy
service account during OIDC authentication.
The high volume of EPP "hello" and "login" commands from the proxy
service account results in a constant database load. These lookups
are unnecessary as the proxy service account is not expected to have a
corresponding User object.
This change optimizes the authentication flow by checking for the proxy
service account email *before* attempting to load a User from the
database. This bypasses the database transaction entirely for these
high-volume requests.
This approach is more efficient than caching, as it eliminates the
database lookup for the proxy service account altogether, rather than
just caching the result.
* Comment added and service account lookup time improved
* comment updated for more clarity
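A minimal sketch of the bypass described above, assuming a hypothetical proxyServiceAccountEmail config value and loadUserByEmail helper:
```java
import java.util.Optional;

Optional<User> authenticateUser(String emailFromOidcToken) {
  if (emailFromOidcToken.equals(proxyServiceAccountEmail)) {
    // The proxy service account has no corresponding User row, so skip the
    // database transaction entirely for these high-volume hello/login calls.
    return Optional.empty();
  }
  return loadUserByEmail(emailFromOidcToken);
}
```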
* Add cache for User entities in OIDC auth flow
* refactor: Address review feedback
- Refactor database call into a single, reusable method
- Increase the default cache size to 200
- Remove .recordStats() and using spy for testing
- Split unit tests into separate implementation test that use Mockito spies instead of checking internal cache stats
We've moved these over to the User class, so we should remove these for
clarity. In addition, we should make it clear (in Java at least) that
the field in the RegistryLock object refers to the email address used
for the lock in question.
This allows us to also check / modify the CharlestonRoad registrar in
the console, and also allows us to test actions (like password reset)
using that registrar in the prod environment.
* Fix OOM in UploadBsaUnavailableDomains action
The action was using string concatenation to generate the upload content.
This causes an OOM when string length exceeds 25MB on our current VM.
This PR switches to streaming upload.
Also added an HTTP upload test.
The error happened in the case where an unblockable name reported with
'Registered' as the reason has been deregistered. We tried to check the
deletion time of the domain to decide if this is a transient error
that is not worth reporting. However, we forgot that we do not have
the domain key in this case.
As this is a best-effort action, and the case rarely happens, we decided
not to make the optimization (staleness check) in this case.
Note: this still includes "contacts" for registrars, which are actually
a different concept that we call RegistrarPoc. That's different from
"Contact" objects, e.g. registrant.
The TLD is technically valid but it doesn't exist for us -- we should
return 404 instead of 400 in these situations according to the RDAP
conformance docs
Load balancer / internal redirections can result in the final request
URL lacking "https" when finally getting to the servlet. As a result,
even if you use https in the request, the resulting URL can be plain
http.
We need to include the actual (HTTPS) URL in the output, so replace it.
This increases the hikari fetch size to 40 from 20 in order to decrease the
number of round trips.
This also sets lower CPU, as we seem to have overshot CPU consumption.
This also sets min replicas to 8 for EPP and max to 16, as we've been
running on 8-10 for the last week.
Tested locally and on alpha with dummy values (and throwing an
exception).
I was able to reuse a bit of code from the EPP password reset, but not
all of it.
This is the first in a series of PRs to implement the expiry access period
(XAP). The overall fee schedules will be set in YAML config files, so the only
DB change necessary should be this single new boolean column on the Tld entity,
which defaults to false so as to require XAP explicitly being turned on for a
given TLD.
BUG=http://b/437398822
Postgresql-17.6 introduces two new lines in pg_dump output as a
security feature: `\restricted {HASH}` and `\unrestricted {HASH}`.
We filter out lines starting with these two prefixes when comparing
schemas.
The db upgrade also adds two empty lines to the pg_dump output. We
now ignore all empty lines when comparing schemas.
It's necessary to remove the GAE-related code (and use GKE launch
commands instead), and we might as well remove contact-related fields
and actions because of the upcoming move to the minimum data set.
This reverts commit 5cef2dd8b5.
We faced a CPU quota issue with the standard machine type, so we're rolling
back to c4 for now to monitor server performance and decide whether we want
to try the downgrade again in the future.
We probably want this to run before the billing recurrence expansion
pipeline just in case there are any domains that should be deleted
before their billing recurrence gets expanded.
nodeSelector can limit k8s's scheduling options, which leads to delays in assigning new workloads. Since we do not require any particular machine for execution, it can be removed.
* Fix: Robustly parse certs and provide specific errors
* Add test for expired certificate failure
* fixing indentation
* fixing indentation
* Update SecurityActionTest.java
* Update SecurityActionTest.java for correcting the testcase
* Fix: Provide indentation fix
* Fixing Deduplication in test
All deployments received an update to the averageUtilization CPU target. This should allow us to stay ahead of the traffic curve and create instances before CPU reaches the limit.
Frontend CPU allocation has caused a "noisy neighbor" problem with pods assigned to nodes where there's not enough bursting capacity, so I increased it.
Adjusted the rest of the deployments according to their utilization.
Missing resource requests, as well as missing metrics for when to evict
resources, produced a situation where k8s struggled to assign pods under
load. This adds default resource requirements based on two weeks of
metrics, plus instructions for when resources should be evicted.
* Fix error handling in CopyDetailReportsAction
The action tries to record errors per registrar in an ImmutableMap, without realizing that
there may be duplicate keys due to retries.
Switched to the `buildKeepingLast` method to build the map.
* Addressing comments and rebase
Given an entity with auto-filled id fields (annotated with
@GeneratedValue, with null as initial value), when inserting it
using Hibernate, the id fields will be filled with non-nulls even
if the transaction fails.
If the same entity instance is used again in a retry, Hibernate mistakes
it as a detached entity and raises an error.
The work around is to make a new copy of the entity in each transaction.
This PR applies this pattern to affected entity types.
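A sketch of the pattern at a call site, with illustrative entity and helper names:
```java
void insertFreshCopyPerAttempt(BillingEvent template) {
  tm().transact(
      () -> {
        // If a prior attempt failed after Hibernate populated the
        // @GeneratedValue id, reusing that instance would be mistaken for
        // a detached entity; a fresh copy starts with a null id again.
        BillingEvent fresh = template.asBuilder().build();
        tm().insert(fresh);
      });
}
```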
We considered applying this pattern to JpaTransactionManagerImpl's insert
method so that individual call sites do not have to change. However, we
decided against it because:
- It is unnecessary for entity types that do not have auto-filled id
- The JpaTransactionManager cannot tell if copying is cheap or
expensive. It is better exposing this to the user.
- The JpaTransactionManager needs to know how to clone entities. A new
interface may need to be introduced just for a handful of use cases.
It will now only throw errors on domain updates if a new contact/registrant has
been specified where none was previously present. This means that domain updates
on unrelated fields (e.g. nameserver changes) will succeed even if there is
existing contact data that the update is not removing.
This is a follow-up to #2781.
BUG=http://b/434958659
Some of these have been around since the Datastore days and are no
longer relevant (dealing with things like Datastore foreign keys). Let's
simplify things.
This works fairly similarly to the registry lock request and
verification mechanism. The request action generates a UUID which is
emailed (in link form) to the user in question. The frontend will send a
request to the verify action with the UUID and hopefully the action
should be finalized.
EPP password requests can be sent by anyone with edit-registrar
permissions and must be approved by an admin POC email.
Registry lock password resets can only be sent by primary contacts, and
are verified/performed by the user in question.
This prohibits all contact data on create and update EPP flows for both domain
and contact flows. It also refactors how default values on FeatureFlags work, as
it's safer to specify a single default on the flag itself rather than have to
specify it independently at a number of callsites (and potentially end up having
an inconsistent value). Domain updates on existing domains that still have
contact data will fail unless all contact data is removed, as a forcing function
to require registrars to rectify the situation prior to being able to do any
other kind of domain changes.
Contact-related flows that are still allowed after this point: Updating a domain
to remove all contacts from it, and deleting a contact object.
This corresponds to the Feb 2024 response profile section 1.2 and
implementation guide 1.3 respectively, now that we comply (or are, at
least, closer to complying) with the Feb 2024 versions.
This should probably depend on https://github.com/google/nomulus/pull/2771,
because that PR includes a small change from the Feb 2024 version.
This also updates the documentation to reference the proper areas of the
specifications.
From the response profile:
2.4.6. Registrar URL - The entity with the registrar role in the RDAP response
MUST contain a links member [RFC9083]. The links object MUST contain
the elements: value, identical to the the RDAP Base URL for the
Registrar as provided in the IANA “Registrar IDs” registry (i.e.,
https://www.iana.org/assignments/registrar-ids); rel:about, and href
containing the Registrar URL. Note: in cases where the Registry Operator
acts as sponsoring Registrar (e.g., IANA Registrar ID 9999), the href shall
contain a URL from the Registry.
This is necessary so that the total number of requests/responses adds up
correctly even though some fraction of them are only being recorded. It uses
stochastic rounding so that the totals add up correctly even when the reciprocal
of the ratio isn't an integer.
This is a follow-up to PR #2772.
This ratio defaults to 1.0 (i.e. all metrics will be recorded), but we will set
it much lower in sandbox and production, probably something closer to 0.01. This
will reduce recorded metrics volume and thus StackDriver cost, while still
retaining enough data for overall performance monitoring.
This is handled stochastically, so as to not require any coordination between
Java threads or GKE pods/clusters, as alternative approaches would (i.e. using a
counter and recording every Nth, or throttling to a max metrics qps).
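A sketch of the recording path under these assumptions (the metric type and recordingRatio plumbing are illustrative):
```java
import java.util.concurrent.ThreadLocalRandom;

void maybeRecord(Metric metric, double recordingRatio) {
  ThreadLocalRandom random = ThreadLocalRandom.current();
  if (random.nextDouble() >= recordingRatio) {
    return; // e.g. with a 0.01 ratio, ~99% of events are skipped
  }
  // Each recorded event stands in for 1/ratio events.
  double reciprocal = 1.0 / recordingRatio;
  long increment = (long) reciprocal;
  // Stochastic rounding: round up with probability equal to the fractional
  // part, so totals are correct in expectation even when the reciprocal
  // isn't an integer.
  if (random.nextDouble() < reciprocal - increment) {
    increment++;
  }
  metric.incrementBy(increment);
}
```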
In RDAP, domain queries are the most common by a factor of like 40,000
so we should optimize these as much as possible. We already have an EPP
resource / foreign key cache which does improve performance somewhat but
looking at some sample logs, it only cuts the RDAP request times by like
40% (looking at requests for the same domain a few seconds apart).
History entries don't change often, so we should cache them to make
subsequent queries faster as well. In addition, we're only caching two
fields per repo ID (modification time, registrar ID) so we can cache
more entries than we can for the EPP resource cache (which stores large
objects).
Pubapi actions should always use cache, regardless of the config
settings on caching.
In EppResource.java, the original `loadCached(Iterable<VKey>)`
method is renamed to `loadByCacheIfEnabled`. The original
`loadCached(VKey)` method is renamed to `loadByCache` and always
uses the cache.
In EppResourceUtils.java, the original `loadByForeignKeyCached`
method is renamed to `loadByForeignKeyByCacheIfEnabled`. A new
`loadByForeignKeyByCache` method is added, which always uses the cache.
In ForeignKeyUtils.java, the original `loadCached` method is
renamed to `loadByCacheIfEnabled`, and a new `loadCached` method
is added which always uses the cache.
Also added a `getContactsFromReplica` method in Registrar,
for use by RDAP actions.
A future PR will add the actions that save and use this object. That
future PR will also require loading RegistrarPoc objects given the
registrar ID, hence the change in that class.
This implements two types of changes:
1. changing the link type for things like the terms of service
2. adding the request URL to each and every link with the "value" field.
This is a bit tricky to implement because the links are generated in
various places, but we can implement it by adding it to the results
after generation.
See b/418782147 for more information
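The post-generation pass could look roughly like this (the Link type and builder are hypothetical stand-ins for the RDAP response objects):
```java
import static com.google.common.collect.ImmutableList.toImmutableList;

import com.google.common.collect.ImmutableList;

ImmutableList<Link> withRequestUrl(ImmutableList<Link> links, String requestUrl) {
  // Links are generated all over the codebase, so stamp the request URL
  // into each link's "value" field after the response is assembled.
  return links.stream()
      .map(link -> link.asBuilder().setValue(requestUrl).build())
      .collect(toImmutableList());
}
```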
Add a new DnsWritersModule for use by the component classes.
To override the set of writers installed, we can easily overwrite this
file with a private version.
This is necessary because we'll use primary-contact emails as a way of
resetting passwords.
In the UI, don't allow editing of email address for primary contacts,
and don't allow addition/removal of the primary contact field
post-creation.
In the backend, make sure that all emails previously added still exist.
We're changing the way that allocation tokens work in suboptimal (i.e. incorrect) situations in the domain check, creation, and renewal process. Currently, if a token is not applicable in any way to any of the operations (including when a check has multiple operations requested), we return some variation of "Allocation token not valid" for all of those options. We wish to allow for a more lenient process, where if a token is "not applicable" rather than "invalid", we just pass through that part of the request as if the token were not there.
Types of errors that will remain catastrophic, where we'll basically return a token error immediately in all cases:
- nonexistent or null token
- token is assigned to a particular domain and the request isn't for that domain
- token is not valid for this registrar
- token is a single-use token that has already been redeemed
- token has a promotional schedule and it's no longer valid
Types of errors that will now be a silent pass-through, as if the user did not issue a token:
- token is not allowed for this TLD
- token has a discount, is not valid for premium names, and the domain name is premium
- token does not allow the provided EPP action
Currently, the last three types of errors cause that generic "token invalid" message but in the future, we'll pass the requests through as if the user did not pass in a token. This does allow for a default token to apply to these requests if available, meaning that it's possible that a single DomainCheckFlow with multiple check requests could use the provided token for some check(s), and a default token for others.
The flip side of this is that if the user passes in a catastrophically invalid token (the first five error messages above), we will return that result to any/all checks that they request, even if there are other issues with that request (e.g. the domain is reserved or already registered).
See b/315504612 for more details and background
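A hypothetical sketch of how the two error classes separate (helper names are illustrative; the real validation is spread across the token logic):
```java
import java.util.Optional;

Optional<AllocationToken> resolveToken(String tokenString, DomainCommand command)
    throws EppException {
  // Catastrophic problems always throw, for every operation in the request.
  AllocationToken token = loadTokenOrThrow(tokenString); // nonexistent/null
  verifyNotAlreadyRedeemed(token);
  verifyRegistrarAndDesignatedDomain(token, command);
  verifyPromotionalSchedule(token);
  // "Not applicable" problems silently fall back to no-token behavior,
  // which also lets a default token apply instead if one is available.
  if (!token.getAllowedTlds().contains(command.getTld())
      || !token.getAllowedEppActions().contains(command.getAction())
      || (command.isPremiumName() && !token.isValidForPremiums())) {
    return Optional.empty();
  }
  return Optional.of(token);
}
```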
We suspected this could be a cause of optimistic locking failures
(because long transactions would lead to optimistic locks not being
released) but this didn't end up being the case. Let's remove this to
reduce log spam.
1. This doesn't remove the SQL tables yet (this is necessary to pass
tests and also good practice just in case we need or want to look at
history for a little bit)
2. This also removes the Registrar, RegistrarPoc, and User base classes
that were only necessary because we were saving copies of those
objects in the old history classes.
We no longer need to union GKE+GAE logs since we've moved all production
traffic to GKE only.
For testing, I copied the affected *_test.sql files to Bigquery, removed
all the "-alpha" bits, and changed the dates to 20250301 and 20250331
and ran them to make sure they returned the expected data.
Now that we have effective global sessions thanks to #2734, there is no
longer a need to keep the number of pods on the EPP service static.
We are also not vulnerable to random pod restarts. K8s never guarantees
perpetual pod lifetime anyway, and not having to be at its mercy is
certainly a relief.
It turns out period can be used in the URI, such as in
"urn:ietf:params:xml:ns:fee-0.12". I don't think pipe is used, at least
not according to EPP URI namespace naming convention.
Ideally we'd use serialization, but using the default serialization runs
the risk of it being platform/JDK dependent, so a new deployment might
not be able to deserialize existing cookies. A custom serializer that
guarantees stability would have been needed.
This was added back in early 2018 to enable promotions, but
since then (and for many years) we've added the ability to run
promotions on the tokens themselves, rather than relying on custom Java
classes.
This will make the changes for b/315504612 much easier, as that will
split up token validation into "is this token valid in general?" and "is
this token valid for this domain/action?"
This can potentially help even more with serializable transaction
failures (optimistic locking exceptions, which are expected to occur
somewhat frequently).
With six attempts, we will sleep at most five times, for
100, 200, 400, 800, and 1600 ms respectively, for a total of at most 3.1
seconds (much less than the EPP maximum, which I believe (?) to be 30 seconds).
In addition, we add a 20% skew in an attempt to spread out
possibly-conflicting transaction retries.
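Illustrative backoff math matching those numbers (the method name and the retried exception type are assumptions):
```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;
import javax.persistence.OptimisticLockException;

<T> T transactWithRetries(Supplier<T> work) throws InterruptedException {
  for (int attempt = 0; ; attempt++) {
    try {
      return work.get();
    } catch (OptimisticLockException e) {
      if (attempt == 5) {
        throw e; // six attempts total, so at most five sleeps
      }
      long baseMillis = 100L << attempt; // 100, 200, 400, 800, 1600 ms
      // 20% skew to spread out retries of possibly-conflicting transactions.
      double skew = 1.0 + (ThreadLocalRandom.current().nextDouble() - 0.5) * 0.4;
      Thread.sleep((long) (baseMillis * skew));
    }
  }
}
```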
This changes the code to only save console histories of this type. We
keep the old Java code (and, necessarily, the corresponding SQL code)
for now because there's no harm in doing so and we want to avoid hastily
deleting too much.
The SQL statement was incorrectly flooring to zero one layer too deep, which was
negating all negative transaction report rows (which occur most frequently when
a domain in the autorenew grace period is deleted). I've changed it so that it
now only floors to zero at the report level, which still solves the issue
reported in http://b/290228682 but whose original fix caused the issue
http://b/344645788
This bug was introduced in https://github.com/google/nomulus/pull/2074
I tested this by running the new query against the DB for 2024 Q4 using the
registrar that was having issues and confirmed that the total renewal numbers
for .app now match with the sum total of what we invoiced for the last three
months of 2024.
This doesn't check for correctness (we have other scripts that do that)
but just that the service is available at all (the other scripts do not
do that).
This should, and will, be configured with a scheduled trigger in GCB (for us, in
the domain-registry-dev project) and configuration to send some sort of
pub/sub notification on failure (for us, this is already set up on
domain-registry-dev and it sends messages to the "Domain Registry
Notifications" chat channel).
We've seen this issue happen more often than not recently, where the GAE
canary deployment is stuck for about 10 minutes and then fails. The reason
is not clear, but deleting the canary version prior to a deployment always
fixes the issue.
We use the $environment property to set the console config. If it is not
given, 'alpha' is used, which has the same effect as 'production'.
TESTED=ran :jetty:copyConsole with
-Penvironment=(sandbox|production|alpha) and checked the resulting js
file.
For GKE, all tasks should be on the backend; BSA was on its own service
because of an egress IP constraint.
Also made it possible to specify a timeout for the Cloud Scheduler job,
with the default (3m) suitable for most tasks.
There are two session cookies, JSESSIONID, which is set by Jetty, and
GCLB, which is set by the Gateway.
In one session, every request other than the first one (the <hello>)
should have the same GCLB value, and every request after a successful
<login> should have the same JSESSIONID.
With these two metadata, we should be able to trace all requests that
*should* belong to the same session and debug issues with session
mismatch (if any).
There can be delays in releasing predicate locks when we have
transactions that are long-lived -- even delays in releasing predicate
locks acquired by shorter-lived transactions. Logging the transaction
duration will allow us to get a sense as to transaction durations during
busy times.
This can happen on public endpoints (in pubapi) where the service is
behind IAP but all users (including not-logged-in ones) are allowed. IAP
will add an OIDC token with no email field in the request header.
This removes logic from an inner helper method so that it becomes more clear
from callsites within each test exactly which behavior is expected from those
test conditions.
We now pin to PostgreSQL v17 when running tests, which means that the minor
version might increase without our intervention. This causes (at least)
the version comment in the golden schema to change, failing the test as a
result.
This PR adds the ability to strip lines that we deem comments from the
comparison, so we don't have to do trivial updates to the golden schema
whenever there's a minor version upgrade.
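The comparison then reduces to something like this (a simplified sketch of the filtering, not the actual test code):
```java
import static java.util.stream.Collectors.toList;

import java.util.List;
import java.util.function.Predicate;

static boolean schemasMatch(List<String> golden, List<String> dumped) {
  // Drop SQL comment lines (and their version churn) before diffing.
  Predicate<String> isContent = line -> !line.strip().startsWith("--");
  return golden.stream().filter(isContent).collect(toList())
      .equals(dumped.stream().filter(isContent).collect(toList()));
}
```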
This allows us to easily tell which tag was deployed.
Also set the gateway to use named addresses so they are stable, and so
that we can attach an IPv6 record to them. Auto-provisioned addresses are
IPv4 only.
A failed pg_dump may not leave an output file, which breaks the subsequent
diffing and causes the verifier to erroneously return success.
The verifier should abort in this case.
GKE logs are routed to a different dataset and the table is different.
The structs to look for are also different (jsonPayload vs textPayload
or protoPayload).
TESTED=Ran the resulting query in crash.
I obtained access to an IBM s390x VM so I thought I'd see how multi-arch
Nomulus is.
Our main application is in Java so it is already multi-arch, but several
tests use docker images that are by default x64. Luckily postgres has an
s390x port, but selenium does not. So I had to disable Screenshot tests
when the arch is not amd64.
This isn't a situation we'll encounter often, but if the client has an
allocation token that's valid for premium domains that gives a 0 cost,
we shouldn't require them to include the fee extension when creating the
domain. We already don't require it for standard domains.
We use this now for the DomainDeleteFlow in an attempt to figure out
what statements it's running (cross-referencing that with PSQL's own
statement logging to find slow statements).
This is kinda nonsensical because this use case is trying to apply a
single-use token multiple times in the same domain:check request --
like, trying to use a single-use token for create, renew, and transfer
all at once while having a $0 create price and a premium renewal price.
This change doesn't affect any actual business / costs, since SPECIFIED
token renewal prices were already set on the BillingRecurrence
Instead of using a separate RenewalPriceInfo object, just map the
behavior (if it exists) onto the BillingRecurrence with a special
carve-out, as always, for anchor tenants (note: this shouldn't matter
much since anchor tenants *should* use NONPREMIUM renewal tokens anyway,
but just in case, double-check).
This also fixes DomainPricingLogic to treat a multiyear create as a
one-year-create + n-minus-1-year-renewal for cases where either the
creation or the renewal (or both) are nonpremium.
We are required to respond to HTTP(S) requests on port 80/443 on the
same domain where we serve port 43 WHOIS requests. The proxy already
does this by redirecting to the web WHOIS lookup page on the marketing
website.
This PR makes it so that requests to port 80/443 can be routed to the
proxy for redirect.
TESTED=tested on crash and the redirect works.
The SecretManager-based keyring was not included in the keyring bindings,
resulting in runtime failure.
We should simplify the keyring bindings. There is no use case for multiple
implementations. See b/388835696.
Known nested transact calls found in sandbox have been refactored away.
Enable logging in production to catch any missing cases. Logging is
throttled at 1 message per minute per VM.
k8s does not have a way to expose a global load balancer with TCP
endpoints, and setting up node port-based routing is a chore, even with
Terraform (which is what we did with the standalone proxy).
We will use Cloud DNS's geolocation routing policy to ensure that
clients connect to the endpoint closest to them.
- We never delete rows from DomainHistory (and even if we do in the
future, they'll be old / the references won't matter)
- This is likely creating lock contention when lots of requests come
through at once for domains with many DomainHistory entries
There are some breaking method changes in the 10.x.y versions and we're encountering exceptions when trying to run the flywayMigrate task thanks to those.
For now it only includes two options (domain deletion and domain
suspension). In the future, as necessary, we can add other actions but
this seems like a relatively simple starting point (actions like bulk
updates are much more conceptually complex).
This might help alleviate DB transaction contention on the PollMessage table. A
lower transaction isolation level is safe because acking a poll message is
idempotent: there are only two things it does, either delete a poll message or
take a recurring one from the past and set it to be a year in the future from
the date in the past. Both of these operations will always yield the same final
result even if executed multiple times simultaneously for some reason.
It doesn't need a higher transaction isolation level as it's only loading a given poll
message once, and we want to avoid putting any kind of locks on the PollMessage table
as it seems to be having contention issues. Note that the poll message request flow
is by far the most frequent code that touches the PollMessage table, as there are many
many requests every minute from dozens of registrars, but much fewer poll messages
than that to actually ACK.
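A sketch of the idempotent ack under a lower isolation level; the transaction-manager API and poll-message accessors shown here are assumptions, not the exact Nomulus signatures:
```java
void ackPollMessage(VKey<PollMessage> key) {
  // READ_COMMITTED is enough here: both branches below yield the same final
  // state even if executed multiple times concurrently.
  tm().transact(
      TransactionIsolationLevel.TRANSACTION_READ_COMMITTED,
      () -> {
        PollMessage message = tm().loadByKey(key);
        if (message.isAutorenew()) {
          // Push the recurring event to one year after its (fixed) old
          // date; re-acking recomputes the same new date.
          tm().put(
              message.asBuilder()
                  .setEventTime(message.getEventTime().plusYears(1))
                  .build());
        } else {
          tm().delete(key); // deleting an already-deleted row is a no-op
        }
      });
}
```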
We only include the deletion time if the domain is in the 5-day
PENDING_DELETE period after the 30-day REDEMPTION period. For all other
domains, we just have an empty string as that field.
This is behind a feature flag so that we can control when it is enabled
It was failing when any kind of transfer data was present, even completed
transfer data. Note that completed transfer data persists on a domain
indefinitely until/unless a new transfer is requested.
BUG= http://b/377328244
It seems like `/usr/bin/python` is no longer symlinked to the `python3`
binary in the `gcr.io/cloud-builders/git` image.
I've sent out a separate fix to upstream to change the shebang.
https://gerrit-review.git.corp.google.com/c/gcompute-tools/+/439501
But in the meantime, we need this temporary fix for the release to
build.
These flags are suggested by GitHub support to disable reusing caches
during Gradle build. They think that could fix the intermittent error
message:
```
Encountered a fatal error while running "/opt/hostedtoolcache/CodeQL/2.19.0/x64/codeql/codeql database finalize --finalize-dataset --threads=4 --ram=14576 --verbosity=progress++ /home/runner/work/_temp/codeql_databases/java". Exit code was 32 and last log line was: CodeQL detected code written in Java/Kotlin but could not process any of it. For more information, review our troubleshooting guide at https://gh.io/troubleshooting-code-scanning/no-source-code-seen-during-build . See the logs for more details.
```
This fails for me every day for some reason (starting about a month
ago). The same commit went through the workflow fine when the action was
triggered by a push.
I think there's no reason for us to have a cron run as the changes to the
master branch can only come from commit pushes.
Newer versions of GPG (v2.4.5 in my case) use different wording
than what's available in our build image (and Ubuntu, I suspect). For
example, they say "rsa2048" instead of "2048-bit RSA".
Make the tests work in both cases. Admittedly we cannot check for the
string RSA/rsa easily, but I don't think it matters much for tests.
It looks like /usr/bin/python *may* no longer exist in the latest cloud
builder git image. I ran the latest image and logged into it to verify
that /usr/bin/python3 does exist on 9/25, and again on 9/26 where it
re-appeared.
I think it is generally a good idea to not rely on it being there going
forward.
* Include discount price in domain pricing
* Partial progress in logic
* Tests and logic passing
* Change pricing for multi year create
* Tests for discount pricing logic
* Token currency check
* Add some comments
* Java formatting
* Discount price to Optional
* Change discount price to be optional nullable
* Re-add deleted tests
Depending on whether a "--gke" parameter (which must be the last one) is passed,
the deployer constructs the corresponding URIs for GAE or GKE
accordingly.
TESTED=Used the deployer to deploy tasks to alpha and verified that they
run on GKE.
* Drop the `transact` call in Id services
All usages already routed through `tm().allocateId()`, which is
guaranteed to be in a transaction.
* Addressing reviews
Several jars in our dependencies are now multi-release, including
dnsjava and snakeyaml, and a few more. Such jars include
jvm-version-specific classes that will only be loaded by the vm that can
handle them. All it takes is a new manifest attribute.
This change allows us to upgrade to dnsjava 3.6+: the base (Java 8) version of
this jar breaks java21. The correct manifest allows java21 to find the
classes it needs.
This single metric currently accounts for 22.2% of our total metrics bill,
almost double the size of our EPP requests metric, while also simultaneously
being much less useful. This change reduces the cardinality by removing two
parameters we don't care that much about, which should significantly reduce the
size and thus the cost. If after this change the metric is still too large, I'll
also then remove the matchCount parameter from this metric. We could possibly
even consider deleting the metric in its entirety, as we hardly ever use it.
This PR also removes unused code for premium list metrics that have never
actually been written out (and that we won't bother with at this point).
This will help us if/when we need to run the report generation multiple
times, or for past dates and we don't want to send extra emails or
upload any extra reports to ICANN.
Instead of having to parse the protoPayload.line from the request logs,
we just want to inspect the textPayload from the app logs (stored in a
separate table). This applies to the EPP metrics from the activity
reporting and the attempted-adds column for the transaction reporting.
This doesn't really add any tests, and we'll require many more additions
if we actually want to have full unit testing, but this at least makes
the tests pass when running `npm test`.
Previous PRs and token changes (see b/332928676) have made it so that
SPECIFIED renewalPriceBehavior tokens must have a renewal price. As
such, we can now use that renewalPrice when creating domains with
SPECIFIED tokens.
* Add QPS and incomplete connections metrics to load test client
* Add a failed request count
* Add todos
* Reuse contact
* Add bugs to todos
* small fix
* Clarify QPS
It's valid for the auth data to be null (although it only happens 10 times
across our entire registry), so the domain update flow should not fail out with
a NullPointerException when the existing state of the data is null and the
update isn't adding that data either.
BUG=http://b/359264787
This is the first step in the field removal (second will be removing the
column from SQL once this is deployed).
There's no point in using a UserDao versus just doing the standard
loading-from-DB that we do everywhere else. No need to special-case it.
When creating/deleting users, we need to add/remove the emails in
question to/from the console email group (if it exists). This used to be
done synchronously by calling the Groups API directly from the nomulus
tool. However #2488 made it so that in all cases where group membership
is modified, a Cloud Tasks task is created to execute the change on
the server side asynchronously (because there are multiple places where
this change needs to be done, and it is easier to make it all happen on the
server side).
Alas, as it turns out, Cloud Tasks tasks need to be created with a
service account's credential (which is trivially done on the server side
because the ADC is a service account). Nomulus command runs with a user
credential, and we need to grant the relevant user permission to
masquerade as a service account, in order to enqueue tasks from the
nomulus tool. It is therefore easier to just revert to the old behavior.
Originally, we thought that User entities were going to have mutable
email addresses, and thus would require a non-changing primary key. This
proved to not be the case. It'll simplify the User loading/saving code
if we just do everything by email address.
Obviously this doesn't change much functionality, but it prepares us for
removing the id field down the line once the changes propagate.
For instance, on sandbox this will allow us to remove our global roles
but keep roles to the CharlestonRoad admin registrar. Then, when we view
the console, it will be as if we were a registrar user.
This is consistent with how other registries are handling RDAP and is also consistent
with overall behavior in WHOIS and domain info flows as implemented in my previous
PRs #2477 and #2490.
* Check FeatureFlag in domain flows before checking contacts
Check if phase 1 has begun of the transition to the minimum registry dataset, and if it has, do not require the presence of contacts in domain flows.
* Add tests
* Small test fixes
* rename flag
* Fix merge conflicts
* Change todo
* Add isActive methods
* Add javadocs
* small fix
As requested, for registrars participating in these tiered pricing
promos that wish to receive this type of response, we make the following
changes:
1. The pre-promotional (i.e. base tier) price is returned as the
standard domain-create fee when running a domain check.
2. The promotional (i.e. correct) price is returned as a special custom
command class with a name of "STANDARD PROMO" when running a domain
check
3. Domain creates will return the non-promotional (i.e. incorrect) price
rather than the actual promotional price.
This PR does only number 3. See PR #2489 for the others.
As requested, for registrars participating in these tiered pricing promos
that wish to receive this type of response, we make the following
changes:
1. The non-promotional (i.e. incorrect) price is returned as the
standard domain-create fee when running a domain check.
2. The promotional (i.e. correct) price is returned as a special custom
command class with a name of "STANDARD PROMO" when running a domain
check.
3. Domain creates will return the non-promotional (i.e. incorrect) price
rather than the actual promotional price. This is not implemented in
this PR.
Our logs are getting gummed up with an indefinitely failing and retrying task to
re-save a prober domain that doesn't exist (likely because it was hard-deleted
by the delete prober data action). This makes the re-save action stop assuming
that every enqueued re-save corresponds to an entity that exists, allowing it to
fail permanently when the entity doesn't. Failing permanently is the right thing
to do: if the entity doesn't exist now, there's no reason to think it will in
the future, and all re-saves are optimistic rather than guaranteed anyway.
This should fix http://b/350530720
* Add registryTool commands for FeatureFlags
* Fix merge conflicts
* Add required parameters and inject mapper
* Use optionals in cache to negative-cache missing objects
* Fix spelling
* Change back to bulk load in cache
* Add FeatureName enum
* Change variable name
* Use FeatureName in main parameter
This doesn't yet allow them to be absent in EPP flows, but it should make the
code not break if they happen to be null in the database. This is a follow-up to
PR #2477, which ends up being a bit easier because whereas the registrant is
used in more parts of the codebase, the other contact types (admin, technical,
billing) are really only used in RDE, WHOIS, and RDAP, and because they were
already being used as a collection anyway, the handling for if that collection
contains fewer elements or is empty happened to already be mostly correct.
When using this token (which must be tied to a particular domain), the
first year price (and only the first year price, i.e. the creation
price) will be the standard price for this TLD. Future years (i.e.
renewals) will continue at the normal premium price.
This isn't the worst thing in the world but it does result in a bad
request to the server otherwise, and log/error spam. So, only load the
domains list if we have a registrar selected.
This is the first step in migrating to the minimum registration data set. Note
that our database model already permits null domain registrants, so this just
makes the code accept it as well. Note that I haven't changed any requirements
in EPP flows yet; a later step will be to check the migration schedule and then
not require the registrant to be present if in a suitable state.
This does potentially affect the output of WHOIS/RDAP, but that's a NOOP so long
as EPP commands and other tools continue to enforce the requirement of a
registrant.
This is an optional field (will be required when the renewal price
behavior is SPECIFIED). This will allow us to set arbitrary renewal
prices for domains as part of one-off negotiations.
https://b.corp.google.com/issues/332928676
This is the last remaining GAE API that we depend on. By removing it, we are able to remove all common GAE dependencies as well.
To merge this PR, we need to create console User objects that have the same email address as the RegistrarPoc objects' login_email_address and copy over the existing registry lock hashes and salts.
We are also able to simplify the code base by removing some redundant logic like AuthMethod (API is now the only supported one) and UserAuthInfo (console user is now the only supported one)
There are several behavioral changes that are worth noting:
The XsrfTokenManager now uses the console user's email address to mint and verify the token. Previously, only email addresses returned by the GAE Users service were used, and a blank email address was used if the user was logged in as a console user. I believe this was an oversight that is now corrected.
The legacy console will return 401 when no user is logged in, instead of redirecting to the Users service login flow.
The logout URL in the legacy console is changed to use the IAP logout flow. It will clear the cookie and redirect the users to IAP login page (tested on QA).
The screenshot changes are mostly due to the console users lacking a display name and therefore showing the email address instead. Some changes are due to using the console user's email address as the registry lock email address, which is being fixed in Add DB column for separate rlock email address #2413 and its follow-up PRs.
GKE now displays log messages correctly. There is no need for an extra
leading newline, which now results in a useless blank line for each log
entry in Log Explorer.
* Create a load testing EPP client.
This code is mostly based off of what was used for a past EPP load testing client that can be found in Google3 at https://source.corp.google.com/piper///depot/google3/experimental/users/jianglai/proxy/java/google/registry/proxy/client/
I modified the old client to be open-source friendly and use Gradle.
For now, this only performs login and logout commands; I will expand on this in later PRs to add other EPP commands so that we can truly load test the system.
* Small changes
* Remove unnecessary build dep
* Add gradle build tasks
* Small fixes
* Add an instances setUp and cleanUp script
* More modifications to instance setup scripts
* change to ubuntu instance
* Add comment to make ssh work
For reasons unclear to me, if the stack trace is appended directly to
the message, the log entry will be lumped together with following logs
on GKE.
Also updated the GKE service account for Nomulus in the manifest so we
can use workload identity just for Nomulus, not other pods on the same
cluster.
There are a bunch of cases where we want common exception handling and
it's annoying to have to deal with the common "set failed response and
make sure to return" a bunch of times.
This allows us to break up request methods more easily, since we can now
often throw exceptions that will break all the way back up to
ConsoleApiAction. Previously, any error handling had to exist in the
primary handler method so it could return.
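A minimal sketch of this pattern, assuming a hypothetical base class and exception type (the real ConsoleApiAction's details may differ):
```java
import javax.servlet.http.HttpServletResponse;

/** Sketch of centralized error handling for console API actions. */
abstract class ConsoleApiActionSketch implements Runnable {
  /** Hypothetical exception that handlers can throw from any depth. */
  static class BadRequestException extends RuntimeException {
    BadRequestException(String message) {
      super(message);
    }
  }

  @Override
  public final void run() {
    try {
      runInternal(); // subclasses can throw instead of returning early
    } catch (BadRequestException e) {
      setFailedResponse(e.getMessage(), HttpServletResponse.SC_BAD_REQUEST);
    }
  }

  protected abstract void runInternal();

  protected abstract void setFailedResponse(String message, int code);
}
```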
The user, on the front end, should not be required to specify whether they're
trying to verify a lock or an unlock; they should only need the verification
code. We can inspect the lock object itself (and the domain in question) to
determine which one we're verifying.
We've added the field in the database in a previous PR. This is only
used in the old console for now because the new console does not have
registry lock functionality yet
* Remove the createBillingCost field from Tld
* fix spacing
* Change field name of map
* Rename getter
* Fix formatting
* Fix todo
* unchange column name
* Add log traces to Nomulus service on GKE
Add request-scope log traces to Nomulus on GKE which, unlike
AppEngine and Cloud Run etc, does not generate traces for hosted
applications. This change only affects the GKE image. It does not affect
the AppEngine services.
Log traces are added to Nomulus-generated logs in request-processing
threads. Forked threads are not covered yet. The single relevant use
case (TimeLimiter) will be addressed in a followup PR.
The main change is in the logging configuration:
* Use gcp-cloud-logging's LoggingHandler
* Add gcp-cloud-logging's TraceLoggingEnhancer to the handler.
* Set a thread-local trace id through the TraceLoggingEnhancer in
ServletBase on request's entry and clear it on completion.
Also removed an unused class (`RequestLogId`).
* CR
* CR
This handles both GET and POST requests. For POST requests it doesn't
actually change anything about the domains because we will need to add a
verification action (this will be done in a future PR).
* Avoid contention over the RefreshDnsRequest table
This table can be small at times, when PostgreSQL may use a table scan for
queries by key. At the SERIALIZABLE isolation level, updates to unrelated
rows may then conflict due to this `optimization`.
Lower the isolation level to REPEATABLE READ (see the sketch below).
* Code review
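Nomulus routes this through its own transaction manager, but the underlying idea in plain JDBC looks roughly like this (class, method, and dbUrl are illustrative):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

final class RefreshDnsSketch {
  /** REPEATABLE READ avoids false conflicts when key lookups become table scans. */
  static void refreshAtRepeatableRead(String dbUrl) throws SQLException {
    try (Connection conn = DriverManager.getConnection(dbUrl)) {
      conn.setAutoCommit(false);
      conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
      // ... query and update RefreshDnsRequest rows here ...
      conn.commit();
    }
  }
}
```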
This includes using the new switch format (though IntelliJ does not yet
understand patterns including default so those aren't used), multiline strings,
replacing some unnecessary type declarations with <>, converting some classes to
records, replacing some Guava predicates with native Java code, and some other
miscellaneous Code Inspection fixes.
This adds an index on transfer_billing_cancellation_id to Domain and superordinate_domain to Host. When tested on crash with the action limited to only delete 10,000 domains, before these indexes were added the action took about 2 hours to delete 10,000 domains. Once these indexes were added, the action was able to delete the 10,000 domains in a little under 2 minutes.
This also creates base classes for the objects contained within the
history classes, e.g. RegistrarBase. This is the same way that objects
stored in the HistoryEntry subclasses have base classes, e.g.
DomainBase.
We cannot rely on the user checking their login email, so we'll want to
send the emails to the other address if configured. This is already the
case in RegistrarPoc.
Use REPEATABLE READ for lock acquire/release operations to avoid conflicts
between locks.
PostgreSQL uses table scans on small tables, causing false sharing at
the SERIALIZABLE isolation level.
See b/333537928 for details.
Console users need IAP to inject the necessary OIDC tokens into their
request headers and therefore need to be bound to appropriate roles. Note
that in environments managed by latchkey, the bindings will need to be
present in latchkey config files as well, otherwise the changes made by
the nomulus tool will be reverted.
TESTED=ran the nomulus command against alpha and verified that the
bindings are created/removed upon console user creation/deletion.
The existing behavior was to ignore bad header names, in a way that was
counter-intuitive as a user of the Google Sheet. If a header name was bad (which
could just be someone accidentally changing it not realizing it needs to
correspond exactly to the name of the field on the Java object), then all of the
data in that column was just silently left as-is and never updated. This led to
gradually worsening sync and offset shift errors over time.
Now, it will write out an error message into every single cell in the bad
column, so it's clear that the column name is wrong and does not correspond to any
actual data in the DB.
BUG=http://b/332336068
As a read-only action that tolerates staleness, locking is unnecessary.
This should help with the lock contention we are observing.
Also reduces the number of VM instances provisioned for BSA and increases
the idle timeout. This should reduce invocation delay. A longer delay may
cause AppEngine to return `Timeout` status to Cloud Scheduler even
though the cron job succeeds.
This also involved breaking out an improperly done assertThat() helper overload
method for JsonObjects into a proper Subject that doesn't further overload
assertThat().
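For illustration, a minimal sketch of a custom Truth Subject for JsonObject; the class and method names here are hypothetical, not necessarily what the PR added:
```java
import static com.google.common.truth.Truth.assertAbout;

import com.google.common.truth.FailureMetadata;
import com.google.common.truth.Subject;
import com.google.gson.JsonObject;

/** Sketch of a dedicated Subject, instead of overloading assertThat(). */
final class JsonObjectSubject extends Subject {
  private final JsonObject actual;

  private JsonObjectSubject(FailureMetadata metadata, JsonObject actual) {
    super(metadata, actual);
    this.actual = actual;
  }

  static JsonObjectSubject assertThatJson(JsonObject actual) {
    return assertAbout(JsonObjectSubject::new).that(actual);
  }

  void hasStringField(String name, String expected) {
    // Throws if the field is absent; a real Subject would fail more gracefully.
    check("get(%s)", name).that(actual.get(name).getAsString()).isEqualTo(expected);
  }
}
```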
* Check for missing BSA unblockable domains
All unblockable domains created before the last refresh run should be
reported as unblockable (registered).
All reserved domains that are not registered should be reported as
unblockable (reserved). Note that transient errors may be reported for
newly added reserved domains since we do not maintain update time for
when a reserved label is associated with a TLD. However, this scenario
is extremely rare in operations.
* Addressing review
* Add batching to DeleteProberDataAction
* Only get time once
* Add separate query for dry run
* Update queries to actually properly delete all the data
* Fix merge conflicts
* Add test for foreign key constraints
* Make transaction repeatable read
* Make queries to subtables native
* Add native query for GracePeriodHistory
* Kill job after 20 hours
* remove extra time check from read query
* Add index for domainRepoId to PollMessage and DomainHistoryHost
* Add flyway fix for Concurrent
* fix gradle.properties
* Modify lockfiles
* Update the release tool and add IF NOT EXISTS
* Test removing transactional lock from deploy script
* Add transactional lock flag to actual flyway commands in script
* Remove flag from info command
* Add configuration for integration test
* Verify unblockables are truly unblockable
Unblockable domains may become blockable due to deregistration or
removal from the reserved list. The BSA refresh job is responsible
for removing them from the database. This PR verifies that the refreshes
are correct.
Note that recent changes since the last refresh are not reflected in the
result, and inconsistency due to recent deregistrations is ignored.
Changes in reserved status or IDN validity are not timestamped,
therefore we cannot ignore recent inconsistencies. However, these
changes are rare.
* Addressing code review
* Addressing code review
Upgrade to using Jakarta EE 10 from Java EE 8 by mostly following the upgrade instructions. Only the servlet package is upgraded. Other Jakarta EE components (like the persistence package that Hibernate depends on) need to be upgraded separately.
TESTED=deployed and successfully communicated with the pubapi endpoint for web WHOIS.
Note that this currently requires packaging the App Engine runtime per the instructions here due to GoogleCloudPlatform/appengine-java-standard#98. This PR will only be merged once the fix is deployed to production (https://rapid.corp.google.com/#/release/serverless_runtimes_run_java/java21_20240310_21_0).
We don't use the upload results feature (kokoro picks up the result
artifacts directly and uploads them).
Keeping it around is a maintenance burden.
Also fixed a deprecation warning.
Labels that are not in any supported IDN are not added to the database.
Remove such labels from those loaded from the block list files before
comparing with DB.
There is one remaining instance in JpaTransactionManagerImpl that cannot
be removed because DetachingTypedQuery implements TypedQuery, which has
a method that expects java.util.Date.
Console tests fail for the files that are affected by the redesign. There's no point in fixing them here; I will re-enable the task after the console redesign PR is merged.
Note that Dagger currently doesn't work with the Jakarta namespace and
we have to cap the jakarta inject package version below 2.0 so that it
still provides classes in the old namespace.
Removed the deprecation mark as it is natural to expose methods related
to a transaction like getting the entity manager or checking if one is
in a transaction through the transaction manager interface.
Also only caches/resets the original TM when in unit tests (TBH, I'm not so
sure that even this is necessary, as we don't seem to call the tool from tests
that often. Only ShellCommandTest calls the run() function
in RegistryCli, and we could just put those tests in fragileTest and make
them run sequentially and fork every time to get around the issue with
interference).
The issue with caching is that it tries to first create the to-be-cached
TM, and when the environment given is prod/sandbox/... It will try to
retrieve SQL credentials from prod/sandbox/... secret manager. This
works fine locally as we all have access to prod/sandbox/..., but fails
in Cloud Build jobs such as sync-db-objects where it provides its own
credential that has direct SQL access, but not access to
prod/sandbox/... secret manager.
TESTED=ran `./gradlew devTool --args="-e localhost generate_sql_er_diagram -o ../db/src/main/resources/sql/er_diagram"`
* Add a GitHub Action workflow
This allows us to create Gradle dependency graphs for Dependabot analysis (like the ones we already get for JavaScript dependencies).
* Update Java version
* Add build scan
* codeql 3
* run with gradle
* exclude jIFC
* build scan
* Finalize
Choose the default transaction manager based on RegistryEnvironment. This
makes it possible to run nomulus on Jetty locally. Tested with the
following:
```bash
./gradlew :jetty:run -Penvironment=alpha
curl http://localhost:8080/beta.app
```
The docker image is also updated to take an argument that specifies the
environment. It runs locally as well but the container doesn't get
access to locally stored credentials, so it fails to initialize the
transaction manager.
* Add BSA validation job
Add the BsaValidateAction class with a first check (for inconsistency
between downloaded and persisted labels).
* Addressing comments
* Addressing reviews
The staging job runs at 9AM on the 2nd day of each month, so we should set
the cursor to be after that time, otherwise we attempt to upload reports
on the 1st day of each month before they are ready, causing an error
email to be sent to us.
Remove email classes that depend on AppEngine API. They have been
replaced by the gmail-based client.
Remove `EmailMessage.from` method, which is no longer used.
There is a fixed sender address for the entire domain, which is
set by the Gmail client.
The configs remain to be cleaned up. There is a bug (b/279671974) that
tracks it.
This PR makes the runtime of most of our workload Java 21.
1. App Engine. Java 21 is in GA and it supports Java EE 8. I had to add
an environment variable so that we don't get an
AppEngineCredentials by default (we have been using
ComputeEngineCredentials for a couple of years). The upgrade to the Java
21 runtime changed a system property that controls how jetty logging
works, which also controls whether AppEngineCredentials is returned. Tested by
deploying to alpha.
2. Proxy base image upgraded to Java 21 (distroless still doesn't
support Java 21 and it looks like Temurin is the way to go
b/306728455). Tested by deploying to alpha.
3. Nomulus tool image upgrade to Temurin 21 as well. Tested locally.
4. Beam pipeline base image upgrade to Java 21. The JAVA21 flag is not
supported by gcloud yet, but specifying the image URL directly works
(and is supported). Tested by running in alpha.
5. Jetty base image upgraded to Java 21. Tested locally.
Make the necessary changes for the code base to compile with JDK 21.
Other changes:
1. Upgraded testcontainer version and the SQL image version (to be the
same as what we use in Cloud SQL). This led to some schema changes and
also changed the order of results in some test queries (for the
better I think, as the new order appears to be alphabetical).
2. Remove dependency on Truth8, which is deprecated.
3. Enabled parallel Gradle task execution and greatly increased the
number of parallel tests in standardTest. Removed outcastTest.
This PR creates a unified RegistryServlet that will serve all
non-console traffic. It also creates a jetty subproject that allows one
to run Nomulus on top of a standard Jetty 12 runtime.
`./gradlew :jetty:stage` will create a jetty base folder at
`jetty/build/jetty-base` where one is able to spin up a local Nomulus server
by running the following command inside the folder:
```bash
java -jar ${JETTY_HOME}/start.jar
```
`JETTY_HOME` is a folder where the [Jetty runtime](https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-home/12.0.6/jetty-home-12.0.6.zip) is located.
This PR also adds a Gradle task to create a Nomulus image based on the
official Jetty image:
```bash
./gradlew :jetty:buildNomulusImage
```
This reverts commit 64f5971275.
The catch block is too broad, and most of the time the errors caught are
because `command.run()` failed and had nothing to do with getting
the transaction manager. The `runCommand` method is already wrapped in a try
block that checks for `LoginRequiredException` and gives the appropriate
error message.
We need to re-assess the situation when the next time we encounter a
login issue that did not trigger `LoginRequiredException`. A blanket try
catch block is not the solution and only makes the situation more
confusing.
* Add a check so newly saved createCostTransitions get recognized and saved to the database
* Fix equals check
* Rename equals method
* Add comment explaining need for createBillingCostTransitionEqualCheck
* Remove use of shouldPublishField from ReservedList
* Remove from tests
* Update test comment
* Fix indentation
* fix test comment
* Fix test
* fix test
* Make shouldPublish column nullable
We introduced Scrypt as the default password hashing algorithm in
November 2023 and have been auto-converting saved hashes whenever a
successful EPP login or registry lock/unlock request is processed.
We will send comms to registrars to inform them of the upcoming removal of
SHA256 support and urge them to log in at least once before the change.
Otherwise, they will need to contact support to reset the password out of
band after the change.
This PR will NOT be submitted until comms are out and the effective date
is immediate.
Co-authored-by: Weimin Yu <weiminyu@google.com>
This makes the callsites look neater, as the work to execute is often a
many-line lambda, whereas the transaction isolation level is no more than a
couple dozen characters.
The Distinct transform removes duplicates based on the serialized format
of the elements. By providing a deterministic coder, we can guarantee
that no duplicates exist.
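As a sketch of the general principle, with a trivially deterministic coder (the pipeline's real element type and coder differ):
```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.PCollection;

// 'words' is an existing PCollection<String> (assumed). Distinct compares
// elements by their encoded bytes, so the coder must be deterministic --
// equal values must always encode to the same bytes -- for duplicates to
// actually collapse.
PCollection<String> unique =
    words.setCoder(StringUtf8Coder.of()).apply(Distinct.create());
```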
This is the start of a long and low priority migration, but for now I wanted to do a couple of them just to see what it looks like.
This also demonstrates the pattern for use of an @AutoBuilder to replace an @AutoValue.Builder. See https://github.com/google/auto/blob/main/value/userguide/records.md#builders for full details on that.
This removes a dependency on the App Engine SDK. It also looks (from the
logs at least) like shutdown hooks registered the old way stopped
working after the runtime was upgraded to Java 17.
Also removed some random leftover dependencies on the App Engine SDK
that are not needed any more.
* Add java changes for createBillingCostTransitions
* Add negative cost test
* Remove default value
* remove unused variable
* Add check that create cost and transitions map are the same
* inject clock, only use key set when checking for missing fields
* Add test for removing map
* Add ALLOW_BSA allocation type
Add a new type to allow creation of domains blocked by BSA.
Except for the BSA semantics, the new type behaves exactly
like SINGLE_USE.
* Addressing reviews
* Addressing review
* Change tld-update to db-object-updater
* rename sync_tlds.sh to sync_db_objects.sh
* Change to configured command name
* Change environment to sandbox explicitly for testing on alpha
* Add remaining object steps and change cloudbuild-tld-sync to cloudbuild-sync-db-objects
* Add build_environment flag
* Change order of command and directory
* Uncomment out reserved list part
Query size is borderline too-large for the replica.
At 50000, about 2 out of 30 took more than 30 seconds and were retried.
Lower to 40000 and we will keep monitoring the executions.
The fallback should only apply on creates, not on updates, otherwise it can
override an existing value for the email address when only the referral email
should be what's updated.
This fixes a bug introduced back in commit 0ead4f8d9d.
BUG= http://b/322026165
* Stop saving BSA empty refresh changes
We thought that, as a way to verify that the refresh job is running, browsing
the GCS bucket with empty files would be easier than querying the DB or going
to the GCP logging dashboard, but there are too many of them to be useful.
* Add some updates to UpdateReservedListCommand to facilitate internal config presubmits and syncing
Added a dry-run tag for presubmit tests
Added early exit behavior when there are no new changes to the list
Added a new --build_environment tag to be used to indicate command runs from build tools. This tag was also added to UpdatePremiumListCommand. Once this new tag is deployed, and break glass behavior is added, these commands will be modified to prevent runs on the command line in the production environment unless the --build_environment or --break_glass flag is used.
* Fix capitalization
* Added in commented out production environment check for buildEnv flag
The blocking issue is fixed in
https://github.com/google/nomulus/pull/2224.
Java 8 support is being deprecated on 2024-01-31 and no further deployment is
possible afterwards without exception:
https://cloud.google.com/appengine/docs/legacy/standard/java/deprecations
We have been using Java 17 on alpha/crash/qa for several months and have
not observed any blocking issues other than possibly missing email
attachments, which is being mitigated by including a link to the
attachments saved in GCS.
* Fix BsaRefreshAction bugs
Added functional tests for BsaRefreshAction, which checks for changes in
domain registration and reservation and applies them to the unblockable
domain list.
Fixed a few bugs exposed by the tests.
Also refactored a few other tests.
This also moves it back to the replica transaction manager now that it shouldn't be timing
out its queries.
And this adds a test as well (more to come!).
* Define the --end_breakglass and --build_environment flags
It is necessary to define these flags in a deployment before merging go/r3pr/2273 in order to prevent breaking the existing TLD syncing and entity presubmit testing that has already been enabled
* make break glass 2 words
* Change break_glass flag to take a Boolean and use false value to end break glass mode
* small fixes
* Fix spacing
* Add missing G
* Add clarifying comment
* Refactor a few BSA unit tests
Added a few helpers for managing reserved list in tests and updated the
tests to use them.
Also fixed a bug: when querying for newly created domains, the query
should be restricted to bsa-enrolled tlds.
* Remove create_tld and update_tld commands
These commands are no longer necessary now that configure_tld command is available. However, the configure_tld command should only be used for crash, QA, and alpha environments. TLDs in production and sandbox must be modified using modifications to their config files in Gerrit unless using the configure_tld command in breakglass mode. Check the "How to configure TLDs" procedure doc for more info.
* re-delete file
- Registered names are not affected.
- Reserved names are not affected.
- Names that are none of the above and match some BSA labels are
reported as blocked.
Also fixes the action taken in the case where zero unavailable domains are
found, and temporarily changes over to using the primary DB (because the replica
transaction was timing out at 30 seconds on large databases). I'll switch this
over to use batching and move it back to replica afterwards, but this should
unblock us temporarily.
This PR makes it possible to build the Nomulus code base using Java 17.
Building with Java 11 continues to be possible and the resulting bytecode is
still at the Java 8 level. Also upgraded Gradle to 8.5.
There are several necessary changes to make this happen:
1. Some Gradle plugins need to be upgraded to support Java 17, notably
errorprone. As a result, a lot more "errors" were caught and corrected.
2. All test code is now built and run at the Java 8 level. Previously it was left
undefined (which defaults to the version of the compiler) and had led to
situations where we inadvertently called Java 8+ features in production that
are not caught by tests. The change also made the java8compatibility subproject
obsolete, which is therefore removed.
3. Removed the docs subproject. Its main use is to generate flows.md, but it
relies heavily on Java internal APIs that have changed significantly with each
version. Upgrading to Java 11 required extensive refactoring of the code there,
and Java 17 again removed many APIs that were used. I don't think it is worth
the maintenance effort just to have a tool to generate flows.md which no one
actually reads.
4. Capped a few GCP dependencies because the latest version depends on
grpc-java >= 1.59.0, which includes a runtime incompatibility
(https://github.com/grpc/grpc-java/releases/tag/v1.59.0).
* Check BSA block status in CheckApi
Checks for and reports BSA block status if the name is not registered or
reserved.
Also moves CheckApiActionTest to standardTest. Whatever problem forcing
it to another suite has apparently disappeared.
Supports the full blocklist download cycle (download, diffing, diff-apply, and order-status reporting) and the refreshing of unblockable domains.
Submitted due to tight deadline. We will conduct post-submit review and refactoring.
SCRYPT is computationally much heavier than SHA256 (by design), which
resulted in test run time doubling due to most tests initializing canned
data that uses hashing.
Since our tests are not verifying the correctness of a specific hashing
algorithm anyway, this PR makes it so that simple concatenation is used
in tests.
Also moved RegistryEnvironment to the util subproject so it can be called by
PasswordUtils, which makes sense as it is a utility class.
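A minimal sketch of the idea; the flag-passing and helper names are hypothetical, not the PR's actual mechanics:
```java
final class PasswordHashSketch {
  /** Cheap stand-in hash for tests; the real Scrypt hash everywhere else. */
  static String hashPassword(String password, String salt, boolean isUnitTest) {
    if (isUnitTest) {
      // Tests don't verify a particular algorithm, so concatenation is enough.
      return password + salt;
    }
    return scryptHash(password, salt);
  }

  private static String scryptHash(String password, String salt) {
    throw new UnsupportedOperationException("stand-in for the real Scrypt hash");
  }
}
```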
* Add BigInt conversion to TimedTransitionProperty<Money> deserializer to handle JPY currency
* Remove unnecessary lines in test
* Add eap schedule check
* Don't use raw LinkedHashMap type
* add timezone
From our investigation, the Monday night WHOIS storm does not cause any
strain on the backend system. The backend latency metrics are all well within
the limits. The latency measured from the proxy matches observed latency
by the prober, and we see that the "used" CPU is 1.5x of "requested" CPU
during the time when the latency is above the threshold.
Making this change hopefully removes the proxy as the bottleneck and
ameliorates the pages.
Add the BsaDomainRefresh class, which tracks the refresh actions.
The refresh action checks for changes in the set of registered and
reserved domains, which are known as unblockables to BSA.
Currently, a verify action is enqueued every time the upload method
succeeds. Because the upload job is wrapped in a transaction, the
same task will be enqueued again if the transaction retries.
We cannot move the upload method outside the transaction because the
read-upload-write logic needs to be atomic, and the upload part itself
is idempotent (therefore retriable). We can, however, move the
enqueuing part outside the transaction as we only need to enqueue the
verify task once the transaction succeeds. This should fix the issue
where multiple verify jobs try to hit the same marksdb endpoints,
resulting in 429 (Too Many Requests) errors.
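A sketch of the shape of the fix; UploadResult, tm(), and the helper methods are stand-ins for the real code:
```java
// Sketch only: UploadResult, tm(), and the helpers stand in for the real code.
void uploadThenEnqueueVerify() {
  // The read-upload-write logic stays atomic inside the transaction...
  UploadResult result = tm().transact(() -> readUploadAndWriteAtomically());
  // ...but the verify task is enqueued only after the commit succeeds,
  // so a transaction retry can no longer enqueue a duplicate.
  enqueueVerifyTask(result);
}
```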
* Add a dryrun tag to UpdatePremiumListCommand and early exit command if no new changes to the list
* Change prompt string when no change to list to reflect that there is no actual prompted user input
* Add camelCase and correct flag name
This might be the cause of the SQL performance degradation that we are
observing during the recent launch. The change went in a month ago but
there hasn't been enough increase in mutating traffic to make it
problematic until the launch.
Note that presubmits should run faster too with this change, which
serves as evidence that excessive logging is the culprit.
Add the BsaDomainRefresh table, which tracks the refresh actions.
The refresh action checks for changes in the set of registered and
reserved domains, which are known as unblockables to BSA.
This doesn't fix any issues with dead/livelocks when deleting or
updating allocation tokens, but it at least will significantly reduce
the time to load the tokens that we'll want to update/delete.
The code as previously written assumed that creation fees would be the
same as renewal fees -- this is not the case for anchor tenants, where
the renewal fee is always the standard cost for the TLD (instead of any
premium cost). This was already handled properly in the actual billing
implementation, but we didn't tell the user the right renewal cost in
domain checks.
This also removes some warning logs related to nested transactions
We shouldn't have to parse through every single entry to see what
changed
Note: we don't do this for premium lists because those can be HUGE and
we don't want/need to load and display every entry. This was an explicit
choice made in https://github.com/google/nomulus/pull/1482
Since the replica SQL instance is read-only, any transaction performed
on it should be explicitly read-only, which would allow PostgreSQL to
optimize away (some) use of predicate locks.
Also changed the EPP cache to read from the replica. The foreign key
cache already behaves this way.
See: https://www.postgresql.org/docs/current/transaction-iso.html
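In plain JDBC terms the change amounts to something like this sketch (class, method, and URL are illustrative):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

final class ReplicaReadSketch {
  /** A read-only transaction lets PostgreSQL drop some predicate locking. */
  static void queryReplica(String replicaUrl) throws SQLException {
    try (Connection conn = DriverManager.getConnection(replicaUrl)) {
      conn.setReadOnly(true); // set before the transaction does any work
      conn.setAutoCommit(false);
      // ... run read-only queries here ...
      conn.commit();
    }
  }
}
```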
For reasons unclear at this point, Java 17's servlet implementation on
GAE injects IP addresses (including unroutable private IPs) into the
standard X-Forwarded-For header, which we currently use to embed
registrar IP addresses to check against the allow list. This results in
the server not properly parsing the header and rejecting legitimate
connections.
This PR sets a custom header that should not be interfered with by any
JVM implementation to store the IP address, while maintaining the old
header as a fallback. The proxy will set both headers to allow the
server to gracefully migrate from Java 8 and Java 17 (and potentially
rollback).
Also removed some headers and logic that are not used.
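A sketch of the server-side fallback logic; the custom header name here is made up, since the real one is defined by the proxy:
```java
import javax.servlet.http.HttpServletRequest;

final class ClientIpSketch {
  /** Prefer the proxy's dedicated header; fall back to X-Forwarded-For. */
  static String registrarIp(HttpServletRequest request) {
    String ip = request.getHeader("Proxy-Provided-Ip"); // hypothetical header name
    return (ip != null) ? ip : request.getHeader("X-Forwarded-For");
  }
}
```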
This does not include any styling for now, just wanted to make sure
we're all good with regards to the basic approach. I'm open to suggestions on
which columns to include.
Note: filter searching is not implemented yet because the backend does
not allow for it (yet)
Java 17 injects unexpected headers to X-Forwarded-For, which causes
issues with validating incoming IP addresses.
This is a partial reversion of #2201. We are still keeping Java 17 in other environments, but sandbox and production need to be able to parse the header to accept incoming EPP connections from registrars. Once we fix it we will re-enable Java 17 in these environments.
This allows us to switch the proxy to a different client ID without
disrupting the service. This is a temporary measure and will be removed
once the switch is complete.
Also adds a placeholder getter in the Tld class, so that it can be
mocked/spied in tests. This way more BSA related code can be submitted
before the schema is deployed to prod.
This also adds a targeted QPS as a parameter in case we need to manually bump it
up (or down) for some reason without having to make code changes and re-deploy.
This PR makes a few changes to make it possible to turn on
per-transaction isolation level with minimal disruption:
1) Changed the signatures of the transact() and reTransact() methods to allow
passing in lambdas that throw checked exceptions (see the sketch after this
list). Previously one always had to wrap such lambdas in try-and-rethrow
blocks, which wasn't a big issue when one could liberally open nested
transactions around small lambdas and keep the "throwing" part outside the
lambda. This becomes a much bigger hassle when the goal is to eliminate
nested transactions and put as much code as possible within the top-level
lambda. As a result, the transactNoRetry() method now handles checked
exceptions by re-throwing them as runtime exceptions.
2) Changed the name and meaning of the config file field that used to
indicate whether the per-transaction isolation level is enabled. Now it
decides, if transact() is called within a transaction, whether to
throw or to log, regardless of whether the transaction could have
succeeded based on the isolation override level (if provided). The
flag will initially be set to false, which will help us identify all
instances of nested calls and either refactor them or use reTransact()
instead. Once we are fairly certain that no nested calls to transact()
exist, we flip the flag to true and start enforcing this logic.
Eventually the flag will go away and nested calls to transact() will
always throw.
3) The per-transaction isolation level will now always be applied if an
override is provided. Because currently there should be no actual
use of this feature (except for places where we explicitly use an
override and have ensured no nested transactions exist, like in
RefreshDnsForAllDomainsAction), we do not expect any issues with
conflicting isolation levels, which would result in failure.
4) transactNoRetry() is made package private and removed from the
exposed API of JpaTransactionManager. This saves a lot of redundant
methods that do not have a practical use. The only instance where this
method was called outside the package was in the reader of
RegistryJpaIO, which should have no problem with retrying.
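A minimal sketch of point 1, assuming a lambda type that declares checked exceptions (interface and class names hypothetical):
```java
@FunctionalInterface
interface ThrowingSupplier<T> {
  T get() throws Exception;
}

final class TransactSketch {
  /** Checked exceptions surface as runtime ones, so callers skip try-and-rethrow. */
  static <T> T transactNoRetry(ThrowingSupplier<T> work) {
    try {
      return work.get(); // a real implementation would open/commit a transaction here
    } catch (RuntimeException e) {
      throw e;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```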
We have been using SHA256 to hash passwords (for both EPP and registry lock),
which is now considered too weak.
This PR switches to using Scrypt, a memory-hard slow hash function, with
recommended parameters per go/crypto-password-hash.
To ease the transition, when a password is being verified, both Scrypt
and SHA256 are tried. If SHA256 verification is successful, we re-hash
the verified password with Scrypt and replace the stored SHA256 hash
with the new one. This way, as long as a user uses the password once
before the transition period ends (when Scrypt becomes the only valid
algorithm), there would be no need for manual intervention from them.
We will send out notifications to users to remind them of the transition
and urge them to use the password (which should not be a problem with
EPP, but less so with the registry lock). After the transition,
out-of-band reset for EPP password, or remove-and-add on the console for
registry lock password, would be required for the hashes that have not
been re-saved.
Note that the re-save logic is not present for console users' registry
lock passwords, as there is no production data for console users yet.
Only legacy GAE users' passwords require re-saving.
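The verify-and-upgrade flow described above, as a sketch with hypothetical helper methods:
```java
/** Sketch: accept either hash, upgrading legacy SHA256 entries transparently. */
boolean verifyPassword(String password, String storedHash) {
  if (scryptVerify(password, storedHash)) {
    return true;
  }
  if (sha256Verify(password, storedHash)) {
    // Legacy hash matched: re-hash with Scrypt while we have the cleartext.
    saveHash(scryptHash(password));
    return true;
  }
  return false;
}
```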
The domains (and their associated billing recurrences) were already being loaded
to check restores, but they also now need to be loaded for renews and transfers
as well, as the billing renewal behavior on the recurrence could be modifying
the relevant renew price that should be shown. (The renew price is used for
transfers as well.)
See https://buganizer.corp.google.com/issues/306212810
Both actions have not been used for a while (the wipe out action
actually caused problems when it ran unintentionally and wiped out QA).
Keeping them around is a burden when refactoring efforts have to take
them into consideration.
It is always possible to resurrect them from git history should the need
arise.
We finally fixed Spinnaker (I hope) to deploy bundled services with Java
17 runtime. Note that the bytecodes are still targeting Java 8. The only
change this PR introduces is to switch the runtime environment to Java
17.
TESTED=deployed to crash.
This was already set to true in all environments except prod last week. Now that the release has gone out and we have not seen any issues, we should feel safe turning this on in production as well.
We previously needed to use URLFetch in some instances where TLS 1.3 is
required (mostly when connecting to ICANN servers), and the JDK-bundled SSL
engine that came with App Engine runtime did not support TLS 1.3.
It appears now that the Java 8 runtime on App Engine supports TLS 1.3
out of the box, which allows us to get rid of URLFetch, which depends on
App Engine APIs.
Also removed some redundant retry and logging logic, now that we know
the HTTP client behaves correctly.
TESTED=modified the CannedScriptExecutionAction and deployed to alpha, used the
new HTTP client to connect to the three URL endpoints that were
problematic before and confirmed that TLS connections can be established. HTTP
sessions were rejected in some cases when authentication failed, but
that was expected.
* Add a cloudbuild-tld-sync job
This job checks the Tld config files in the internal repo and syncs them with the actual Tld objects in the database using the configure_tld nomulus command.
* Add the dockerfile and shell script
* Force the command
* Add comments
* add newline
* Create a separate copy of the job for each environment
* fix file name
* Fix indentation
* Lower the isolation level for RefreshDnsForAllDomainsAction
This lowers the isolation level to TRANSACTION_REPEATABLE_READ, which will hopefully allow the entire action to run without timing out on our larger TLDs.
* Unchange default config
Cache loads will likely always be inner transactions, if they run inside a transaction at all. A cache load does not always open a transaction, since one is only necessary when the cache is not fresh at the time it is called. Since the cache itself needs to decide whether a DB transaction is necessary, it should use the reTransact method to safely indicate that the isolation level of any outer transaction is what should be used.
If the cert(s) are invalid or expired that's a problem, but that
shouldn't necessarily prevent us from changing other things. If we're
not changing the certs, leave them alone.
If we don't explicitly handle random unexpected exceptions, the error
that the front end receives includes a big ole stacktrace, which is
unhelpful for regular users and possibly bad to expose. Instead, we
should provide a vague "something went wrong" message.
Separately, we can create default SnackBar options and use those (we
want it longer than 1.5 seconds because that's pretty short).
Also made some refactoring to various Auth related classes to clean up things a bit and make the logic less convoluted:
1. In Auth, remove AUTH_API_PUBLIC as it is only used by the WHOIS and EPP endpoints accessed by the proxy. Previously, the proxy relied on OAuth and its service account was not given the admin role (in OAuth parlance), so we made these endpoints accessible by a public user, deferring authorization to the actions themselves. In practice, OAuth checks for allowlisted client IDs and only the proxy client ID was allowlisted, which effectively limited access to only the proxy anyway.
2. In AuthResult, expose the service account email if it is at APP level. RequestAuthenticator will print out the auth result and therefore log the email, making it easy to identify which account was used. This field is mutually exclusive with the user auth info field. As a result, the factory methods are refactored to explicitly create either an APP or USER level auth result.
3. Completely re-wrote RequestAuthenticatorTest. Previously, the test mingled testing functionalities of the target class with testing how various authentication mechanisms work. Now they are cleanly decoupled, and each method in RequestAuthenticator is tested individually.
4. Removed nomulus-config-production-sample.yaml as it is vastly out of date.
Surround the dot in unsafe domain names with square brackets. This
is suggested by Gmail abuse-detection and allows outgoing messages
to pass Gmail's check. This should also help with recipients' checks.
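For example, a one-method sketch of the transformation (names hypothetical):
```java
final class DotBracketer {
  /** "ugly-domain.example" -> "ugly-domain[.]example" */
  static String bracketDots(String domainName) {
    return domainName.replace(".", "[.]");
  }
}
```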
Adds a delay between emails sent in a tight loop. This helps avoid
triggering Gmail abuse detections.
Also updated the recipient address for billing alerts.
* Add custom YAML serializer for Duration
This addresses b/301119144. This changes the YAML representation of a TLD to show Duration fields as a String representation using the Java Duration object's toString() format. This eliminates the previous ambiguity over the time unit that is being used for each duration (see the sketch below).
* change standardSeconds to standardMinutes in test
* Add custom serializer to the entire mapper
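A sketch of such a serializer, assuming a Jackson-based mapper and Joda-Time's Duration, whose toString() is ISO-8601 (e.g. a five-minute duration renders as "PT300S"); the PR's actual classes may differ:
```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.SerializerProvider;
import java.io.IOException;
import org.joda.time.Duration;

/** Sketch: emit Durations as unambiguous ISO-8601 strings instead of raw millis. */
class DurationSerializer extends JsonSerializer<Duration> {
  @Override
  public void serialize(Duration value, JsonGenerator gen, SerializerProvider serializers)
      throws IOException {
    gen.writeString(value.toString()); // e.g. a 5-minute Duration -> "PT300S"
  }
}
```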
Unless an --oauth flag is used, the nomulus tool will only send the OIDC
header. The server still accepts both headers and the user should use
`create_user` command to create an admin User (with the --oauth flag on), which
will then allow one to use the nomulus tool without the --oauth flag.
The --oauth flag and the server's ability to support OAuth-based
authentication will be removed soon. Users are urged to create the User
object in time to avoid service interruption.
TESTED=verified on alpha.
We want to do this because it takes the user to an external site, which
could potentially lead to confusion if they tried to use the back button
without a new tab.
Add documentation that describes the current Cloud Build status notification
to Google Chat, as well as how to update the configuration and the
notifier service.
This isn't the prettiest thing, but it replicates the type of view /
edit functionality that we had in the original console.
Of note: this doesn't include input field validation, which would
probably be a good idea to add at some point.
It will call equalsImmutableObject(), which seems the right thing to do.
We only care if the two Tld objects have the same fields, not if they
are the same object. ErrorProne complained about comparison by identity.
* Defend against deserialization-based attacks
Added the `SafeObjectInputStream` class that defends against attacks using
malformed serialized data, including remote code execution and
denial-of-service attacks (see the sketch below).
Started using the new class to handle EPP resource VKeys and
PendingDeposits, which are passed across credential-boundaries: between
TaskQueue and AppEngine server, and between AppEngine server and the RDE
pipeline on GCE. Note that the wire format of VKeys does not change,
therefore existing tasks sitting in the TaskQueue are not affected.
Also removed an unused class: JaxbFragment.
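A minimal sketch of the class-allowlisting idea; the real `SafeObjectInputStream` likely performs additional checks:
```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;
import java.util.Set;

/** Sketch: refuse to deserialize any class not on an explicit allowlist. */
class SafeObjectInputStreamSketch extends ObjectInputStream {
  private static final Set<String> ALLOWED =
      Set.of("google.registry.persistence.VKey"); // illustrative entry

  SafeObjectInputStreamSketch(InputStream in) throws IOException {
    super(in);
  }

  @Override
  protected Class<?> resolveClass(ObjectStreamClass desc)
      throws IOException, ClassNotFoundException {
    if (!ALLOWED.contains(desc.getName())) {
      throw new InvalidClassException("Unexpected class in stream: " + desc.getName());
    }
    return super.resolveClass(desc);
  }
}
```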
This token is only ever used for logging. The GAE OAuth service will
parse the header directly when called to retrieve the current user and
user id. Logging it in prod could be a security risk if the logs are
leaked.
Previously this didn't properly deal with nested routings, e.g.
"settings/whois". It tried to just pass "whois" as the next url which
doesn't work with the router because it's nested under the settings.
Using all parts of the URL allows us to handle the nesting.
* Add a configureTld command that uses YAML
* Add more tests and edge case handling
* Add out of order test and fix wrong inject
* small changes
* Add check for ascii
* Add check for ROID suffix
Hibernate logs certain information at the ERROR level, which for the
purpose of troubleshooting is misleading, since most affected operations
succeed after retry. ERROR-level logging should only be added by Nomulus
code.
This PR does two things:
1. Disable all logging in two Hibernate classes: we cannot disable
logging at a finer granularity, and we cannot preserve lower-level
logging while disabling ERROR.
2. Add a DatabaseException which captures all error details that may
escape the typical loggers' attention: SQLException instances can be
chained in a different way from Throwable's `getCause()` method (see the
sketch below).
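Point 2 refers to SQLException's own chaining mechanism; a sketch of walking the chain that `getCause()`-based loggers would miss:
```java
import java.sql.SQLException;

final class SqlExceptionChains {
  /** Flatten a chain that getCause()-based loggers would miss. */
  static String describeChain(SQLException first) {
    StringBuilder sb = new StringBuilder();
    for (SQLException e = first; e != null; e = e.getNextException()) {
      sb.append(e.getSQLState()).append(": ").append(e.getMessage()).append('\n');
    }
    return sb.toString();
  }
}
```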
The old semantics for TransactionalFlow meant "anything that needs to mutate the
database", but then FlowRunner was not creating transactions for
non-transactional flows even though nearly every flow needs a transaction (as
nearly every flow needs to hit the database for some purpose). So now
TransactionalFlow simply means "any flow that needs the database", and
MutatingFlow means "a flow that mutates the database". In the future we will
have FlowRunner use a read-only transaction for TransactionalFlow and then a
normal writes-allowed transaction for MutatingFlow. That is a TODO.
This also fixes up some transact() calls inside caches to be reTransact(), as we
rightly can't move the transaction outside them as from some callsites we
legitimately do not know whether a transaction will be needed at all (depending
on whether said data is already in memory). And it removes the replicaTm() calls
which weren't actually doing anything as they were always nested inside of normal
tm()s, thus causing confusion.
In the future, reTransact() will be the only way to initiate a transaction that
doesn't fail when called inside an outer wrapping transaction (when wrapped,
it's a no-op). It should be used sparingly, with a preference towards
refactoring the code to move transactions outwards (which this PR also
contains).
Note that this PR includes some potential efficiency gains caused by existing
poor use of transactions. E.g. in the file RefreshDnsAction, the existing code
was using two separate transactions to refresh the DNS for domains and hosts
(one is hidden in loadAndVerifyExistence()), whereas now as of this PR it has
a single wrapping transaction to do so.
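In interface terms, the new semantics amount to roughly this sketch (whether MutatingFlow literally extends TransactionalFlow in the real code is an assumption):
```java
interface Flow {} // stand-in for the existing flow interface

/** Any flow that needs the database; will eventually get a read-only transaction. */
interface TransactionalFlow extends Flow {}

/** A flow that mutates the database; will get a normal writes-allowed transaction. */
interface MutatingFlow extends TransactionalFlow {}
```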
It turns out that disallowing all nested transactions is impractical. So
in this PR we make it possible to run nested transactions (which are not
really nested as far as SQL is concerned, but rather lexically nested
calls to tm().transact() which will NOT open new transactions when
called within a transaction) as long as there is no conflict between the
specified isolation levels between the enclosing and the enclosed
transactions.
Note that this will change the behavior of calling tm().transact() with
no isolation level override, or a null override INSIDE a transaction.
The lack of the override will allow the nested transaction to run at
whatever level the enclosing transaction runs at, instead of at the
default level specified in the config file.
* Fix Cloud Tasks failure to retry
Replace `SC_NOT_MODIFIED` (304) with `SC_SERVICE_UNAVAILABLE` (503) when
data is not available yet. Affected actions are invoice- and spec11-publishing.
It is confirmed that Cloud Tasks currently does not retry with code 304,
despite the public documentation stating so. We will use 503 for now,
pending the decision by Cloud Tasks whether to change behavior or
documentation.
The code `TOO_EARLY` (425) is another alternative. It is not meant for
our use case but at least sounds like it is. However, it is not in any
javax.servlet jar. We don't want to define our own constant, and we cannot upgrade
to jakarta.servlet yet.
Also revert previous mitigation.
This includes a bit of refactoring of the GSON creation. There can exist
some objects (e.g. Address) where the JSON representation is not equal to the
representation that we store in the database. For these objects, when
deserializing, we should update the objects so that they reflect the
proper DB structure (indeed, this is already what we do for the XML
parsing of Address).
* Change PackagePromotion to BulkPricingPackage
* More name changes
* Fix some test names
* Change token type "BULK" to "BULK_PRICING"
* Fix missed token_type reference
* Add todo to remove package type
A config file field is added to control if per-transaction isolation
level is actually used. If set to true, nested transactions will throw
a runtime exception as the enclosing transaction may run at a different
isolation level.
In this PR we only add the ability to specify the isolation level,
without enabling it in any environment (including unit tests), or
actually specifying one for any query. This should allow us to set up
the system without impacting anything currently working.
* Mitigate Cloud task retry problem
Increase PublishSpec11Action start delay to avoid the need to retry.
The only other use case is invoice, which typically does not retry:
delay is 10 minutes, pipeline finishes within 7 minutes.
This includes two changes:
1. Creating a base string-type adapter for use in parsing to/from JSON
classes that are represented as simple strings (see the sketch below)
2. Changing the object-provider methods so that the POST bodies should
contain precisely the expected object(s) and nothing else. This way,
it's easier for the frontend and backend to agree that, for instance,
one POST endpoint might accept exactly a Registrar object, or a list
of Contact objects.
Co-Authored-By: gbrodman <gbrodman@google.com>
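A sketch of point 1, the base string-type adapter (names hypothetical):
```java
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;

/** Sketch: base adapter for classes whose JSON form is a plain string. */
abstract class StringTypeAdapter<T> extends TypeAdapter<T> {
  /** Subclasses parse the string back into the domain object. */
  protected abstract T fromString(String stringValue);

  @Override
  public T read(JsonReader in) throws IOException {
    return fromString(in.nextString());
  }

  @Override
  public void write(JsonWriter out, T value) throws IOException {
    out.value(value.toString());
  }
}
```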