The primary annoyance with this is that it means we need (or at least,
should) split all tests that use the fee extension into two separate
tests -- one that simulates non-prod environments, and one that
simulates prod environments. This leads to duplication of many tests, but
that's acceptable since the split is theoretically temporary.
* Configure Cloud Scheduler to trigger the MoSAPI SLA status export to Cloud Monitoring in production
- The job is scheduled to run every 3 minutes so that we get near-real-time updates for our task.
- This will not emit metrics for now, as we have not written the metrics-triggering logic yet.
- Logs are added.
* Change the trigger schedule from 3 minutes to 5 minutes
This primarily affects the EPP greeting. We were already erroring out whenever any
contact flow was attempted; this should just prevent registrars from even
trying them at all.
This PR is designed to be minimally invasive, and does not remove any of the
contact flows or Jakarta XML/XJC objects/files themselves. That can be done
later as a follow-up.
Also note that the contact namespace urn:ietf:params:xml:ns:contact-1.0 is still
present for now in RDE exports, but I'll remove that subsequently as well.
BUG= http://b/475506288
This means that attempting to add a status that is already present will now
fail, and attempting to remove a status that is not present will also now fail.
This also refactors the existing checks into a single verify method, rather than
having to call three separate methods from every callsite.
BUG= http://b/474645068
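A minimal sketch of what the consolidated verification could look like (hypothetical names and exception type, not the actual flow code):

```java
import java.util.Set;

/** Illustration only; the real flows use EPP-specific status and exception types. */
final class StatusChangeVerifier {

  /** Hypothetical exception standing in for the flow's EPP failure response. */
  static final class InvalidStatusChangeException extends Exception {
    InvalidStatusChangeException(String message) {
      super(message);
    }
  }

  /** Single entry point replacing the three separate checks previously called at each callsite. */
  static void verifyStatusChangesAllowed(
      Set<String> currentStatuses, Set<String> toAdd, Set<String> toRemove)
      throws InvalidStatusChangeException {
    for (String status : toAdd) {
      if (currentStatuses.contains(status)) {
        throw new InvalidStatusChangeException("Cannot add status already present: " + status);
      }
    }
    for (String status : toRemove) {
      if (!currentStatuses.contains(status)) {
        throw new InvalidStatusChangeException("Cannot remove status not present: " + status);
      }
    }
  }

  private StatusChangeVerifier() {}
}
```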
* add cloud profiler to dockerfile and start script
* add apt-get update
* change in cb machine type for nomulus
* fix typo
* add max worker limit to gradle tests
* Switch to root before doing apt-get
* correct dockerfile
* jetty/Dockerfile
* Make the profiler service conditional on the Kubernetes container name
Many of the actual fee extension changes are based off Weimin's PR
https://github.com/google/nomulus/pull/2912, though this makes some
additional changes based on the XML schema and description from RFC 8748.
This adds tests for the DomainCheckFlow which is the most complex and
thorough user of the fee extension, but we'll want to add further tests
to the other domain flows to make sure they're handled correctly.
It just makes it possible to delete allocation tokens; otherwise we would need
to do a linear search over the entire Domain and DomainHistory tables whenever
we want to delete one.
The RST testing expects us to fail if they try to remove an IP from a
host that doesn't already have that IP, or to add one that already
exists (ditto on both for a domain's nameservers). I don't really see an
issue with our previous no-op implementation, but we need to do this to
pass the tests.
We need to have this enabled in sandbox, but we wish to wait to enable
it for production to make sure that the implementation is correct and
that clients can use it.
Soon we'll want to do something similar (but the opposite) with the old
fee extensions, where we **only** serve them in production (or maybe
unit test as well). That will allow us to pass the RST tests that depend
on only having the fee extension 1.0.
This is / will be required in https://datatracker.ietf.org/doc/rfc8748/.
I split this out from the rest of the fee-extension testing so that it
can be easily visible.
This commit introduces a new backend endpoint at `/_dr/task/triggerMosApiServiceState` that initiates the process of fetching the latest service states for all TLDs from the MoSAPI endpoint and exporting them as metrics to Cloud Monitoring.
The key changes include:
- A new `TriggerServiceStateAction` class that handles the GET request to the new endpoint.
- Logic within `MosApiStateService` to concurrently fetch states for all configured TLDs.
- A new `MosApiMetrics` class (currently a placeholder) responsible for sending the collected states to the monitoring service.
- Unit tests for the new action and the updated service logic.
This endpoint will be called periodically to ensure that the MosApi service health metrics are kept up-to-date.
This comes from the schema at https://datatracker.ietf.org/doc/rfc8748/,
section 6.1. The class that we use for "premium" notes moved from the
command to the object itself.
I don't know where in the spec these are explicitly disallowed, but it
seems like good practice and we'll fail the RST tests if we don't
disallow them.
* Add GetServiceState action for MoSAPI service monitoring
Implements the `/api/mosapi/getServiceState` endpoint to retrieve service health summaries for TLDs from the MoSAPI system.
- Introduces `GetServiceStateAction` to fetch TLD service status.
- Implements `MosApiStateService` to transform raw MoSAPI responses into a curated `ServiceStateSummary`.
- Uses concurrent processing with a fixed thread pool to fetch states for all configured TLDs efficiently while respecting MoSAPI rate limits (see the sketch below).
JUnit tests added.
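A rough sketch of the concurrent fetch approach; the pool size, the plain string results, and the per-TLD fetch function are illustrative assumptions rather than the actual MosApiStateService code:

```java
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ConcurrentTldStateFetchSketch {

  /** Small fixed pool so concurrent calls stay within MoSAPI rate limits (size is illustrative). */
  private static final int POOL_SIZE = 4;

  static ImmutableMap<String, String> fetchServiceStates(
      ImmutableList<String> tlds, Function<String, String> fetchStateForTld) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
    try {
      // Submit one task per TLD, then collect the results keyed by TLD.
      ImmutableMap.Builder<String, Future<String>> futures = ImmutableMap.builder();
      for (String tld : tlds) {
        futures.put(tld, pool.submit(() -> fetchStateForTld.apply(tld)));
      }
      ImmutableMap.Builder<String, String> results = ImmutableMap.builder();
      for (Map.Entry<String, Future<String>> entry : futures.build().entrySet()) {
        results.put(entry.getKey(), entry.getValue().get());
      }
      return results.build();
    } finally {
      pool.shutdown();
    }
  }

  private ConcurrentTldStateFetchSketch() {}
}
```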
* Refactor MoSAPI models to records and address review nits
- Convert model classes to Java records for conciseness and immutability.
- Update unit tests to use Java text blocks for improved JSON readability.
- Simplify service and action layers by removing redundant logic and logging.
- Fix configuration nits regarding primitive types and comment formatting.
* Consolidate MoSAPI models and enhance null-safety
- Moves model records into a single MosApiModels.java file.
- Switches to ImmutableList/ImmutableMap with non-null defaults in constructors.
- Removes redundant pass-through methods in MosApiStateService.
- Updates tests to use Java Text Blocks and non-null collection assertions.
* Improve MoSAPI client error handling and clean up data models
Refactors the MoSAPI monitoring client to be more robust against
infrastructure failures
* Refactor: use nullToEmptyImmutableCopy() for MoSAPI models
Standardize null-handling in model classes by using the Nomulus
`nullToEmptyImmutableCopy()` utility. This ensures consistent API
responses with empty lists instead of omitted fields.
* Add RST support in Sandbox
Added RST test label files as resources.
Added a RstTmchUtils class that loads appropriate labels according to
TLD pattern.
Temporarily changed label fetching in production to include the TLD
string, so that the new class may know which set of labels to use.
* Addressing comments
* Addressing comments
We do most of these on host create already so we should also do them on
host checks. The only added change is the character validation (our
existing hostnames all match these).
2306 signifies something that is syntactically valid but semantically
invalid (like if someone tried to register a .com domain). These errors
are for domain syntax that could never be valid, thus we should throw a
syntax exception instead of a policy exception.
Now that we have transitioned to the minimum dataset, we no longer
support any actions on contacts (and by the time this is merged /
deployed, all contacts will be deleted). We should just throw an
appropriate exception on all contact-related flows. We don't delete the
flows themselves, so that we can have an appropriate error message.
We also keep all the flows and XML templates around individually for now because we may be
required to continue to differentiate the requests in ICANN activity
reporting (e.g. srs-cont-create vs srs-cont-delete)
This is a follow-up to PR #2908, which relaxed this restriction on bare TLDs
only, but now we also allow it systemwide on domains and hostnames as well. The
rules against hyphens in these positions are still enforced on all parts of the
domain name except the last one. Correct handling of multi-part TLDs in this
regard is out of scope in this PR; a multi-part TLD that looked something like
".zz--foobar.foobar" would still fail validation. (But of course you cannot a
priori know just from looking at a 3-part string whether it might be a hostname
on a normal TLD, or a domain name on a 2-part TLD.)
This also has some annoying interactions with a trailing dot (indicating the
root), which need to be preserved, but otherwise don't affect how TLD validation
is handled.
BUG= http://b/471013082
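For illustration only, the position-3/4 hyphen rule with the final (TLD) label exempted might look roughly like the following; the punycode ("xn--") exemption here is an assumption based on the IDN discussion in the neighboring commit, not the actual Nomulus validation code:

```java
import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableList;

final class HyphenRuleSketch {

  /**
   * Returns true if any non-final label has hyphens in the third and fourth positions without
   * being a punycode (ACE) label; the final label -- the TLD -- is exempt per this change.
   */
  static boolean hasDisallowedHyphens(String domainName) {
    ImmutableList<String> labels = ImmutableList.copyOf(Splitter.on('.').split(domainName));
    for (int i = 0; i < labels.size() - 1; i++) {
      String label = labels.get(i);
      if (label.length() >= 4
          && label.charAt(2) == '-'
          && label.charAt(3) == '-'
          && !label.startsWith("xn--")) {
        return true;
      }
    }
    return false;
  }

  private HyphenRuleSketch() {}
}
```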
The canonicalizeHostname() helper method is only suitable for use with domain
names or host names. It does not work on bare TLDs, because a bare TLD can
have hyphens in the third and fourth position without necessarily being an IDN.
Note that the configure TLD command already correctly allows TLDs with such
names to be created.
Note that we are still enforcing that the TLDs to be added exist, so they have
to pass all TLD naming requirements that are enforced on creating TLDs, and we
are still lowercasing the TLD names passed as arguments here (though we're no
longer punycoding them, although arguably that's not super useful on
command-line params anyway).
BUG= http://b/471013082
We used to publish test artifacts to a Maven repo on GCS, for use by
schema tests. For this to work with Kokoro, the GCS bucket must be
accessible to all users.
To comply with the no-public-user requirement, we store the necessary
jars in a well-known bucket and map them into Kokoro. This strategy
cannot be used on the Maven repo because only a small number of files
with fixed names may be mapped. With the Maven repo, there are too many
files to map.
This is a decently simple wrapper around the previously-created
BulkDomainTransferAction that batches a provided list of domains up and
sends them along to be transferred.
This PR finds instances where we previously checked if the feature flag
for contacts-prohibited was set and removes those checks, making the
contacts-prohibited behavior the only behavior. Because the tests didn't
have that feature flag set, this means we need to change a ton of tests
to remove contact references.
* Add mosapi client to interact with ICANN's monitoring system
This change introduces a comprehensive client to interact with ICANN's Monitoring System API (MoSAPI). This provides direct, automated access to critical registry health and compliance data, moving Nomulus towards a more proactive monitoring posture.
It adds a core, stateless MosApiClient that manages communication and authentication with the MoSAPI service using TLS client certificates.
* Resolve review feedback & upgrade to OkHttp3 client
This commit addresses and resolves all outstanding review comments, primarily encompassing a shift to OkHttp3, security configuration cleanup, and general code improvements.
* **Review:** Addressed and resolved all pending review comments.
* **Feature:** Switched the underlying HTTP client implementation to [OkHttp3](https://square.github.io/okhttp/).
* **Configuration:** Consolidated TLS Certificates-related configuration into the dedicated configuration area.
* **Cleanup:** Removed unused components (`HttpUtils` and `HttpModule`) and performed general code cleanup.
* **Quality:** Improved exception handling logic for better robustness.
* Refactor and fix Mosapi exception handling
Addresses code review feedback and resulting test failures.
- Flattens package structure by moving MosApiException and its test.
- Corrects exception handling to ensure MosApiAuthorizationException
propagates correctly, before the general exception handler.
- Adds a default case to the MosApiException factory for robustness.
- Uses lowercase for placeholder TLDs in default-config.yaml.
* Refactor and improve Mosapi client implementation
Simplifying URL validation with Guava
Preconditions and refining exception handling to use `Throwables`.
* Refactor precondition checks using project specific utility
This will be necessary if we wish to do larger BTAPPA transfers (or
other types of transfers, I suppose). The nomulus command-line tool is
not fast enough to quickly transfer thousands of domains within a
reasonable timeframe.
After each deployment in sandbox or production, move the artifacts from
the corresponding release to a well-known location so that they can be
mapped to Kokoro in presubmit tests. The Kokoro-mapping does not need
public access to the GCS bucket.
The artifacts include the postgresql schema jar, the nomulus release
jar, and the uber jar of the nomulus schema integration test classes.
Every jar name consists of a fixed prefix and the environment. Each jar
of a new deployment overwrites the previous copy.
These are old/pointless now that we've migrated to GKE. Note that this
doesn't update anything in the docs/ folder, as that's a much larger
project that should be done on its own.
Add a flag indicating that a Sandbox TLD should use the production
servers.
No additional TLD name pattern checks. Cloud DNS has an allowlist for
names that may use production servers.
Also updated default descriptive name generation: dropping the trailing
'.', and replacing remaining dots with '_'.
This contains the same codepoints from the
core/src/main/java/google/registry/idn/Latin-IDN.xml file, just in the old .txt
IDN format that Nomulus actually ingests.
We don't need to support the mix of GAE and GKE any more so we can get
rid of the GaeService bits and unify everything under one constant
service. This also allows us to reduce the number of services down to
four (FE, BE, PUBAPI, console) which is nice.
We've previously been using Scrypt since PR #2191 which, while being a
memory-hard slow function, isn't the optimal solution according to the
OWASP recommendations. While we could get away with increasing the
parallelization parameter to 3, it's better to just switch to the
most-recommended solution if we're switching things up anyway.
For the transition, we do something similar to PR #2191 where if the
previous algorithm's hash check is successful, we re-hash with Argon2id and
store that version. By doing this, we should not need any intervention
for registrars who log in at any point during the transition period.
Much of this PR, especially the parts where we re-hash the passwords in
Argon2 instead of Scrypt upon login, is based on the code that was
eventually removed in #2310.
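As an illustration only (not the Nomulus implementation), a rehash-on-login flow might look roughly like this, using BouncyCastle's Argon2 generator with one of the OWASP-recommended parameter sets; `scryptVerify` and `saveNewHash` are hypothetical placeholders:

```java
import java.nio.charset.StandardCharsets;
import org.bouncycastle.crypto.generators.Argon2BytesGenerator;
import org.bouncycastle.crypto.params.Argon2Parameters;

final class PasswordMigrationSketch {

  /** Computes an Argon2id hash with OWASP-recommended minimum parameters (19 MiB, t=2, p=1). */
  static byte[] argon2idHash(String password, byte[] salt) {
    Argon2Parameters params =
        new Argon2Parameters.Builder(Argon2Parameters.ARGON2_id)
            .withSalt(salt)
            .withParallelism(1)
            .withMemoryAsKB(19 * 1024)
            .withIterations(2)
            .build();
    Argon2BytesGenerator generator = new Argon2BytesGenerator();
    generator.init(params);
    byte[] hash = new byte[32];
    generator.generateBytes(password.getBytes(StandardCharsets.UTF_8), hash);
    return hash;
  }

  /** Verifies a legacy Scrypt hash and, on success, transparently upgrades it to Argon2id. */
  static boolean checkLegacyPasswordAndMigrate(String password, byte[] salt) {
    if (!scryptVerify(password)) { // hypothetical: existing Scrypt verification
      return false;
    }
    saveNewHash(argon2idHash(password, salt)); // hypothetical: persist the upgraded hash
    return true;
  }

  // Hypothetical placeholders standing in for the real verification/persistence code.
  private static boolean scryptVerify(String password) {
    throw new UnsupportedOperationException("placeholder");
  }

  private static void saveNewHash(byte[] hash) {
    throw new UnsupportedOperationException("placeholder");
  }

  private PasswordMigrationSketch() {}
}
```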
Reformat the current schema file for the RFC 8748 final version. This was
adapted from v0.12 and is not fully consistent with the final schema.
This helps highlight the differences we missed in PR 2855 when we check
in the official schema.
The servlets, at this point now that we're off GAE, are only used for
the test server (and, indirectly, in one BSA test). Instead of having
them all remain separate, we can unify them in one test servlet that
lives in the test/ folder.
This removes one avenue of potential confusion w/r/t how request routing
actually works and where we would want to add new routing.
We need to stop using maven repo on GCS to store artifacts for the
schema compatibility tests. After public access is removed from GCS
buckets, Kokoro won't be able to access it: normal access will be
denied, and the repo is too large to map (copy) to Kokoro VM as a
resource.
This PR uploads the relevant jars to each release's folder. See
go/dr-gcs-public-access-prevention for details.
With GKE, we don't need the individual servlets because the services
aren't partitioned out the same way they were in GAE.
We keep FrontendServlet and BackendServlet around for now as they serve
as the backbone for the local RegistryTestServer (for testing things
like the console).
I did some cursory tests on alpha and things seem to be unaffected -- I
was able to curl RDAP (pubapi) and create domains.
It's been over half a year now since we last used any of these and we definitely
no longer have any intentions of ever using App Engine again.
BUG= http://b/457471639
Still part of b/454947209, removing references to WHOIS where we can. We
keep the registrar type and the column names (at least for now) because
changing those is much more complicated.
We no longer need this now that no contacts can be applied to any domains at all.
A follow-up PR in subsequent weeks will delete the column from the DB schema.
BUG= http://b/448619572
Previously, we would have separate database calls for mapping from
foreign key to repo ID and then from repo ID to object. This PR modifies
those calls to load the resource directly (the old system was an
artifact of the Datastore key-value storage system).
In this PR, we merge the load-resource-by-foreign-key calls into a
single database load, as well as adding a separate cache object for
(foreign key) -> (resource). Now we cache, and have separate cleaner
code paths, for fk -> resource, fk -> repo ID, and repo ID -> resource.
Also removes the unused RdeFragmenter class
* Support the Fee Extension standard in RFC 8748
Adding support for the final version of RFC 8748.
Compared with draft-0.12, the only meaningful change is in the namespace.
The rest is either schema-tightening that reflects actual usage, or
optional server-side features that we do not support.
We reuse draft-0.12 tests, only changing namespace uris in the input and
output files for the new version.
* Addressing reviews
We haven't been serving these for a while; let's finally get rid of them.
We keep some Soy rules around in the presubmits file because we use some
Soy files as XML templates for EPP actions.
This doesn't change any underlying implementation details, and is mainly useful to
reduce the number of diffs in PR #2852 (which does change implementation
details) thus making that easier to review.
This allows us to specify a getter delegation to bypass Hibernate's limitations
on field types for the purposes of, e.g., using a sorted set in toString()
output rather than the base Hibernate unsorted HashSet type.
BUG=http://b/448631639
This affects FKs pointing to both Contact and ContactHistory. This is in
preparation to us deleting all rows in those two tables, and then subsequently
removing all application logic having to do with contacts entirely.
This is steps one and two of b/454947209
We already haven't been serving WHOIS for a while, so there's no point
in keeping the old code around. This can simplify some code paths in the
future (like, certain foreign-key-loads that are only used in WHOIS
queries).
Basically, what happened is that the cache's expireAfterWrite was being
called some number of milliseconds (say, 50-100) after the transaction
was started. That method used the transaction time instead of the
current time, so as a result the entries were sticking around 50-100ms
longer in the cache than they should have been.
This fix contains two parts, each of which I believe would be sufficient
on their own to fix the issue:
1. Use the currentTime passed in in Expiry::expireAfterCreate
2. Use the transaction time in the cache's Ticker. This keeps everything
on the same schedule.
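A minimal sketch of the two fixes in Caffeine terms (not the actual cache code; `transactionStartNanos` is a hypothetical supplier reading the transaction clock):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;
import java.time.Duration;
import java.util.function.LongSupplier;

final class TransactionAwareCacheSketch {

  /** Builds a cache whose entries live for {@code ttl} measured from the transaction start. */
  static Cache<String, String> buildCache(Duration ttl, LongSupplier transactionStartNanos) {
    return Caffeine.newBuilder()
        .expireAfter(
            new Expiry<String, String>() {
              @Override
              public long expireAfterCreate(String key, String value, long currentTime) {
                // Fix 1: compute the remaining lifetime relative to the currentTime Caffeine
                // passes in; computing it relative to the (earlier) transaction time makes
                // entries stick around that much longer than intended.
                long expireAtNanos = transactionStartNanos.getAsLong() + ttl.toNanos();
                return Math.max(0, expireAtNanos - currentTime);
              }

              @Override
              public long expireAfterUpdate(
                  String key, String value, long currentTime, long currentDuration) {
                return currentDuration;
              }

              @Override
              public long expireAfterRead(
                  String key, String value, long currentTime, long currentDuration) {
                return currentDuration;
              }
            })
        // Fix 2 (equivalent): install a Ticker that reports the transaction time, so that
        // currentTime and the transaction clock can never drift apart in the first place.
        .build();
  }

  private TransactionAwareCacheSketch() {}
}
```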
During the release process, we are seeing the message "Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)" which seemingly can be caused by OOMs
I analyzed SQL statements run during the following flows and EXPLAIN
ANALYZEd each of them to figure out if there are any additional hash
indexes we could add that could be particularly helpful. Note: it's not
worth adding a hash index on the host_repo_id field in DomainHost
because so many rows (domains) use the same host.
- domain create
- domain delete
- domain info
- domain renew
- domain update
- host create
- host delete
- host update
I skipped the ones that use the read-only replica, as well as contact
flows (we're getting rid of them), and domain transfer/restore-related
flows as those are extremely infrequent.
Updates (AKA merges) run an extra SELECT statement to figure out if the
resource exists so that it can merge the entity into the existing object
in Hibernate's schema. When we're inserting new rows (such as new poll
messages or resource creates), we know that we don't need to do that
merge. Doing this should save us some SELECT statements (this has been
borne out in alpha).
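The underlying JPA distinction, roughly sketched with a plain EntityManager for illustration (Nomulus wraps this in its transaction manager):

```java
import jakarta.persistence.EntityManager;

final class InsertVsMergeSketch {

  /** For rows we know are new (poll messages, resource creates), persist() issues just an INSERT. */
  static void insertNew(EntityManager em, Object newEntity) {
    em.persist(newEntity);
  }

  /**
   * merge() first has to determine whether the row already exists, which is where the extra
   * SELECT comes from; reserve it for genuine updates of possibly-existing rows.
   */
  static <T> T upsertExisting(EntityManager em, T possiblyDetachedEntity) {
    return em.merge(possiblyDetachedEntity);
  }

  private InsertVsMergeSketch() {}
}
```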
This should help in instances of popular domains dropping, since we
won't need to do an additional two database loads every time (assuming
the deletion time is in the future).
When running the action in sandbox on 1.5M domains, it failed a few times
updating individual domains (requiring a manual restart of the entire action).
It's better to just log the individual failures for manual inspection and then
otherwise continue running the action to process the vast majority of other
updates that won't fail.
BUG = http://b/439636188
Add a flag to the CreateCdnsTld command to bypass the dns name format
check in Sandbox (limiting names to `*.test.`). With this flag, we
can create TLDs for RST testing in Sandbox.
Note that if the new flag is wrongly set for a disallowed name, the
request to the Cloud DNS API will fail. The format check in the command
just provides a user-friendly error message.
I went through all the SQL statements generated by some sample
DomainCreateFlow and DomainDeleteFlow cases to find situations where we
were either SELECTing from, or UPDATEing, tables with a direct "field =
value" format. These are the situations that I found where we can add
hash indexes. This does two things:
1. Makes these queries slightly faster, since these are usually queries on
columns that are either unique or very close to unique, and O(1) is
faster than O(log(n))
2. Spreads around the optimistic predicate locks on the previously-used
btree indexes. Many of our serialization errors came from the fact
that we were autogenerating incrementing ID values for various
tables, meaning that SELECTs, INSERTs, and UPDATEs would all try to
take predicate locks out on the same page of the btree index. Using a
hash index means that the page locks will be spread out to various
index pages, rather than conflicting with each other.
Running load tests on alpha I see significant improvements in speed and
error rates. Speed is hard to quantify due to the nature of the way the
load tests distribute tasks among the queues but it could be more than
50% improvement, and serialization errors in the logs drop by more than
90%.
This implements the first part of Minimum Data Set phase 3, wherein we delete
all contact data. This action is necessary to leave a permanent record on the
domain (in the form of a domain history entry) documenting when the contacts
were removed by the administrative user.
Then, after this has finished removing all contact associations, we can simply
empty out or drop the Contact/ContactHistory tables and associated join tables.
* Skip user loading for proxy service account
Reduces database load by skipping the User entity lookup for the proxy
service account during OIDC authentication.
The high volume of EPP "hello" and "login" commands from the proxy
service account results in a constant database load. These lookups
are unnecessary as the proxy service account is not expected to have a
corresponding User object.
This change optimizes the authentication flow by checking for the proxy
service account email *before* attempting to load a User from the
database. This bypasses the database transaction entirely for these
high-volume requests.
This approach is more efficient than caching, as it eliminates the
database lookup for the proxy service account altogether, rather than
just caching the result.
* comment added and service account lookup time improved
* comment updated for more clarity
* Add cache for User entities in OIDC auth flow
* refactor: Address review feedback
- Refactor database call into a single, reusable method
- Increase the default cache size to 200
- Remove .recordStats() and use a spy for testing
- Split unit tests into a separate implementation test that uses Mockito spies instead of checking internal cache stats
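A rough sketch of the combined approach (illustrative types, cache size, and email constant; not the actual Nomulus auth code):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.Optional;

/** Illustration only; the User record and loader stand in for the real Nomulus classes. */
final class OidcAuthSketch {

  record User(String emailAddress) {}

  private static final String PROXY_SERVICE_ACCOUNT_EMAIL =
      "proxy@example.iam.gserviceaccount.com"; // hypothetical value

  // Small cache in front of the database lookup for everyone else.
  private static final LoadingCache<String, Optional<User>> USER_CACHE =
      Caffeine.newBuilder().maximumSize(200).build(OidcAuthSketch::loadUserFromDatabase);

  static Optional<User> lookupConsoleUser(String email) {
    // Check the high-volume proxy service account *before* touching the database at all;
    // it never has a corresponding User row, so a lookup (cached or not) is wasted work.
    if (PROXY_SERVICE_ACCOUNT_EMAIL.equals(email)) {
      return Optional.empty();
    }
    return USER_CACHE.get(email);
  }

  private static Optional<User> loadUserFromDatabase(String email) {
    // Placeholder for the single, reusable database call.
    return Optional.empty();
  }

  private OidcAuthSketch() {}
}
```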
We've moved these over to the User class, so we should remove these for
clarity. In addition, we should make it clear (in Java at least) that
the field in the RegistryLock object refers to the email address used
for the lock in question.
This allows us to also check / modify the CharlestonRoad registrar in
the console, and also allows us to test actions (like password reset)
using that registrar in the prod environment.
* Fix OOM in UploadBsaUnavailableDomains action
The action was using string concatenation to generate the upload content.
This causes an OOM when string length exceeds 25MB on our current VM.
This PR switches to streaming upload.
Also added an HTTP upload test.
The error happened in the case where an unblockable name reported with
'Registered' as the reason has been deregistered. We tried to check the
deletion time of the domain to decide if this is a transient error
that is not worth reporting. However, we forgot that we do not have
the domain key in this case.
As this is a best-effort action, and the case rarely happens, we decided
not to make the optimization (staleness check) in this case.
Note: this still includes "contacts" for registrars, which are actually
a different concept that we call RegistrarPoc. That's different from
"Contact" objects, e.g. registrant.
The TLD is technically valid but it doesn't exist for us -- we should
return 404 instead of 400 in these situations according to the RDAP
conformance docs
Load balancer / internal redirections can result in the final request
URL lacking "https" when finally getting to the servlet. As a result,
even if you use https in the request, the resulting URL can be plain
http.
We need to include the actual (HTTPS) URL in the output, so replace it.
This increases the Hikari fetch size from 20 to 40 in order to decrease the
number of round trips.
This also sets lower CPU as we seem to have overshot CPU consumption.
This also sets min replicas to 8 for EPP and max to 16 as we've been
running on 8-10 for the last week
Tested locally and on alpha with dummy values (and throwing an
exception).
I was able to reuse a bit of code from the EPP password reset, but not
all of it.
This is the first in a series of PRs to implement the expiry access period
(XAP). The overall fee schedules will be set in YAML config files, so the only
DB change necessary should be this single new boolean column on the Tld entity,
which defaults to false so as to require XAP explicitly being turned on for a
given TLD.
BUG=http://b/437398822
Postgresql-17.6 introduces two new lines in pg_dump output as a
security feature: `\restricted {HASH}` and `\unrestricted {HASH}`.
We filter out lines starting with these two prefixes when comparing
schemas.
The db upgrade also adds two empty lines to the pg_dump output. We
now ignore all empty lines when comparing schemas.
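A sketch of the comparison-side filtering, with the ignored prefixes passed in (illustrative, not the actual test code):

```java
import static com.google.common.collect.ImmutableList.toImmutableList;

import com.google.common.collect.ImmutableList;
import java.util.stream.Stream;

final class SchemaDumpFilterSketch {

  /** Drops lines with ignored prefixes (e.g. the new pg_dump security lines) and blank lines. */
  static ImmutableList<String> normalizeForComparison(
      Stream<String> dumpLines, ImmutableList<String> ignoredPrefixes) {
    return dumpLines
        .filter(line -> ignoredPrefixes.stream().noneMatch(line::startsWith))
        .filter(line -> !line.isBlank())
        .collect(toImmutableList());
  }

  private SchemaDumpFilterSketch() {}
}
```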
It's necessary to remove the GAE-related code (and use GKE launch
commands instead), and we might as well remove contact-related fields
and actions because of the upcoming move to the minimum data set.
This reverts commit 5cef2dd8b5.
We faced a CPU quota issue with the standard machine type, so rolling back to c4
for now to monitor server performance and decide if we want to try to
downgrade again in the future.
We probably want this to run before the billing recurrence expansion
pipeline just in case there are any domains that should be deleted
before their billing recurrence gets expanded.
nodeSelector can limit the scheduling capabilities of k8s, which leads to delays in assigning new workloads. Since we do not require any particular machine for execution, it can be removed.
* Fix: Robustly parse certs and provide specific errors
* Add test for expired certificate failure
* fixing indentation
* fixing indentation
* Update SecurityActionTest.java
* Update SecurityActionTest.java for correcting the testcase
* Fix: Provide indentation fix
* Fixing Deduplication in test
All deployments received an update to the averageUtilization CPU target. This should allow us to stay ahead of the traffic curve and create instances before CPU reaches the limit.
Frontend CPU allocation has caused a "noisy neighbors" problem with pods assigned to nodes where there's not enough bursting capacity, so I increased it.
Adjusted the rest of the deployments according to their utilization.
Missing resource requests, as well as missing metrics for when to evict a resource,
produced a situation where, under load, k8s struggled to assign pods. This
adds default resource requirements based on two weeks of metrics and
instructions for when a resource should be evicted.
* Fix error handling in CopyDetailReportsAction
The action tries to record errors per registrar in an ImmutableMap, without realizing that
there may be duplicate keys due to retries.
Switched to the `buildKeepingLast` method to build the map.
* Addressing comments and rebase
Given an entity with auto-filled id fields (annotated with
@GeneratedValue, with null as initial value), when inserting it
using Hibernate, the id fields will be filled with non-nulls even
if the transaction fails.
If the same entity instance is used again in a retry, Hibernate mistakes
it as a detached entity and raises an error.
The workaround is to make a new copy of the entity in each transaction.
This PR applies this pattern to affected entity types.
We considered applying this pattern to JpaTransactionManagerImpl's insert
method so that individual call sites do not have to change. However, we
decided against it because:
- It is unnecessary for entity types that do not have auto-filled id
- The JpaTransactionManager cannot tell if copying is cheap or
  expensive. It is better to expose this to the user.
- The JpaTransactionManager needs to know how to clone entities. A new
interface may need to be introduced just for a handful of use cases.
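A minimal illustration of the pattern with a hypothetical entity and a placeholder transactional insert:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import java.util.function.Supplier;

final class RetryWithFreshEntitySketch {

  @Entity
  static class ExampleEntity {
    @Id @GeneratedValue Long id; // filled in by Hibernate on insert, even if the txn later fails

    String payload;

    ExampleEntity() {}

    ExampleEntity(String payload) {
      this.payload = payload;
    }
  }

  /**
   * Builds a *new* entity instance inside every attempt, so a retry never re-submits an instance
   * whose id was populated by a failed transaction (which Hibernate would treat as detached).
   */
  static void insertWithRetries(Supplier<ExampleEntity> freshEntity, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        transactInsert(freshEntity.get());
        return;
      } catch (RuntimeException e) {
        if (attempt == maxAttempts) {
          throw e;
        }
      }
    }
  }

  private static void transactInsert(ExampleEntity entity) {
    // Placeholder for the real transactional insert (em.persist inside a transaction).
  }

  private RetryWithFreshEntitySketch() {}
}
```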
It will now only throw errors on domain updates if a new contact/registrant has
been specified where none was previously present. This means that domain updates
on unrelated fields (e.g. nameserver changes) will succeed even if there is
existing contact data that the update is not removing.
This is a follow-up to #2781.
BUG=http://b/434958659
Some of these have been around since the Datastore days and are no
longer relevant (dealing with things like Datastore foreign keys). Let's
simplify things.
This works fairly similarly to the registry lock request and
verification mechanism. The request action generates a UUID which is
emailed (in link form) to the user in question. The frontend will send a
request to the verify action with the UUID and hopefully the action
should be finalized.
EPP password reset requests can be sent by anyone with edit-registrar
permissions and must be approved by an admin POC email.
Registry lock password resets can only be sent by primary contacts, and
are verified/performed by the user in question.
This prohibits all contact data on create and update EPP flows for both domain
and contact flows. It also refactors how default values on FeatureFlags work, as
it's safer to specify a single default on the flag itself rather than have to
specify it independently at a number of callsites (and potentially end up having
an inconsistent value). Domain updates on existing domains that still have
contact data will fail unless all contact data is removed, as a forcing function
to require registrars to rectify the situation prior to being able to do any
other kind of domain changes.
Contact-related flows that are still allowed after this point: Updating a domain
to remove all contacts from it, and deleting a contact object.
This corresponds to the Feb 2024 response profile section 1.2 and
implementation guide 1.3 respectively, now that we comply (or are, at
least, closer to complying) with the Feb 2024 versions.
This should probably depend on https://github.com/google/nomulus/pull/2771
because that includes a small change included in the Feb 2024 version
This also updates the documentation to reference the proper areas of the
specifications.
From the response profile:
2.4.6. Registrar URL - The entity with the registrar role in the RDAP response
MUST contain a links member [RFC9083]. The links object MUST contain
the elements: value, identical to the RDAP Base URL for the
Registrar as provided in the IANA “Registrar IDs” registry (i.e.,
https://www.iana.org/assignments/registrar-ids); rel:about, and href
containing the Registrar URL. Note: in cases where the Registry Operator
acts as sponsoring Registrar (e.g., IANA Registrar ID 9999), the href shall
contain a URL from the Registry.
This is necessary so that the total number of requests/responses adds up
correctly even though some fraction of them are only being recorded. It uses
stochastic rounding so that the totals add up correctly even when the reciprocal
of the ratio isn't an integer.
This is a follow-up to PR #2772.
This ratio defaults to 1.0 (i.e. all metrics will be recorded), but we will set
it much lower in sandbox and production, probably something closer to 0.01. This
will reduce recorded metrics volume and thus StackDriver cost, while still
retaining enough data for overall performance monitoring.
This is handled stochastically, so as to not require any coordination between
Java threads or GKE pods/clusters, as alternative approaches would (i.e. using a
counter and recording every Nth, or throttling to a max metrics qps).
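A sketch of the sampling plus stochastic weighting described above (illustrative, not the actual metrics code):

```java
import java.util.concurrent.ThreadLocalRandom;

final class SampledMetricsSketch {

  /**
   * Returns the increment to add to the request counter for this request: 0 when the request is
   * not sampled, otherwise a stochastically rounded 1/ratio so that the expected total equals the
   * true number of requests even when 1/ratio is not an integer.
   */
  static long requestCountIncrement(double recordingRatio) {
    ThreadLocalRandom random = ThreadLocalRandom.current();
    if (random.nextDouble() >= recordingRatio) {
      return 0; // not sampled; no metric recorded for this request
    }
    double weight = 1.0 / recordingRatio; // e.g. ratio 0.01 -> each sample stands for 100 requests
    long floor = (long) Math.floor(weight);
    double fraction = weight - floor;
    // Round up with probability equal to the fractional part, so the expectation stays unbiased.
    return random.nextDouble() < fraction ? floor + 1 : floor;
  }

  private SampledMetricsSketch() {}
}
```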
In RDAP, domain queries are the most common by a factor of like 40,000
so we should optimize these as much as possible. We already have an EPP
resource / foreign key cache which does improve performance somewhat but
looking at some sample logs, it only cuts the RDAP request times by like
40% (looking at requests for the same domain a few seconds apart).
History entries don't change often, so we should cache them to make
subsequent queries faster as well. In addition, we're only caching two
fields per repo ID (modification time, registrar ID) so we can cache
more entries than we can for the EPP resource cache (which stores large
objects).
Pubapi actions should always use cache, regardless of the config
settings on caching.
In EppResource.java, the original `loadCached(Iterable<VKey>)`
method is renamed to `loadByCacheIfEnabled`. The original
`loadCached(VKey)` method is renamed to `loadByCache` and always
uses cache.
In EppResourceUtils.java, the original `loadByForeignKeyCached`
method is renamed to `loadByForeignKeyByCacheIfEnabled`, and a new
`loadByForeignKeyByCache` method, which always uses the cache, is added.
In ForeignKeyUtils.java, the original `loadCached` method is
renamed to `loadByCacheIfEnabled`, and a new `loadCached` method
is added which always uses cache.
Also added a `getContactsFromReplica` method in Registrar,
for use by RDAP actions.
A future PR will add the actions that save and use this object. That
future PR will also require loading RegistrarPoc objects given the
registrar ID, hence the change in that class.
This implements two types of changes:
1. changing the link type for things like the terms of service
2. adding the request URL to each and every link with the "value" field.
This is a bit tricky to implement because the links are generated in
various places, but we can implement it by adding it to the results
after generation.
See b/418782147 for more information
Add a new DnsWritersModule for use by the component classes.
To override the set of writers installed, we can easily overwrite this
file with a private version.
This is necessary because we'll use primary-contact emails as a way of
resetting passwords.
In the UI, don't allow editing of email address for primary contacts,
and don't allow addition/removal of the primary contact field
post-creation.
In the backend, make sure that all emails previously added still exist.
We're changing the way that allocation tokens work in suboptimal (i.e. incorrect) situations in the domain check, creation, and renewal process. Currently, if a token is not applicable, in any way, to any of the operations (including when a check has multiple operations requested) we return some variation of "Allocation token not valid" for all of those options. We wish to allow for a more lenient process, where if a token is "not applicable" instead of "invalid", we just pass through that part of the request as if the token were not there.
Types of errors that will remain catastrophic, where we'll basically return a token error immediately in all cases:
- nonexistent or null token
- token is assigned to a particular domain and the request isn't for that domain
- token is not valid for this registrar
- token is a single-use token that has already been redeemed
- token has a promotional schedule and it's no longer valid
Types of errors that will now be a silent pass-through, as if the user did not issue a token:
- token is not allowed for this TLD
- token has a discount, is not valid for premium names, and the domain name is premium
- token does not allow the provided EPP action
Currently, the last three types of errors cause that generic "token invalid" message but in the future, we'll pass the requests through as if the user did not pass in a token. This does allow for a default token to apply to these requests if available, meaning that it's possible that a single DomainCheckFlow with multiple check requests could use the provided token for some check(s), and a default token for others.
The flip side of this is that if the user passes in a catastrophically invalid token (the first five error messages above), we will return that result to any/all checks that they request, even if there are other issues with that request (e.g. the domain is reserved or already registered).
See b/315504612 for more details and background
We suspected this could be a cause of optimistic locking failures
(because long transactions would lead to optimistic locks not being
released) but this didn't end up being the case. Let's remove this to
reduce log spam.
1. This doesn't remove the SQL tables yet (this is necessary to pass
tests and also good practice just in case we need or want to look at
history for a little bit)
2. This also removes the Registrar, RegistrarPoc, and User base classes
that were only necessary because we were saving copies of those
objects in the old history classes.
We no longer need to union GKE+GAE logs since we've moved all production
traffic to GKE only.
For testing, I copied the affected *_test.sql files to Bigquery, removed
all the "-alpha" bits, and changed the dates to 20250301 and 20250331
and ran them to make sure they returned the expected data.
Now that we have effective global sessions thanks to #2734, there is no
longer a need to keep the number of pods on the EPP service static.
We are also not vulnerable to random pod restarts. K8s never guarantees
perpetual pod lifetime anyway, and not having to be at its mercy is
certainly a relief.
It turns out a period can be used in the URI, such as in
"urn:ietf:params:xml:ns:fee-0.12". I don't think the pipe character is used, at least
not according to the EPP URI namespace naming convention.
Ideally we'd use serialization, but using the default serialization runs
the risk of it being platform/JDK dependent, so a new deployment might
not be able to deserialize existing cookies. A custom serializer that
guarantees stability would have been needed.
This was added back in early 2018 to enable promotions, but
since then (and for many years) we've added the ability to run
promotions on the tokens themselves, rather than relying on custom Java
classes.
This will make the changes for b/315504612 much easier, as that will
split up token validation into "is this token valid in general?" and "is
this token valid for this domain/action?"
This can potentially help even more with serializable transaction
failures (optimistic locking exceptions, which are expected to occur
somewhat frequently).
With six attempts, we will sleep at most five times, for
100, 200, 400, 800, and 1600 ms, for a total of at most 3.1 seconds (much
less than the EPP maximum, which I believe (?) to be 30 seconds).
In addition, we add a 20% skew in an attempt to spread out
possibly-conflicting transaction retries.
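A sketch of the retry schedule described above (illustrative; the real code should only retry on retryable failures such as serialization errors):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

final class TransactionRetrySketch {

  private static final int MAX_ATTEMPTS = 6;
  private static final long BASE_SLEEP_MILLIS = 100;

  /** Runs the work up to six times, sleeping ~100/200/400/800/1600 ms (with ~20% skew) between attempts. */
  static <T> T retry(Supplier<T> work) {
    RuntimeException lastFailure = null;
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
      try {
        return work.get();
      } catch (RuntimeException e) {
        // In practice only retryable failures (e.g. serialization errors) should be caught here.
        lastFailure = e;
        if (attempt < MAX_ATTEMPTS - 1) {
          long baseSleep = BASE_SLEEP_MILLIS << attempt; // 100, 200, 400, 800, 1600 ms
          // A ~20% random skew spreads out retries from transactions that conflicted with each other.
          double skew = 1.0 + (ThreadLocalRandom.current().nextDouble() - 0.5) * 0.4;
          try {
            Thread.sleep((long) (baseSleep * skew));
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw lastFailure;
          }
        }
      }
    }
    throw lastFailure;
  }

  private TransactionRetrySketch() {}
}
```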
This changes the code to only save console histories of this type. We
keep the old Java code (and, necessarily, the corresponding SQL code)
for now because there's no harm in doing so and we want to avoid hastily
deleting too much.
The SQL statement was incorrectly flooring to zero one layer too deep, which was
negating all negative transaction report rows (which occur most frequently when
a domain in the autorenew grace period is deleted). I've changed it so that it
now only floors to zero at the report level, which still solves the issue
reported in http://b/290228682 but whose original fix caused the issue
http://b/344645788
This bug was introduced in https://github.com/google/nomulus/pull/2074
I tested this by running the new query against the DB for 2024 Q4 using the
registrar that was having issues and confirmed that the total renewal numbers
for .app now match with the sum total of what we invoiced for the last three
months of 2024.
This doesn't check for correctness (we have other scripts that do that)
but just that the service is available at all (the other scripts do not
do that).
This should, and will, be configured with a scheduled trigger in GCB (for us, in
the domain-registry-dev project) and configuration to send some sort of
pub/sub notification on failure (for us, this is already set up on
domain-registry-dev and it sends messages to the "Domain Registry
Notifications" chat channel).
We've seen this issue happen more often than not recently, where the GAE
canary deployment is stuck for about 10 min and then fails. The reason
is not clear, but deleting the canary version prior to a deployment always
fixes the issue.
We use the $environment property to set the console config. If it is not
given, 'alpha' is used, which has the same effect as 'production'.
TESTED=ran :jetty:copyConsole with
-Penvironment=(sandbox|production|alpha) and checked the resulting js
file.
For GKE all tasks should be on the backend; BSA was on its own service
because of an egress IP constraint.
Also made it possible to specify a timeout for the Cloud Scheduler job,
with the default (3m) suitable for most tasks.
There are two session cookies, JSESSIONID, which is set by Jetty, and
GCLB, which is set by the Gateway.
In one session, every request other than the first one (the <hello>)
should have the same GCLB value, and every request after a successful
<login> should have the same JSESSIONID.
With these two pieces of metadata, we should be able to trace all requests that
*should* belong to the same session and debug issues with session
mismatch (if any).
There can be delays in releasing predicate locks when we have
transactions that are long-lived -- even delays in releasing predicate
locks acquired by shorter-lived transactions. Logging the transaction
duration will allow us to get a sense as to transaction durations during
busy times.
This can happen on public endpoints (in pubapi) where the service is
behind IAP but all users (including not-logged-in ones) are allowed. IAP
will add an OIDC token with no email field in the request header.
This removes logic from an inner helper method so that it becomes more clear
from callsites within each test exactly which behavior is expected from those
test conditions.
We now pin to PostgreSQL v17 when running tests, which means that the minor
version might increase without our intervention. This causes (at least)
the comment in the golden schema to change, failing the test as a
result.
This PR adds the ability to strip lines that we deem to be comments from the
comparison, so we don't have to do trivial upgrades to the golden schema
whenever there's a minor version upgrade.
This allows us to easily tell which tag was deployed.
Also set the gateway to use named address so they are stable, and so
that we can attach an IPv6 record to it. Auto-provisioned addresses are
IPv4 only.
A failed pg_dump may not leave a file, failing the subsequent diffing and
causing the verifier to return success.
The verifier should abort in this case.
GKE logs are routed to a different dataset and the table is different.
The structs to look for are also different (jsonPayload vs textPayload
or protoPayload).
TESTED=Ran the resulting query in crash.
I obtained access to an IBM s390x VM so I thought I'd see how multi-arch
Nomulus is.
Our main application is in Java so it is already multi-arch, but several
tests use docker images that are by default x64. Luckily postgres has an
s390x port, but selenium does not. So I had to disable Screenshot tests
when the arch is not amd64.
This isn't a situation we'll encounter often, but if the client has an
allocation token that's valid for premium domains that gives a 0 cost,
we shouldn't require them to include the fee extension when creating the
domain. We already don't require it for standard domains.
We use this now for the DomainDeleteFlow in an attempt to figure out
what statements it's running (cross-referencing that with PSQL's own
statement logging to find slow statements).
This is kinda nonsensical because this use case is trying to apply a
single use token multiple times in the same domain:check request --
like, trying to use a single-use token for create, renew, and
transfer while having a $0 create price and a premium renewal price.
This change doesn't affect any actual business / costs, since SPECIFIED
token renewal prices were already set on the BillingRecurrence
Instead of using a separate RenewalPriceInfo object, just map the
behavior (if it exists) onto the BillingRecurrence with a special
carve-out, as always, for anchor tenants (note: this shouldn't matter
much since anchor tenants *should* use NONPREMIUM renewal tokens anyway,
but just in case, double-check).
This also fixes DomainPricingLogic to treat a multiyear create as a
one-year-create + n-minus-1-year-renewal for cases where either the
creation or the renewal (or both) are nonpremium.
We are required to respond to HTTP(S) requests on port 80/443 on the
same domain where we serve port 43 WHOIS requests. The proxy already
does this by redirecting to the web WHOIS lookup page on the marketing
website.
This PR makes it so that requests to port 80/443 can be routed to the
proxy for redirect.
TESTED=tested on crash and the redirect works.
SecretManager based keyring not included in keyring bindings, resulting
in runtime failure.
We should simplify keyring bindings. There is no use case for multiple
implementations. See b/388835696.
Known nested transact calls found in sandbox have been refactored away.
Enable logging in production to catch any missing cases. Logging is
throttled at 1 message per minute per VM.
k8s does not have a way to expose a global load balancer with TCP
endpoints, and setting up node port-based routing is a chore, even with
Terraform (which is what we did with the standalone proxy).
We will use Cloud DNS's geolocation routing policy to ensure that
clients connect to the endpoint closest to them.
- We never delete rows from DomainHistory (and even if we do in the
future, they'll be old / the references won't matter)
- This is likely creating lock contention when lots of requests come
through at once for domains with many DomainHistory entries
There are some breaking method changes in the 10.x.y versions and we're encountering exceptions when trying to run the flywayMigrate task thanks to those.
For now it only includes two options (domain deletion and domain
suspension). In the future, as necessary, we can add other actions but
this seems like a relatively simple starting point (actions like bulk
updates are much more conceptually complex).
This might help alleviate DB transaction contention on the PollMessage table. A
lower transaction isolation level is safe because acking a poll message is
idempotent: there are only two things it does, either delete a poll message or
take a recurring one from the past and set it to be a year in the future from
the date in the past. Both of these operations will always yield the same final
result even if executed multiple times simultaneously for some reason.
It doesn't need a higher transaction isolation level as it's only loading a given poll
message once, and we want to avoid putting any kind of locks on the PollMessage table
as it seems to be having contention issues. Note that the poll message request flow
is by far the most frequent code that touches the PollMessage table, as there are many
many requests every minute from dozens of registrars, but much fewer poll messages
than that to actually ACK.
We only include the deletion time if the domain is in the 5-day
PENDING_DELETE period after the 30 day REDEMPTION period. For all other
domains, we just have an empty string as that field.
This is behind a feature flag so that we can control when it is enabled
It was failing when any kind of transfer data was present, even completed
transfer data. Note that completed transfer data persists on a domain
indefinitely until/unless a new transfer is requested.
BUG= http://b/377328244
It seems like `/usr/bin/python` is no longer symlinked to the `python3`
binary in the `gcr.io/cloud-builders/git` image.
I've sent out a separate fix to upstream to change the shebang.
https://gerrit-review.git.corp.google.com/c/gcompute-tools/+/439501
But in the meantime, we need this temporary fix for the release to
build.
These flags are suggested by GitHub support to disable reusing caches
during Gradle build. They think that could fix the intermittent error
message:
```
Encountered a fatal error while running "/opt/hostedtoolcache/CodeQL/2.19.0/x64/codeql/codeql database finalize --finalize-dataset --threads=4 --ram=14576 --verbosity=progress++ /home/runner/work/_temp/codeql_databases/java". Exit code was 32 and last log line was: CodeQL detected code written in Java/Kotlin but could not process any of it. For more information, review our troubleshooting guide at https://gh.io/troubleshooting-code-scanning/no-source-code-seen-during-build . See the logs for more details.
```
This fails for me every day for some reason (starting about a month
ago). The same commit went through the workflow fine when the action was
triggered by a push.
I think there's no reason for us to have a cron run as the changes to the
master branch can only come from commit pushes.
Newer versions of GPG (v2.4.5 in my case) use different wording
than what's available in our build image (and Ubuntu, I suspect). For
example, it says "rsa2048" instead of "2048-bit RSA".
Make the tests work in both cases. Admittedly we cannot check for the
string RSA/rsa easily, but I don't think it matters much for tests.
It looks like /usr/bin/python *may* no longer exist in the latest cloud
builder git image. I ran the latest image and logged into it to verify
that /usr/bin/python3 does exist on 9/25, and again on 9/26 where it
re-appeared.
I think it is generally a good idea to not rely on it being there going
forward.
* Include discount price in domain pricing
* Partial progress in logic
* Tests and logic passing
* Change pricing for multi year create
* Tests for discount pricing logic
* Token currency check
* Add some comments
* Java formatting
* Discount price to Optional
* Change discount price to be optional nullable
* Re-add deleted tests
Depending on whether a "--gke" parameter (must be the last one) is passed,
the deployer constructs the corresponding URIs for GAE or GKE
accordingly.
TESTED=Used the deployer to deploy tasks to alpha and verified that they
run on GKE.
* Drop the `transact` call in Id services
All usages already routed through `tm().allocateId()`, which is
guaranteed to be in a transaction.
* Addressing reviews
An error occurred: {{ errorMessage }}.<br/><br/>Please double-check the
verification code and try again.
</div>
</div>
} @else {
<div class="console-app__password-reset-verify">
<h1 class="mat-headline-4">{{ type }} password reset</h1>
<password-input-form-component
[displayOldPasswordField]="false"
[formGroup]="passwordUpdateForm!"
(submitResults)="save($event)"
/>
</div>
}