What happened is that the cache's expireAfterWrite handler was being
called some number of milliseconds (say, 50-100) after the transaction
was started. That method used the transaction time instead of the
current time, so the entries were sticking around 50-100ms longer in
the cache than they should have been.
This fix contains two parts, each of which I believe would be sufficient
on its own to fix the issue (both are sketched below):
1. Use the currentTime parameter passed in to Expiry::expireAfterCreate
2. Use the transaction time in the cache's Ticker. This keeps everything
on the same schedule.
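For concreteness, here is a minimal sketch of both fixes against the Caffeine API; the CachedEntry value type and the transactionTimeMillis supplier are hypothetical stand-ins for the real cache value type and transaction clock:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;
import com.github.benmanes.caffeine.cache.Ticker;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

final class TransactionAlignedCache {

  /** Builds a cache whose clock and expiry decisions both follow transaction time. */
  static Cache<String, CachedEntry> create(LongSupplier transactionTimeMillis) {
    // Fix 2: a Ticker backed by transaction time, so every cache-internal
    // timestamp is on the same schedule as the expiry decisions.
    Ticker transactionTicker =
        () -> TimeUnit.MILLISECONDS.toNanos(transactionTimeMillis.getAsLong());

    return Caffeine.newBuilder()
        .ticker(transactionTicker)
        .expireAfter(
            new Expiry<String, CachedEntry>() {
              @Override
              public long expireAfterCreate(String key, CachedEntry value, long currentTime) {
                // Fix 1: measure the remaining lifetime from the currentTime
                // argument (read from the Ticker above), not from a
                // separately-read wall clock.
                return TimeUnit.MILLISECONDS.toNanos(value.expiresAtMillis()) - currentTime;
              }

              @Override
              public long expireAfterUpdate(
                  String key, CachedEntry value, long currentTime, long currentDuration) {
                return expireAfterCreate(key, value, currentTime);
              }

              @Override
              public long expireAfterRead(
                  String key, CachedEntry value, long currentTime, long currentDuration) {
                return currentDuration; // reads do not extend the lifetime
              }
            })
        .build();
  }

  /** Hypothetical value type carrying its own expiration instant. */
  record CachedEntry(long expiresAtMillis) {}
}
```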
During the release process, we are seeing the message "Gradle build daemon
disappeared unexpectedly (it may have been killed or may have crashed)",
which can seemingly be caused by OOMs.
I analyzed the SQL statements run during the following flows and EXPLAIN
ANALYZEd each of them to figure out whether there are any additional hash
indexes we could add that would be particularly helpful. Note: it's not
worth adding a hash index on the host_repo_id field in DomainHost
because so many rows (domains) use the same host.
- domain create
- domain delete
- domain info
- domain renew
- domain update
- host create
- host delete
- host update
I skipped the flows that use the read-only replica, as well as the contact
flows (we're getting rid of them) and the domain transfer/restore-related
flows, as those are extremely infrequent.
Updates (AKA merges) run an extra SELECT statement to figure out if the
resource exists so that Hibernate can merge the entity into the existing
object in its session. When we're inserting new rows (such as new poll
messages or resource creates), we know that we don't need to do that
merge. Doing this should save us some SELECT statements (this has been
borne out in alpha).
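A minimal sketch of the distinction in plain JPA terms; the helper names here are hypothetical:

```java
import javax.persistence.EntityManager;

final class EntitySaver {

  /** For rows we know are new (e.g. poll messages, resource creates). */
  static void saveNew(EntityManager em, Object entity) {
    // persist() schedules a plain INSERT; no existence-check SELECT is issued.
    em.persist(entity);
  }

  /** For rows that may already exist. */
  static Object save(EntityManager em, Object entity) {
    // merge() first SELECTs the current row so Hibernate can copy the
    // detached state onto the managed instance, then UPDATEs it.
    return em.merge(entity);
  }
}
```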
This should help in instances of popular domains dropping, since we
won't need to do an additional two database loads every time (assuming
the deletion time is in the future).
When running the action in sandbox on 1.5M domains, it failed a few times
while updating individual domains (requiring a manual restart of the entire
action). It's better to just log the individual failures for manual
inspection and then otherwise continue running the action to process the
vast majority of other updates that won't fail.
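A minimal sketch of the log-and-continue behavior; updateDomain and the domain list are hypothetical stand-ins for the action's real update step:

```java
import com.google.common.flogger.FluentLogger;

final class ResaveDomainsSketch {
  private static final FluentLogger logger = FluentLogger.forEnclosingClass();

  void run(Iterable<String> domainNames) {
    for (String domainName : domainNames) {
      try {
        updateDomain(domainName);
      } catch (RuntimeException e) {
        // Log for manual inspection; don't let one bad domain kill the action.
        logger.atSevere().withCause(e).log("Failed to update domain %s", domainName);
      }
    }
  }

  private void updateDomain(String domainName) {
    // Stand-in for the real per-domain update logic.
  }
}
```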
BUG = http://b/439636188
Add a flag to the CreateCdnsTld command to bypass the DNS name format
check in Sandbox (which limits names to `*.test.`). With this flag, we
can create TLDs for RST testing in Sandbox.
Note that if the new flag is wrongly set for a disallowed name, the
request to the Cloud DNS API will fail; the format check in the command
just provides a user-friendly error message.
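A hedged sketch of the new flag in the JCommander style the registry tool commands use; the flag name, class shape, and check method are assumptions:

```java
import com.beust.jcommander.Parameter;
import com.beust.jcommander.Parameters;

@Parameters(commandDescription = "Create a Cloud DNS TLD")
class CreateCdnsTldSketch {

  @Parameter(
      names = "--bypass_name_format_check", // hypothetical flag name
      description = "Skip the Sandbox check limiting DNS names to *.test.")
  boolean bypassNameFormatCheck = false;

  void validateDnsName(String dnsName) {
    if (!bypassNameFormatCheck && !dnsName.endsWith(".test.")) {
      // Fail fast with a friendly message; a disallowed name would be
      // rejected by the Cloud DNS API later anyway.
      throw new IllegalArgumentException("Sandbox DNS names must match *.test.");
    }
  }
}
```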
I went through all the SQL statements generated by some sample
DomainCreateFlow and DomainDeleteFlow cases to find situations where we
were either SELECTing from, or UPDATEing, tables with a direct "field =
value" format. These are the situations that I found where we can add
hash indexes. This does two things:
1. Makes these queries slightly faster, since these are usually queries on
columns that are either unique or very close to unique, and O(1) is
faster than O(log(n))
2. Spreads around the optimistic predicate locks on the previously-used
btree indexes. Many of our serialization errors came from the fact
that we were autogenerating incrementing ID values for various
tables, meaning that SELECTs, INSERTs, and UPDATEs would all try to
take predicate locks out on the same page of the btree index. Using a
hash index means that the page locks will be spread out to various
index pages, rather than conflicting with each other.
Running load tests on alpha, I see significant improvements in speed and
error rates. Speed is hard to quantify due to the way the load tests
distribute tasks among the queues, but it could be more than a 50%
improvement, and serialization errors in the logs drop by more than
90%.
This implements the first part of Minimum Data Set phase 3, wherein we delete
all contact data. This action is necessary to leave a permanent record on the
domain (in the form of a domain history entry) documenting when the contacts
were removed by the administrative user.
Then, after this has finished removing all contact associations, we can simply
empty out or drop the Contact/ContactHistory tables and associated join tables.
* Skip user loading for proxy service account
Reduces database load by skipping the User entity lookup for the proxy
service account during OIDC authentication.
The high volume of EPP "hello" and "login" commands from the proxy
service account results in a constant database load. These lookups
are unnecessary as the proxy service account is not expected to have a
corresponding User object.
This change optimizes the authentication flow by checking for the proxy
service account email *before* attempting to load a User from the
database. This bypasses the database transaction entirely for these
high-volume requests.
This approach is more efficient than caching, as it eliminates the
database lookup for the proxy service account altogether, rather than
just caching the result.
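A hedged sketch of the short-circuit; the names here are hypothetical stand-ins for the real OIDC authentication mechanism:

```java
import java.util.Optional;

final class OidcAuthSketch {
  private final String proxyServiceAccountEmail;

  OidcAuthSketch(String proxyServiceAccountEmail) {
    this.proxyServiceAccountEmail = proxyServiceAccountEmail;
  }

  Optional<AuthLevel> authenticate(String email) {
    // Check for the proxy service account *before* any database access, so
    // the high-volume hello/login traffic never opens a transaction.
    if (proxyServiceAccountEmail.equals(email)) {
      return Optional.of(AuthLevel.APP);
    }
    // Everyone else still pays for the User lookup.
    return loadUser(email).map(user -> AuthLevel.USER);
  }

  private Optional<User> loadUser(String email) {
    return Optional.empty(); // stand-in for the real database lookup
  }

  enum AuthLevel { APP, USER }
  static final class User {}
}
```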
* Comment added and service account lookup time improved
* Comment updated for more clarity
* Add cache for User entities in OIDC auth flow
* refactor: Address review feedback
- Refactor the database call into a single, reusable method
- Increase the default cache size to 200 (see the sketch below)
- Remove .recordStats() and use a Mockito spy for testing instead
- Split the unit tests into a separate implementation test that uses Mockito spies instead of checking internal cache stats
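A minimal sketch of the resulting cache shape, assuming Caffeine; the TTL and loader method name are assumptions, not the actual values used:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;
import java.util.Optional;

final class UserCacheSketch {
  private final LoadingCache<String, Optional<User>> userCache =
      Caffeine.newBuilder()
          .maximumSize(200) // the increased default size from the review feedback
          .expireAfterWrite(Duration.ofMinutes(1)) // hypothetical TTL
          .build(this::loadUserFromDatabase); // the single, reusable DB method

  private Optional<User> loadUserFromDatabase(String email) {
    return Optional.empty(); // stand-in for the real database lookup
  }

  static final class User {}
}
```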
We've moved these over to the User class, so we should remove them for
clarity. In addition, we should make it clear (in Java at least) that
the field in the RegistryLock object refers to the email address used
for the lock in question.
This allows us to check / modify the CharlestonRoad registrar in the
console, and also to test actions (like password reset) using that
registrar in the prod environment.
* Fix OOM in UploadBsaUnavailableDomains action
The action was using string concatenation to generate the upload content.
This causes an OOM when the string length exceeds 25MB on our current VM.
This PR switches to streaming upload.
Also added an HTTP upload test.
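A hedged sketch of the streaming approach; the names are hypothetical. The key change is writing each domain straight to the request body rather than first accumulating a 25MB+ String:

```java
import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;

final class StreamingUploadSketch {
  static void streamUnavailableDomains(HttpURLConnection connection, Iterable<String> domains)
      throws IOException {
    connection.setDoOutput(true);
    connection.setChunkedStreamingMode(0); // stream instead of buffering the whole body
    try (Writer out =
        new BufferedWriter(new OutputStreamWriter(connection.getOutputStream(), UTF_8))) {
      for (String domain : domains) {
        out.write(domain);
        out.write('\n');
      }
    }
  }
}
```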
The error happened in the case where an unblockable name reported with
'Registered' as the reason has been deregistered. We tried to check the
deletion time of the domain to decide if this is a transient error that
is not worth reporting. However, we forgot that we do not have the
domain key in this case.
Since this is a best-effort action and the case rarely happens, we
decided not to make the optimization (staleness check) in this case.
Note: this still includes "contacts" for registrars, which are actually
a different concept that we call RegistrarPoc. That's different from
"Contact" objects, e.g. registrant.
The TLD is technically valid, but it doesn't exist for us -- we should
return 404 instead of 400 in these situations, according to the RDAP
conformance docs.
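A minimal sketch of the intended mapping; existingTlds and NotFoundException are hypothetical stand-ins for the real TLD lookup and the exception type that maps to an HTTP 404:

```java
// Hypothetical names: existingTlds is the set of TLDs we serve, and
// NotFoundException maps to HTTP 404 in the response layer.
if (!existingTlds.contains(requestedTld)) {
  // Syntactically valid but unknown to us: 404 per the RDAP conformance
  // docs, rather than 400 (which would imply a malformed request).
  throw new NotFoundException(requestedTld + " not found");
}
```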
Load balancer / internal redirections can result in the final request
URL lacking "https" by the time it reaches the servlet. As a result,
even if you use https in the request, the resulting URL can be plain
http.
We need to include the actual (HTTPS) URL in the output, so replace it.
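A minimal sketch of the replacement, assuming a servlet-style request object:

```java
String url = request.getRequestURL().toString();
if (url.startsWith("http://")) {
  // TLS terminates at the load balancer, so the servlet sees plain http;
  // advertise the real externally-visible https URL instead.
  url = "https://" + url.substring("http://".length());
}
```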
This increases the Hikari fetch size from 20 to 40 in order to decrease
the number of round trips.
This also sets a lower CPU allocation, as we seem to have overshot CPU
consumption.
This also sets min replicas to 8 for EPP and max to 16, as we've been
running on 8-10 for the last week.
Tested locally and on alpha with dummy values (and throwing an
exception).
I was able to reuse a bit of code from the EPP password reset, but not
all of it.
This is the first in a series of PRs to implement the expiry access period
(XAP). The overall fee schedules will be set in YAML config files, so the only
DB change necessary should be this single new boolean column on the Tld entity,
which defaults to false so that XAP must be explicitly turned on for a given
TLD.
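A hedged sketch of what that column might look like on the entity; the field name is an assumption:

```java
import javax.persistence.Column;

// Hypothetical field name; the real column name may differ. Defaulting to
// false means XAP stays off unless explicitly enabled per TLD.
@Column(nullable = false)
boolean expiryAccessPeriodEnabled = false;
```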
BUG=http://b/437398822
PostgreSQL 17.6 introduces two new lines in pg_dump output as a
security feature: `\restricted {HASH}` and `\unrestricted {HASH}`.
We filter out lines starting with these two prefixes when comparing
schemas.
The db upgrade also adds two empty lines to the pg_dump output. We
now ignore all empty lines when comparing schemas.
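A minimal sketch of the comparison filter, assuming the dumps are compared line by line (the method name is hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Drop the new security-feature lines and all blank lines before diffing.
static List<String> normalizeDumpForComparison(Stream<String> dumpLines) {
  return dumpLines
      .filter(line -> !line.isBlank())
      .filter(line -> !line.startsWith("\\restricted"))
      .filter(line -> !line.startsWith("\\unrestricted"))
      .collect(Collectors.toList());
}
```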
It's necessary to remove the GAE-related code (and use GKE launch
commands instead), and we might as well remove contact-related fields
and actions because of the upcoming move to the minimum data set.
This reverts commit 5cef2dd8b5.
We faced a CPU quota issue with the standard machine type, so we are rolling
back to c4 for now to monitor server performance and decide if we want to try
the downgrade again in the future.
We probably want this to run before the billing recurrence expansion
pipeline just in case there are any domains that should be deleted
before their billing recurrence gets expanded.
nodeSelector can limit the scheduling capabilities of k8s, which leads to delays in assigning new workloads. Since we do not require any particular machine for execution, it can be removed.
* Fix: Robustly parse certs and provide specific errors
* Add test for expired certificate failure
* fixing indentation
* fixing indentation
* Update SecurityActionTest.java
* Update SecurityActionTest.java to correct the test case
* Fix: Provide indentation fix
* Fix deduplication in test
All deployments received an update to averageUtilization CPU. This should allow us to stay ahead of the traffic curve and create instances before CPU reaches the limit.
Frontend CPU allocation has caused a "noisy neighbors" problem, with pods assigned to nodes that don't have enough bursting capacity, so I increased it.
Adjusted the rest of the deployments according to their utilization.