The primary annoyance with this is that it means we need to (or at least
should) split all tests that use the fee extension into two separate
tests -- one that simulates non-prod environments, and one that
simulates prod environments. This leads to duplication of many tests but
that's fine since this is theoretically temporary.
* Configure cloud scheduler to trigger MoSAPI SLA status to cloud monitoring in production
- We have configured this job to trigger every 3 minutes so that we get near-real-time updates for our task.
- This will not emit metrics for now, as we have not written the metrics-triggering logic yet.
- Logs are added.
* Change Trigger scheduling from 3 minutes to 5 minutes
This primarily affects the EPP greeting. We were already erroring out whenever any
contact flow was attempted; this should just prevent registrars from even
trying them at all.
This PR is designed to be minimally invasive, and does not remove any of the
contact flows or Jakarta XML/XJC objects/files themselves. That can be done
later as a follow-up.
Also note that the contact namespace urn:ietf:params:xml:ns:contact-1.0 is still
present for now in RDE exports, but I'll remove that subsequently as well.
BUG= http://b/475506288
This means that attempting to add a status that is already present will now
fail, and attempting to remove a status that is not present will also now fail.
This also refactors the existing checks into a single verify method, rather than
having to call three separate methods from every callsite.
BUG= http://b/474645068
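A minimal sketch of the consolidated check, using hypothetical names and plain string statuses rather than the real flow types and exceptions:
```java
import com.google.common.collect.ImmutableSet;
import com.google.common.collect.Sets;

// Hypothetical sketch; the real method operates on EPP status values and
// throws flow-specific exceptions rather than IllegalArgumentException.
final class StatusChangeVerifier {
  static void verifyStatusChangesAllowed(
      ImmutableSet<String> existing, ImmutableSet<String> toAdd, ImmutableSet<String> toRemove) {
    ImmutableSet<String> alreadyPresent = Sets.intersection(toAdd, existing).immutableCopy();
    if (!alreadyPresent.isEmpty()) {
      throw new IllegalArgumentException("Cannot add statuses already present: " + alreadyPresent);
    }
    ImmutableSet<String> notPresent = Sets.difference(toRemove, existing).immutableCopy();
    if (!notPresent.isEmpty()) {
      throw new IllegalArgumentException("Cannot remove statuses not present: " + notPresent);
    }
  }

  private StatusChangeVerifier() {}
}
```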
* add cloud profiler to dockerfile and start script
* add apt-get update
* change in cb machine type for nomulus
* fix typo
* add max worker limit to gradle tests
* Switch to root before doing apt-get
* correct dockerfile
* jetty/Dockerfile
* profiler service conditional to kubernetes container name
Many of the actual fee extension changes are based off Weimin's PR
https://github.com/google/nomulus/pull/2912, though this makes some
additional changes based on the XML schema and description from RFC 8748.
This adds tests for the DomainCheckFlow which is the most complex and
thorough user of the fee extension, but we'll want to add further tests
to the other domain flows to make sure they're handled correctly.
It just makes it possible to delete allocation tokens, otherwise we need
to do a linear search over the entire Domain and DomainHistory tables if
we ever want to delete something.
The RST testing expects us to fail if they try to remove an IP from a
host that doesn't already have that IP, or to add one that already
exists (ditto on both for a domain's nameservers). I don't really see an
issue with our previous no-op implementation, but we need to do this to
pass the tests.
We need to have this enabled in sandbox, but we wish to wait to enable
it for production to make sure that the implementation is correct and
that clients can use it.
Soon we'll want to do something similar (but the opposite) with the old
fee extensions, where we **only** serve them in production (or maybe
unit test as well). That will allow us to pass the RST tests that depend
on only having the fee extension 1.0.
This is / will be required in https://datatracker.ietf.org/doc/rfc8748/.
I split this out from the rest of the fee-extension testing so that it
can be easily visible.
This commit introduces a new backend endpoint at `/_dr/task/triggerMosApiServiceState` that initiates the process of fetching the latest service states for all TLDs from the MoSAPI endpoint and exporting them as metrics to Cloud Monitoring.
The key changes include:
- A new `TriggerServiceStateAction` class that handles the GET request to the new endpoint.
- Logic within `MosApiStateService` to concurrently fetch states for all configured TLDs.
- A new `MosApiMetrics` class (currently a placeholder) responsible for sending the collected states to the monitoring service.
- Unit tests for the new action and the updated service logic.
This endpoint will be called periodically to ensure that the MosApi service health metrics are kept up-to-date.
This is coming from the schema https://datatracker.ietf.org/doc/rfc8748/
section 6.1. The class that we use for "premium" notes moved from the
command to the object itself.
I don't know where in the spec these are explicitly disallowed, but it
seems like good practice and we'll fail the RST tests if we don't
disallow them.
* Add GetServiceState action for MoSAPI service monitoring
Implements the `/api/mosapi/getServiceState` endpoint to retrieve service health summaries for TLDs from the MoSAPI system.
- Introduces `GetServiceStateAction` to fetch TLD service status.
- Implements `MosApiStateService` to transform raw MoSAPI responses into a curated `ServiceStateSummary`.
- Uses concurrent processing with a fixed thread pool to fetch states for all configured TLDs efficiently while respecting MoSAPI rate limits.
junit test added
* Refactor MoSAPI models to records and address review nits
- Convert model classes to Java records for conciseness and immutability.
- Update unit tests to use Java text blocks for improved JSON readability.
- Simplify service and action layers by removing redundant logic and logging.
- Fix configuration nits regarding primitive types and comment formatting.
* Consolidate MoSAPI models and enhance null-safety
- Moves model records into a single MosApiModels.java file.
- Switches to ImmutableList/ImmutableMap with non-null defaults in constructors.
- Removes redundant pass-through methods in MosApiStateService.
- Updates tests to use Java Text Blocks and non-null collection assertions.
* Improve MoSAPI client error handling and clean up data models
Refactors the MoSAPI monitoring client to be more robust against
infrastructure failures
* Refactor: use nullToEmptyImmutableCopy() for MoSAPI models
Standardize null-handling in model classes by using the Nomulus
`nullToEmptyImmutableCopy()` utility. This ensures consistent API
responses with empty lists instead of omitted fields.
* Add RST support in Sandbox
Added RST test label files as resources.
Added a RstTmchUtils class that loads appropriate labels according to
TLD pattern.
Temporarily changed label fetching in production to include the TLD
string, so that the new class may know which set of labels to use.
* Addressing comments
* Addressing comments
We do most of these on host create already so we should also do them on
host checks. The only added change is the character validation (our
existing hostnames all match these).
2306 signifies something that is syntactically valid but semantically
invalid (like if someone tried to register a .com domain). These errors
are for domain syntax that could never be valid, thus we should throw a
syntax exception instead of a policy exception.
Now that we have transitioned to the minimum dataset, we no longer
support any actions on contacts (and by the time this is merged /
deployed, all contacts will be deleted). We should just throw an
appropriate exception on all contact-related flows. We don't delete the
flows themselves, so that we can have an appropriate error message.
We also keep all the flows and XML templates around individually for now because we may be
required to continue to differentiate the requests in ICANN activity
reporting (e.g. srs-cont-create vs srs-cont-delete)
This is a follow-up to PR #2908, which relaxed this restriction on bare TLDs
only, but now we also allow it systemwide on domains and hostnames as well. The
rules against hyphens in these positions are still enforced on all parts of the
domain name except the last one. Correct handling of multi-part TLDs in this
regard is out of scope in this PR; a multi-part TLD that looked something like
".zz--foobar.foobar" would still fail validation. (But of course you cannot a
priori know just from looking at a 3-part string whether it might be a hostname
on a normal TLD, or a domain name on a 2-part TLD.)
This also has some annoying interactions with a trailing dot (indicating the
root), which need to be preserved, but otherwise don't affect how TLD validation
is handled.
BUG= http://b/471013082
The canonicalizeHostname() helper method is only suitable for use with domain
names or host names. It does not work on bare TLDs, because a bare TLD can
have hyphens in the third and fourth position without necessarily being an IDN.
Note that the configure TLD command already correctly allows TLDs with such
names to be created.
Note that we are still enforcing that the TLDs to be added exist, so they have
to pass all TLD naming requirements that are enforced on creating TLDs, and we
are still lowercasing the TLD names passed as arguments here (though we're no
longer punycoding them, although arguably that's not super useful on
command-line params anyway).
BUG= http://b/471013082
We used to publish test artifacts to a Maven repo on GCS, for use by
schema tests. For this to work with Kokoro, the GCS bucket must be
accessible to all users.
To comply with the no-public-user requirement, we store the necessary
jars at a well-known bucket and map them into Kokoro. This strategy
cannot be used on the Maven repo because only a small number of files
with fixed names may be mapped. With the Maven repo, there are too many
files to map.
This is a decently simple wrapper around the previously-created
BulkDomainTransferAction that batches a provided list of domains up and
sends them along to be transferred.
This PR finds instances where we previously checked if the feature flag
for contacts-prohibited was set and removes those checks, making the
contacts-prohibited behavior the only behavior. Because the tests didn't
have that feature flag set, this means we need to change a ton of tests
to remove contact references.
* Add mosapi client to interact with ICANN's monitoring system
This change introduces a comprehensive client to interact with ICANN's Monitoring System API (MoSAPI). This provides direct, automated access to critical registry health and compliance data, moving Nomulus towards a more proactive monitoring posture.
It adds a core, stateless MosApiClient that manages communication and authentication with the MoSAPI service using TLS client certificates.
* Resolve review feedback & upgrade to OkHttp3 client
This commit addresses and resolves all outstanding review comments, primarily encompassing a shift to OkHttp3, security configuration cleanup, and general code improvements.
* **Review:** Addressed and resolved all pending review comments.
* **Feature:** Switched the underlying HTTP client implementation to [OkHttp3](https://square.github.io/okhttp/).
* **Configuration:** Consolidated TLS Certificates-related configuration into the dedicated configuration area.
* **Cleanup:** Removed unused components (`HttpUtils` and `HttpModule`) and performed general code cleanup.
* **Quality:** Improved exception handling logic for better robustness.
* Refactor and fix Mosapi exception handling
Addresses code review feedback and resulting test failures.
- Flattens package structure by moving MosApiException and its test.
- Corrects exception handling to ensure MosApiAuthorizationException
propagates correctly, before the general exception handler.
- Adds a default case to the MosApiException factory for robustness.
- Uses lowercase for placeholder TLDs in default-config.yaml.
* Refactor and improve Mosapi client implementation
Simplifies URL validation with Guava Preconditions and refines
exception handling to use `Throwables`.
* Refactor precondition checks using project specific utility
This will be necessary if we wish to do larger BTAPPA transfers (or
other types of transfers, I suppose). The nomulus command-line tool is
not fast enough to quickly transfer thousands of domains within a
reasonable timeframe.
After each deployment in sandbox or production, move the artifacts from
the corresponding release to a well-known location so that they can be
mapped to Kokoro in presubmit tests. The Kokoro-mapping does not need
public access to the GCS bucket.
The artifacts include the postgresql schema jar, the nomulus release
jar, and the uber jar of the nomulus schema integration test classes.
Every jar name consists of a fixed prefix and the environment. Each jar
of a new deployment overrides the previous copy.
These are old/pointless now that we've migrated to GKE. Note that this
doesn't update anything in the docs/ folder, as that's a much larger
project that should be done on its own.
Add a flag indicating that a Sandbox TLD should use the production
servers.
No additional TLD name pattern checks. Cloud DNS has an allowlist for
names that may use production servers.
Also updated default descriptive name generation: dropping the trailing
'.', and replacing remaining dots with '_'.
This contains the same codepoints from the
core/src/main/java/google/registry/idn/Latin-IDN.xml file, just in the old .txt
IDN format that Nomulus actually ingests.
We don't need to support the mix of GAE and GKE any more so we can get
rid of the GaeService bits and unify everything under one constant
service. This also allows us to reduce the number of services down to
four (FE, BE, PUBAPI, console) which is nice.
We've previously been using Scrypt since PR #2191 which, while being a
memory-hard slow function, isn't the optimal solution according to the
OWASP recommendations. While we could get away with increasing the
parallelization parameter to 3, it's better to just switch to the
most-recommended solution if we're switching things up anyway.
For the transition, we do something similar to PR #2191 where if the
previous-algorithm's hash is successful, we re-hash with Argon2id and
store that version. By doing this, we should not need any intervention
for registrars who log in at any point during the transition period.
Much of this PR, especially the parts where we re-hash the passwords in
Argon2 instead of Scrypt upon login, is based on the code that was
eventually removed in #2310.
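For reference, a minimal sketch of Argon2id hashing with Bouncy Castle; the parameter values here are the generic OWASP-style defaults, not necessarily what Nomulus ships with:
```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import org.bouncycastle.crypto.generators.Argon2BytesGenerator;
import org.bouncycastle.crypto.params.Argon2Parameters;

final class Argon2Example {
  static byte[] hashPassword(String password, byte[] salt) {
    Argon2Parameters params =
        new Argon2Parameters.Builder(Argon2Parameters.ARGON2_id)
            .withSalt(salt)
            .withMemoryAsKB(19_456) // ~19 MiB
            .withIterations(2)
            .withParallelism(1)
            .build();
    Argon2BytesGenerator generator = new Argon2BytesGenerator();
    generator.init(params);
    byte[] hash = new byte[32];
    generator.generateBytes(password.getBytes(StandardCharsets.UTF_8), hash);
    return hash;
  }

  public static void main(String[] args) {
    byte[] salt = new byte[16];
    new SecureRandom().nextBytes(salt);
    // On login: if the stored Scrypt hash verifies, re-hash with Argon2id and store that instead.
    System.out.println(hashPassword("hunter2", salt).length); // 32-byte hash
  }
}
```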
Reformat the current schema file for RFC 8748 final version. This was
adapted from v0.12 and is not fully consistent with the final schema.
This helps highlight the differences we missed in PR 2855 when we check
in the official schema.
The servlets, at this point now that we're off GAE, are only used for
the test server (and, indirectly, in one BSA test). Instead of having
them all remain separate, we can unify them in one test servlet that
lives in the test/ folder.
This removes one avenue of potential confusion w/r/t how request routing
actually works and where we would want to add new routing.
We need to stop using maven repo on GCS to store artifacts for the
schema compatibility tests. After public access is removed from GCS
buckets, Kokoro won't be able to access it: normal access will be
denied, and the repo is too large to map (copy) to Kokoro VM as a
resource.
This PR uploads the relevant jars to each release's folder. See
go/dr-gcs-public-access-prevention for details.
With GKE, we don't need the individual servlets because the services
aren't partitioned out the same way they were in GAE.
We keep FrontendServlet and BackendServlet around for now as they serve
as the backbone for the local RegistryTestServer (for testing things
like the console).
Did some cursory tests on alpha and things seem to be unaffected -- I
was able to curl RDAP (pubapi) and create domains
It's been over half a year now since we last used any of these and we definitely
no longer have any intentions of ever using App Engine again.
BUG= http://b/457471639
Still part of b/454947209, removing references to WHOIS where we can. We
keep the registrar type and the column names (at least for now) because
changing those is much more complicated.
We no longer need this now that no contacts can be applied to any domains at all.
A follow-up PR in subsequent weeks will delete the column from the DB schema.
BUG= http://b/448619572
Previously, we would have separate database calls for mapping from
foreign key to repo ID and then from repo ID to object. This PR modifies
those calls to load the resource directly (the old system was an
artifact of the Datastore key-value storage system).
In this PR, we merge the load-resource-by-foreign-key calls into a
single database load, as well as adding a separate cache object for
(foreign key) -> (resource). Now we cache, and have separate cleaner
code paths, for fk -> resource, fk -> repo ID, and repo ID -> resource.
Also removes the unused RdeFragmenter class
* Support Fee Extension standard in rfc 8748
Adding support to the final version of RFC 8748.
Compared with draft-0.12, the only meaningful change is in the namespace.
The rest is either schema-tightening that reflects actual usage, or
optional server-side features that we do not support.
We reuse draft-0.12 tests, only changing namespace uris in the input and
output files for the new version.
* Addressing reviews
We haven't been serving these for a while; let's finally get rid of them.
We keep some Soy rules around in the presubmits file because we use some
Soy files as XML templates for EPP actions.
This doesn't change any underlying implementation details, and is mainly useful to
reduce the number of diffs in PR #2852 (which does change implementation
details), thus making that easier to review.
This allows us to specify a getter delegation to bypass Hibernate's limitations
on field types for the purposes of, e.g., using a sorted set in toString()
output rather than the base Hibernate unsorted HashSet type.
BUG=http://b/448631639
This affects FKs pointing to both Contact and ContactHistory. This is in
preparation to us deleting all rows in those two tables, and then subsequently
removing all application logic having to do with contacts entirely.
This is steps one and two of b/454947209
We already haven't been serving WHOIS for a while, so there's no point
in keeping the old code around. This can simplify some code paths in the
future (like, certain foreign-key-loads that are only used in WHOIS
queries).
Basically, what happened is that the cache's expireAfterWrite was being
called some number of milliseconds (say, 50-100) after the transaction
was started. That method used the transaction time instead of the
current time, so as a result the entries were sticking around 50-100ms
longer in the cache than they should have been.
This fix contains two parts, each of which I believe would be sufficient
on their own to fix the issue:
1. Use the currentTime passed in in Expiry::expireAfterCreate
2. Use the transaction time in the cache's Ticker. This keeps everything
on the same schedule.
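A minimal Caffeine sketch illustrating fix (2), with a hypothetical clock supplier standing in for the transaction clock; the real cache, key/value types, and loader differ:
```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class TransactionClockCacheExample {
  static LoadingCache<String, String> buildCache(Supplier<Long> clockMillis) {
    return Caffeine.newBuilder()
        // Keep the cache on the same clock the transaction logic uses, so an
        // entry written at transaction time T really expires at T + 10s.
        .ticker(() -> TimeUnit.MILLISECONDS.toNanos(clockMillis.get()))
        .expireAfterWrite(Duration.ofSeconds(10))
        .build(key -> "value-for-" + key); // stand-in loader
  }

  public static void main(String[] args) {
    LoadingCache<String, String> cache = buildCache(System::currentTimeMillis);
    System.out.println(cache.get("example"));
  }
}
```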
During the release process, we are seeing the message "Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)", which can seemingly be caused by OOMs.
I analyzed SQL statements run during the following flows and EXPLAIN
ANALYZEd each of them to figure out if there are any additional hash
indexes we could add that could be particularly helpful. Note: it's not
worth adding a hash index on the host_repo_id field in DomainHost
because so many rows (domains) use the same host.
- domain create
- domain delete
- domain info
- domain renew
- domain update
- host create
- host delete
- host update
I skipped the ones that use the read-only replica, as well as contact
flows (we're getting rid of them), and domain transfer/restore-related
flows as those are extremely infrequent.
Updates (AKA merges) run an extra SELECT statement to figure out if the
resource exists so that it can merge the entity into the existing object
in Hibernate's schema. When we're inserting new rows (such as new poll
messages or resource creates), we know that we don't need to do that
merge. Doing this should save us some SELECT statements (this has been
borne out in alpha).
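A minimal sketch of the distinction in plain JPA terms (not the exact Nomulus transaction-manager API): persist() assumes the row is new and issues only the INSERT, while merge() first runs a SELECT to decide between INSERT and UPDATE.
```java
import jakarta.persistence.EntityManager;

final class InsertVsMergeExample {
  // For rows we know are brand new (poll messages, history entries, resource
  // creates), persist() skips the existence-check SELECT that merge() runs.
  static void saveNew(EntityManager em, Object newEntity) {
    em.persist(newEntity); // INSERT only
  }

  // For rows that may already exist, merge() SELECTs first, then INSERTs or UPDATEs.
  static void saveNewOrExisting(EntityManager em, Object entity) {
    em.merge(entity);
  }

  private InsertVsMergeExample() {}
}
```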
This should help in instances of popular domains dropping, since we
won't need to do an additional two database loads every time (assuming
the deletion time is in the future).
When running the action in sandbox on 1.5M domains, it failed a few times
updating individual domains (requiring a manual restart of the entire action).
It's better to just log the individual failures for manual inspection and then
otherwise continue running the action to process the vast majority of other
updates that won't fail.
BUG = http://b/439636188
Add a flag to the CreateCdnsTld command to bypass the dns name format
check in Sandbox (limiting names to `*.test.`). With this flag, we
can create TLDs for RST testing in Sandbox.
Note that if the new flag is wrongly set for a disallowed name, the
request to the Cloud DNS API will fail. The format check in the command
just provides a user-friendly error message.
I went through all the SQL statements generated by some sample
DomainCreateFlow and DomainDeleteFlow cases to find situations where we
were either SELECTing from, or UPDATEing, tables with a direct "field =
value" format. These are the situations that I found where we can add
hash indexes. This does two things:
1. Makes these queries slight faster, since these are usually queries on
columns that are either unique or very close to unique, and O(1) is
faster than O(log(n))
2. Spreads around the optimistic predicate locks on the previously-used
btree indexes. Many of our serialization errors came from the fact
that we were autogenerating incrementing ID values for various
tables, meaning that SELECTs, INSERTs, and UPDATEs would all try to
take predicate locks out on the same page of the btree index. Using a
hash index means that the page locks will be spread out to various
index pages, rather than conflicting with each other.
Running load tests on alpha I see significant improvements in speed and
error rates. Speed is hard to quantify due to the nature of the way the
load tests distribute tasks among the queues but it could be more than
50% improvement, and serialization errors in the logs drop by more than
90%.
This implements the first part of Minimum Data Set phase 3, wherein we delete
all contact data. This action is necessary to leave a permanent record on the
domain (in the form of a domain history entry) documenting when the contacts
were removed by the administrative user.
Then, after this has finished removing all contact associations, we can simply
empty out or drop the Contact/ContactHistory tables and associated join tables.
* Skip user loading for proxy service account
Reduces database load by skipping the User entity lookup for the proxy
service account during OIDC authentication.
The high volume of EPP "hello" and "login" commands from the proxy
service account results in a constant database load. These lookups
are unnecessary as the proxy service account is not expected to have a
corresponding User object.
This change optimizes the authentication flow by checking for the proxy
service account email *before* attempting to load a User from the
database. This bypasses the database transaction entirely for these
high-volume requests.
This approach is more efficient than caching, as it eliminates the
database lookup for the proxy service account altogether, rather than
just caching the result.
* comment added and service account lookup time improved
* comment updated for more clarity
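A minimal sketch of the short-circuit for the proxy service account described above, with hypothetical names for the User type and loader:
```java
import java.util.Optional;

final class ProxyAuthShortCircuit {
  record User(String emailAddress) {}

  private final String proxyServiceAccountEmail;

  ProxyAuthShortCircuit(String proxyServiceAccountEmail) {
    this.proxyServiceAccountEmail = proxyServiceAccountEmail;
  }

  Optional<User> authenticate(String email) {
    // The proxy service account has no User entity, so checking its email
    // first avoids a database transaction on every <hello>/<login>.
    if (proxyServiceAccountEmail.equals(email)) {
      return Optional.empty();
    }
    return loadUserByEmail(email); // hypothetical stand-in for the real DB lookup
  }

  private Optional<User> loadUserByEmail(String email) {
    return Optional.of(new User(email)); // stand-in for the real database load
  }
}
```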
* Add cache for User entities in OIDC auth flow
* refactor: Address review feedback
- Refactor database call into a single, reusable method
- Increase the default cache size to 200
- Remove .recordStats() and use a spy for testing
- Split unit tests into separate implementation tests that use Mockito spies instead of checking internal cache stats
We've moved these over to the User class, so we should remove these for
clarity. In addition, we should make it clear (in Java at least) that
the field in the RegistryLock object refers to the email address used
for the lock in question.
This allows us to also check / modify the CharlestonRoad registrar in
the console, and also allows us to test actions (like password reset)
using that registrar in the prod environment.
* Fix OOM in UploadBsaUnavailableDomains action
The action was using string concatenation to generate the upload content.
This causes an OOM when string length exceeds 25MB on our current VM.
This PR switches to streaming upload.
Also added an HTTP upload test.
The error happened when an unblockable name reported with 'Registered'
as the reason had been deregistered. We tried to check the deletion time
of the domain to decide whether this is a transient error that is not
worth reporting. However, we forgot that we do not have the domain key
in this case.
Since this is a best-effort action and the case rarely happens, we
decided not to make the optimization (staleness check) in this case.
Note: this still includes "contacts" for registrars, which are actually
a different concept that we call RegistrarPoc. That's different from
"Contact" objects, e.g. registrant.
The TLD is technically valid but it doesn't exist for us -- we should
return 404 instead of 400 in these situations according to the RDAP
conformance docs
Load balancer / internal redirections can result in the final request
URL lacking "https" when finally getting to the servlet. As a result,
even if you use https in the request, the resulting URL can be plain
http.
We need to include the actual (HTTPS) URL in the output, so replace it.
This increases the Hikari fetch size from 20 to 40 in order to decrease
the number of round trips.
This also sets lower CPU, as we seem to have overshot CPU consumption.
This also sets min replicas to 8 for EPP and max to 16, as we've been
running on 8-10 for the last week.
Tested locally and on alpha with dummy values (and throwing an
exception).
I was able to reuse a bit of code from the EPP password reset, but not
all of it.
This is the first in a series of PRs to implement the expiry access period
(XAP). The overall fee schedules will be set in YAML config files, so the only
DB change necessary should be this single new boolean column on the Tld entity,
which defaults to false so as to require XAP explicitly being turned on for a
given TLD.
BUG=http://b/437398822
Postgresql-17.6 introduces two new lines in pg_dump output as a
security feature: `\restricted {HASH}` and `\unrestricted {HASH}`.
We filter out lines starting with these two prefixes when comparing
schemas.
The db upgrade also adds two empty lines to the pg_dump output. We
now ignore all empty lines when comparing schemas.
It's necessary to remove the GAE-related code (and use GKE launch
commands instead), and we might as well remove contact-related fields
and actions because of the upcoming move to the minimum data set.
This reverts commit 5cef2dd8b5.
We faced CPU quota issue with standard machine type, so rolling back to c4
for now to monitor server performance and decide if we want to try to
downgrade again in the future.
We probably want this to run before the billing recurrence expansion
pipeline just in case there are any domains that should be deleted
before their billing recurrence gets expanded.
nodeSelector can limit k8s's scheduling capabilities, which leads to delays in assigning new workloads. Since we do not require any particular machine for execution, it can be removed.
* Fix: Robustly parse certs and provide specific errors
* Add test for expired certificate failure
* fixing indentation
* fixing indentation
* Update SecurityActionTest.java
* Update SecurityActionTest.java for correcting the testcase
* Fix: Provide indentation fix
* Fixing Deduplication in test
All deployments received an update to averageUtilization CPU. This should allow us to stay ahead of the traffic curve and create instances before CPU reaches the limit.
Frontend CPU allocation has caused a "noisy neighbors" problem, with pods assigned to nodes where there's not enough bursting capacity, so I increased it.
Adjusted the rest of the deployments according to their utilization.
Missing resource requests, as well as missing metrics for when to evict
a resource, produced a situation where k8s struggled to assign pods
under load. This adds default resource requirements based on two weeks
of metrics, plus instructions for when a resource should be evicted.
* Fix error handling in CopyDetailReportsAction
The action tries to record errors per registrar in an ImmutableMap, without realizing that
there may be duplicate keys due to retries.
Switched to the `buildKeepingLast` method to build the map.
* Addressing comments and rebase
Given an entity with auto-filled id fields (annotated with
@GeneratedValue, with null as initial value), when inserting it
using Hibernate, the id fields will be filled with non-nulls even
if the transaction fails.
If the same entity instance is used again in a retry, Hibernate mistakes
it as a detached entity and raises an error.
The workaround is to make a new copy of the entity in each transaction.
This PR applies this pattern to affected entity types.
We considered applying this pattern to JpaTransactionManagerImpl's insert
method so that individual call sites do not have to change. However, we
decided against it because:
- It is unnecessary for entity types that do not have auto-filled id
- The JpaTransactionManager cannot tell if copying is cheap or
expensive. It is better to expose this to the user.
- The JpaTransactionManager needs to know how to clone entities. A new
interface may need to be introduced just for a handful of use cases.
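A minimal sketch of the per-transaction copy pattern (hypothetical retry helper); the key point is that each attempt hands Hibernate a fresh instance whose generated id is still null:
```java
import java.util.function.Consumer;
import java.util.function.Supplier;

final class RetryWithFreshCopy {
  static <T> void insertWithRetries(
      Supplier<T> freshEntity, Consumer<T> insertInTxn, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Each attempt builds a brand-new instance whose @GeneratedValue id is
        // still null; reusing an instance from a failed transaction would make
        // Hibernate treat it as a detached entity.
        insertInTxn.accept(freshEntity.get());
        return;
      } catch (RuntimeException e) {
        last = e; // e.g. a serialization failure; retry with a fresh copy
      }
    }
    throw last;
  }

  private RetryWithFreshCopy() {}
}
```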
It will now only throw errors on domain updates if a new contact/registrant has
been specified where none was previously present. This means that domain updates
on unrelated fields (e.g. nameserver changes) will succeed even if there is
existing contact data that the update is not removing.
This is a follow-up to #2781.
BUG=http://b/434958659
Some of these have been around since the Datastore days and are no
longer relevant (dealing with things like Datastore foreign keys). Let's
simplify things.
This works fairly similarly to the registry lock request and
verification mechanism. The request action generates a UUID which is
emailed (in link form) to the user in question. The frontend will send a
request to the verify action with the UUID and hopefully the action
should be finalized.
EPP password requests can be sent by anyone with edit-registrar
permissions and must be approved by an admin POC email.
Registry lock password resets can only be sent by primary contacts, and
are verified/performed by the user in question.
This prohibits all contact data on create and update EPP flows for both domain
and contact flows. It also refactors how default values on FeatureFlags work, as
it's safer to specify a single default on the flag itself rather than have to
specify it independently at a number of callsites (and potentially end up having
an inconsistent value). Domain updates on existing domains that still have
contact data will fail unless all contact data is removed, as a forcing function
to require registrars to rectify the situation prior to being able to do any
other kind of domain changes.
Contact-related flows that are still allowed after this point: Updating a domain
to remove all contacts from it, and deleting a contact object.
This corresponds to the Feb 2024 response profile section 1.2 and
implementation guide 1.3 respectively, now that we comply (or are at
least closer to complying) with the Feb 2024 versions.
This should probably depend on https://github.com/google/nomulus/pull/2771
because that includes a small change from the Feb 2024 version.
This also updates the documentation to reference the proper areas of the
specifications.
From the response profile:
2.4.6. Registrar URL - The entity with the registrar role in the RDAP response
MUST contain a links member [RFC9083]. The links object MUST contain
the elements: value, identical to the RDAP Base URL for the
Registrar as provided in the IANA “Registrar IDs” registry (i.e.,
https://www.iana.org/assignments/registrar-ids); rel:about, and href
containing the Registrar URL. Note: in cases where the Registry Operator
acts as sponsoring Registrar (e.g., IANA Registrar ID 9999), the href shall
contain a URL from the Registry.
This is necessary so that the total number of requests/responses adds up
correctly even though some fraction of them are only being recorded. It uses
stochastic rounding so that the totals add up correctly even when the reciprocal
of the ratio isn't an integer.
This is a follow-up to PR #2772.
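A minimal Java sketch of the sampling-plus-stochastic-rounding scheme described above (hypothetical class and names; the real metric-recording code differs):
```java
import java.util.concurrent.ThreadLocalRandom;

final class SampledMetricRecorder {
  private final double ratio; // e.g. 0.01 => record roughly 1% of requests

  SampledMetricRecorder(double ratio) {
    this.ratio = ratio;
  }

  /** Returns how many requests a recorded sample should count for, or 0 if the sample is skipped. */
  long sampleWeight() {
    ThreadLocalRandom rng = ThreadLocalRandom.current();
    if (rng.nextDouble() >= ratio) {
      return 0; // not recorded at all
    }
    // Stochastic rounding of 1/ratio: for ratio = 0.03, 1/ratio = 33.33..., so
    // weight 34 one third of the time and 33 two thirds of the time. The
    // expected recorded total then equals the true request count even when
    // the reciprocal is not an integer.
    double reciprocal = 1.0 / ratio;
    long floor = (long) reciprocal;
    return rng.nextDouble() < (reciprocal - floor) ? floor + 1 : floor;
  }
}
```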
This ratio defaults to 1.0 (i.e. all metrics will be recorded), but we will set
it much lower in sandbox and production, probably something closer to 0.01. This
will reduce recorded metrics volume and thus StackDriver cost, while still
retaining enough data for overall performance monitoring.
This is handled stochastically, so as to not require any coordination between
Java threads or GKE pods/clusters, as alternative approaches would (i.e. using a
counter and recording every Nth, or throttling to a max metrics qps).
In RDAP, domain queries are the most common by a factor of like 40,000
so we should optimize these as much as possible. We already have an EPP
resource / foreign key cache which does improve performance somewhat but
looking at some sample logs, it only cuts the RDAP request times by like
40% (looking at requests for the same domain a few seconds apart).
History entries don't change often, so we should cache them to make
subsequent queries faster as well. In addition, we're only caching two
fields per repo ID (modification time, registrar ID) so we can cache
more entries than we can for the EPP resource cache (which stores large
objects).
Pubapi actions should always use cache, regardless of the config
settings on caching.
In EppResource.java, the original `loadCached(Iterable<VKey>)`
method is renamed to `loadByCacheIfEnabled`. The original
`loadCached(Vkey)` method is renamed to `loadByCache` and always
uses cache.
In EppResourceUtils.java, the original `loadByForeignKeyCached`
method is renamed to `loadByForeignKeyByCacheIfEnabled`. A new
`loadByForeignKeyByCache` method is added, which always uses the cache.
In ForeignKeyUtils.java, the original `loadCached` method is
renamed to `loadByCacheIfEnabled`, and a new `loadCached` method
is added which always uses cache.
Also added a `getContactsFromReplica` method in Registrar,
for use by RDAP actions.
A future PR will add the actions that save and use this object. That
future PR will also require loading RegistrarPoc objects given the
registrar ID, hence the change in that class.
This implements two types of changes:
1. changing the link type for things like the terms of service
2. adding the request URL to each and every link with the "value" field.
This is a bit tricky to implement because the links are generated in
various places, but we can handle it by adding the value to the results
after generation.
See b/418782147 for more information
Add a new DnsWritersModule for use by the component classes.
To override the set of writers installed, we can easily overwrite this
file with a private version.
This is necessary because we'll use primary-contact emails as a way of
resetting passwords.
In the UI, don't allow editing of email address for primary contacts,
and don't allow addition/removal of the primary contact field
post-creation.
In the backend, make sure that all emails previously added still exist.
We're changing the way that allocation tokens work in suboptimal (i.e. incorrect) situations in the domain check, creation, and renewal process. Currently, if a token is not applicable, in any way, to any of the operations (including when a check has multiple operations requested) we return some variation of "Allocation token not valid" for all of those options. We wish to allow for a more lenient process, where if a token is "not applicable" instead of "invalid", we just pass through that part of the request as if the token were not there.
Types of errors that will remain catastrophic, where we'll basically return a token error immediately in all cases:
- nonexistent or null token
- token is assigned to a particular domain and the request isn't for that domain
- token is not valid for this registrar
- token is a single-use token that has already been redeemed
- token has a promotional schedule and it's no longer valid
Types of errors that will now be a silent pass-through, as if the user did not issue a token:
- token is not allowed for this TLD
- token has a discount, is not valid for premium names, and the domain name is premium
- token does not allow the provided EPP action
Currently, the last three types of errors cause that generic "token invalid" message but in the future, we'll pass the requests through as if the user did not pass in a token. This does allow for a default token to apply to these requests if available, meaning that it's possible that a single DomainCheckFlow with multiple check requests could use the provided token for some check(s), and a default token for others.
The flip side of this is that if the user passes in a catastrophically invalid token (the first five error messages above), we will return that result to any/all checks that they request, even if there are other issues with that request (e.g. the domain is reserved or already registered).
See b/315504612 for more details and background
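A minimal sketch of the split between catastrophic failures and silent pass-through (hypothetical enum and names; the real validation logic differs):
```java
import java.util.Optional;

final class TokenErrorHandling {
  enum TokenProblem {
    // Catastrophic: fail every requested operation immediately.
    NONEXISTENT(true),
    WRONG_DOMAIN(true),
    WRONG_REGISTRAR(true),
    ALREADY_REDEEMED(true),
    PROMO_NOT_ACTIVE(true),
    // Not applicable: behave as if no token had been supplied at all
    // (which also lets a default token apply where one is configured).
    WRONG_TLD(false),
    NOT_VALID_FOR_PREMIUM(false),
    EPP_ACTION_NOT_ALLOWED(false);

    final boolean catastrophic;

    TokenProblem(boolean catastrophic) {
      this.catastrophic = catastrophic;
    }
  }

  static Optional<String> resolveToken(String token, Optional<TokenProblem> problem) {
    if (problem.isEmpty()) {
      return Optional.of(token); // token applies normally
    }
    if (problem.get().catastrophic) {
      throw new IllegalArgumentException("Allocation token not valid");
    }
    return Optional.empty(); // silent pass-through: act as if no token was given
  }

  private TokenErrorHandling() {}
}
```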
We suspected this could be a cause of optimistic locking failures
(because long transactions would lead to optimistic locks not being
released) but this didn't end up being the case. Let's remove this to
reduce log spam.
1. This doesn't remove the SQL tables yet (this is necessary to pass
tests and also good practice just in case we need or want to look at
history for a little bit)
2. This also removes the Registrar, RegistrarPoc, and User base classes
that were only necessary because we were saving copies of those
objects in the old history classes.
We no longer need to union GKE+GAE logs since we've moved all production
traffic to GKE only.
For testing, I copied the affected *_test.sql files to Bigquery, removed
all the "-alpha" bits, and changed the dates to 20250301 and 20250331
and ran them to make sure they returned the expected data.
Now that we have effective global sessions thanks to #2734, there is no
longer a need to keep the number of pods on the EPP service static.
We are also not vulnerable to random pod restarts. K8s never guarantees
perpetual pod lifetime anyway, and not having to be at its mercy is
certainly a relief.
It turns out that a period can be used in the URI, such as in
"urn:ietf:params:xml:ns:fee-0.12". I don't think a pipe is used, at least
not according to the EPP URI namespace naming convention.
Ideally we'd use serialization, but using the default serialization runs
the risk of it being platform/JDK dependent, so a new deployment might
not be able to deserialize existing cookies. A custom serializer that
guarantees stability would have been needed.
This was added back in early 2018 long ago to enable promotions, but
since then (and for many years) we've added the ability to run
promotions on the tokens themselves, rather than relying on custom Java
classes.
This will make the changes for b/315504612 much easier, as that will
split up token validation into "is this token valid in general?" and "is
this token valid for this domain/action?"
This can potentially help even more with serializable transaction
failures (optimistic locking exceptions, which are expected to occur
somewhat frequently).
With six attempts, we will sleep at most five times, for 100, 200, 400,
800, and 1600 ms respectively, for a total of at most 3.1 seconds (much
less than the EPP maximum, which I believe (?) to be 30 seconds).
In addition, we add a 20% skew in an attempt to spread out
possibly-conflicting transaction retries.
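A minimal sketch of the sleep schedule (hypothetical constants and a plus-or-minus 20% interpretation of the skew; the real retrier differs):
```java
import java.util.concurrent.ThreadLocalRandom;

final class TransactionRetrySleeper {
  // Six attempts => at most five sleeps of roughly 100, 200, 400, 800, 1600 ms,
  // i.e. about 3.1 seconds total before the final attempt.
  static long sleepMillisBeforeRetry(int failedAttempt) { // 1-based attempt count
    long base = 100L << (failedAttempt - 1);
    // Apply a +/-20% skew to spread out retries of possibly-conflicting transactions.
    double skew = 1.0 + (ThreadLocalRandom.current().nextDouble() - 0.5) * 0.4;
    return Math.round(base * skew);
  }

  private TransactionRetrySleeper() {}
}
```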
This changes the code to only save console histories of this type. We
keep the old Java code (and, necessarily, the corresponding SQL code)
for now because there's no harm in doing so and we want to avoid hastily
deleting too much.
The SQL statement was incorrectly flooring to zero one layer too deep, which was
negating all negative transaction report rows (which occur most frequently when
a domain in the autorenew grace period is deleted). I've changed it so that it
now only floors to zero at the report level, which still solves the issue
reported in http://b/290228682 but whose original fix caused the issue
http://b/344645788
This bug was introduced in https://github.com/google/nomulus/pull/2074
I tested this by running the new query against the DB for 2024 Q4 using the
registrar that was having issues and confirmed that the total renewal numbers
for .app now match with the sum total of what we invoiced for the last three
months of 2024.
This doesn't check for correctness (we have other scripts that do that)
but just that the service is available at all (the other scripts do not
do that).
This should, and will, be configured with a scheduled trigger in GCB (for us, in
the domain-registry-dev project) and configuration to send some sort of
pub/sub notification on failure (for us, this is already set up on
domain-registry-dev and it sends messages to the "Domain Registry
Notifications" chat channel.
We've seen this issue happen more often than not recently, where the GAE
canary deployment is stuck for about 10 min and then fails. The reason
is not clear, but deleting the canary version prior to a deployment
always fixes the issue.
We use the $environment property to set the console config. If it is not
given, 'alpha' is used, which has the same effect as 'production'.
TESTED=ran :jetty:copyConsole with
-Penvironment=(sandbox|production|alpha) and checked the resulting js
file.
For GKE, all tasks should be on backend; BSA was on its own service
because of an egress IP constraint.
Also made it possible to specify a timeout for the Cloud Scheduler job,
with the default (3m) suitable for most tasks.
There are two session cookies, JSESSIONID, which is set by Jetty, and
GCLB, which is set by the Gateway.
In one session, every request other than the first one (the <hello>)
should have the same GCLB value, and every request after a successful
<login> should have the same JSESSIONID.
With these two pieces of metadata, we should be able to trace all requests that
*should* belong to the same session and debug issues with session
mismatch (if any).
There can be delays in releasing predicate locks when we have
transactions that are long-lived -- even delays in releasing predicate
locks acquired by shorter-lived transactions. Logging the transaction
duration will allow us to get a sense as to transaction durations during
busy times.
This can happen on public endpoints (in pubapi) where the service is
behind IAP but all users (including not-logged-in ones) are allowed. IAP
will add an OIDC token with no email field in the request header.
This removes logic from an inner helper method so that it becomes more clear
from callsites within each test exactly which behavior is expected from those
test conditions.
We now pin to PostgreSQL v17 when running tests, which means that the
minor version might increase without our intervention. This causes (at
least) the comment in the golden schema to change, failing the test as a
result.
This PR adds the ability to strip lines that we deem comments from the
comparison, so we don't have to do trivial upgrades to the golden schema
whenever there's a minor version upgrade.
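A minimal sketch of the comment-stripping step, assuming the dump is handled as a list of lines (hypothetical helper name):
```java
import com.google.common.collect.ImmutableList;

final class SchemaDumpFilter {
  // Drop SQL comment lines (e.g. the pg_dump header that embeds the exact
  // PostgreSQL minor version) before diffing against the golden schema.
  static ImmutableList<String> stripComments(ImmutableList<String> dumpLines) {
    return dumpLines.stream()
        .filter(line -> !line.trim().startsWith("--"))
        .collect(ImmutableList.toImmutableList());
  }

  private SchemaDumpFilter() {}
}
```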
This allows us to easily tell which tag was deployed.
Also set the gateway to use named addresses so they are stable, and so
that we can attach an IPv6 record to it. Auto-provisioned addresses are
IPv4 only.
Failed pg_dump may not leave a file, failing the subsequent diffing and
causing the verifier to return success.
The verifier should abort in this case.
GKE logs are routed to a different dataset and the table is different.
The structs to look for are also different (jsonPayload vs textPayload
or protoPayload).
TESTED=Ran the resulting query in crash.
I obtained access to an IBM s390x VM so I thought I'd see how multi-arch
Nomulus is.
Our main application is in Java so it is already multi-arch, but several
tests use docker images that are by default x64. Luckily postgres has an
s390x port, but selenium does not. So I had to disable Screenshot tests
when the arch is not amd64.
This isn't a situation we'll encounter often, but if the client has an
allocation token that's valid for premium domains that gives a 0 cost,
we shouldn't require them to include the fee extension when creating the
domain. We already don't require it for standard domains.
We use this now for the DomainDeleteFlow in an attempt to figure out
what statements it's running (cross-referencing that with PSQL's own
statement logging to find slow statements).
This is kinda nonsensical because this use case is trying to apply a
single-use token multiple times in the same domain:check request --
like trying to use a single-use token for create, renew, and transfer
at once while having a $0 create price and a premium renewal price.
This change doesn't affect any actual business / costs, since SPECIFIED
token renewal prices were already set on the BillingRecurrence.
Instead of using a separate RenewalPriceInfo object, just map the
behavior (if it exists) onto the BillingRecurrence with a special
carve-out, as always, for anchor tenants (note: this shouldn't matter
much since anchor tenants *should* use NONPREMIUM renewal tokens anyway,
but just in case, double-check).
This also fixes DomainPricingLogic to treat a multiyear create as a
one-year-create + n-minus-1-year-renewal for cases where either the
creation or the renewal (or both) are nonpremium.
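A minimal sketch of the split (hypothetical method; the real DomainPricingLogic works with fee objects and renewal-price behaviors):
```java
import org.joda.money.CurrencyUnit;
import org.joda.money.Money;

final class MultiYearCreatePricing {
  // Price an N-year create as one year at the create price plus N-1 years at
  // the renewal price, so nonpremium behavior on either side only affects the
  // years it actually covers.
  static Money multiYearCreatePrice(int years, Money oneYearCreate, Money oneYearRenew) {
    Money total = oneYearCreate;
    for (int i = 1; i < years; i++) {
      total = total.plus(oneYearRenew);
    }
    return total;
  }

  public static void main(String[] args) {
    Money create = Money.of(CurrencyUnit.USD, 100); // premium create price
    Money renew = Money.of(CurrencyUnit.USD, 10);   // nonpremium renewal price
    System.out.println(multiYearCreatePrice(3, create, renew)); // USD 120.00
  }
}
```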
We are required to respond to HTTP(S) requests on port 80/443 on the
same domain where we serve port 43 WHOIS requests. The proxy already
does this by redirecting to the web WHOIS lookup page on the marketing
website.
This PR makes it so that requests to port 80/443 can be routed to the
proxy for redirect.
TESTED=tested on crash and the redirect works.
SecretManager-based keyring was not included in keyring bindings,
resulting in runtime failure.
We should simplify keyring bindings. There is no use case for multiple
implementations. See b/388835696.
Known nested transact calls found in sandbox have been refactored away.
Enable logging in production to catch any missing cases. Logging is
throttled at 1 message per minute per VM.
k8s does not have a way to expose a global load balancer with TCP
endpoints, and setting up node port-based routing is a chore, even with
Terraform (which is what we did with the standalone proxy).
We will use Cloud DNS's geolocation routing policy to ensure that
clients connect to the endpoint closest to them.
- We never delete rows from DomainHistory (and even if we do in the
future, they'll be old / the references won't matter)
- This is likely creating lock contention when lots of requests come
through at once for domains with many DomainHistory entries
There are some breaking method changes in the 10.x.y versions and we're encountering exceptions when trying to run the flywayMigrate task thanks to those.
For now it only includes two options (domain deletion and domain
suspension). In the future, as necessary, we can add other actions but
this seems like a relatively simple starting point (actions like bulk
updates are much more conceptually complex).
This might help alleviate DB transaction contention on the PollMessage table. A
lower transaction isolation level is safe because acking a poll message is
idempotent: there are only two things it does, either delete a poll message or
take a recurring one from the past and set it to be a year in the future from
the date in the past. Both of these operations will always yield the same final
result even if executed multiple times simultaneously for some reason.
It doesn't need a higher transaction isolation level as it's only loading a given poll
message once, and we want to avoid putting any kind of locks on the PollMessage table
as it seems to be having contention issues. Note that the poll message request flow
is by far the most frequent code that touches the PollMessage table, as there are many
many requests every minute from dozens of registrars, but much fewer poll messages
than that to actually ACK.
We only include the deletion time if the domain is in the 5-day
PENDING_DELETE period after the 30-day REDEMPTION period. For all other
domains, we just have an empty string as that field.
This is behind a feature flag so that we can control when it is enabled
It was failing when any kind of transfer data was present, even completed
transfer data. Note that completed transfer data persists on a domain
indefinitely until/unless a new transfer is requested.
BUG= http://b/377328244
It seems like `/usr/bin/python` is no longer symlinked to the `python3`
binary in the `gcr.io/cloud-builders/git` image.
I've sent out a separate fix to upstream to change the shebang.
https://gerrit-review.git.corp.google.com/c/gcompute-tools/+/439501
But in the meantime, we need this temporary fix for the release to
build.
These flags are suggested by GitHub support to disable reusing caches
during Gradle build. They think that could fix the intermittent error
message:
```
Encountered a fatal error while running "/opt/hostedtoolcache/CodeQL/2.19.0/x64/codeql/codeql database finalize --finalize-dataset --threads=4 --ram=14576 --verbosity=progress++ /home/runner/work/_temp/codeql_databases/java". Exit code was 32 and last log line was: CodeQL detected code written in Java/Kotlin but could not process any of it. For more information, review our troubleshooting guide at https://gh.io/troubleshooting-code-scanning/no-source-code-seen-during-build . See the logs for more details.
```
This fails for me every day for some reason (starting about a month
ago). The same commit went through the workflow fine when the action was
triggered by a push.
I think there's no reason for us to have a cron run as the changes to the
master branch can only come from commit pushes.
Newer versions of GPG (v2.4.5 in my case) use different wording
than what's available in our build image (and Ubuntu, I suspect). For
example, it says "rsa2048" instead of "2048-bit RSA".
Make the tests work in both cases. Admittedly we cannot check for the
string RSA/rsa easily, but I don't think it matters much for tests.
It looks like /usr/bin/python *may* no longer exist in the latest cloud
builder git image. I ran the latest image and logged into it to verify
that /usr/bin/python3 does exist on 9/25, and again on 9/26 where it
re-appeared.
I think it is generally a good idea to not rely on it being there going
forward.
* Include discount price in domain pricing
* Partial progress in logic
* Tests and logic passing
* Change pricing for multi year create
* Tests for discount pricing logic
* Token currency check
* Add some comments
* Java formatting
* Discount price to Optional
* Change discount price to be optional nullable
* Re-add deleted tests
Depending on whether a "--gke" parameter (which must be the last one) is
passed, the deployer constructs the corresponding URIs for GAE or GKE
accordingly.
TESTED=Used the deployer to deploy tasks to alpha and verified that they
run on GKE.
* Drop the `transact` call in Id services
All usages already routed through `tm().allocateId()`, which is
guaranteed to be in a transaction.
* Addressing reviews
Several jars in our dependencies are now multi-release, including
dnsjava and snakeyaml, and a few more. Such jars include
jvm-version-specific classes that will only be loaded by the vm that can
handle them. All it takes is a new manifest attribute.
This change allows us to upgrade to dnsjava 3.6+: the base (Java 8) version of
this jar breaks Java 21. The correct manifest allows Java 21 to find the
classes it needs.
This single metric currently accounts for 22.2% of our total metrics bill,
almost double the size of our EPP requests metric, while also simultaneously
being much less useful. This change reduces the cardinality by removing two
parameters we don't care that much about, which should significantly reduce the
size and thus the cost. If after this change the metric is still too large, I'll
also then remove the matchCount parameter from this metric. We could possibly
even consider deleting the metric in its entirety, as we hardly ever use it.
This PR also removes unused code for premium list metrics that have never
actually been written out (and that we won't bother with at this point).
This will help us if/when we need to run the report generation multiple
times, or for past dates and we don't want to send extra emails or
upload any extra reports to ICANN.
Instead of having to parse the protoPayload.line from the request logs,
we just want to inspect the textPayload from the app logs (stored in a
separate table). This applies to the EPP metrics from the activity
reporting and the attempted-adds column for the transaction reporting.
This doesn't really add any tests, and we'll require many more additions
if we actually want to have full unit testing, but this at least makes
the tests pass when running `npm test`.
Previous PRs and token changes (see b/332928676) have made it so that
SPECIFIED renewalPriceBehavior tokens must have a renewal price. As
such, we can now use that renewalPrice when creating domains with
SPECIFIED tokens.
* Add QPS and incomplete connections metrics to load test client
* Add a failed request count
* Add todos
* Reuse contact
* Add bugs to todos
* small fix
* Clarify QPS
It's valid for the auth data to be null (although it only happens 10 times
across our entire registry), so the domain update flow should not fail out with
a NullPointerException when the existing state of the data is null and the
update isn't adding that data either.
BUG=http://b/359264787
This is the first step in the field removal (second will be removing the
column from SQL once this is deployed).
There's no point in using a UserDao versus just doing the standard
loading-from-DB that we do everywhere else. No need to special-case it.
When creating/deleting users, we need to add/remove the emails in
question to/from the console email group (if it exists). This used to be
done synchronously by calling the Groups API directly from the nomulus
tool. However #2488 made it so that in all cases where group membership
is modified, a Cloud Tasks task is created to execute the change on
the server side asynchronously (because there are multiple places where
this change needs to be done, and it is easier to make it all happen on the
server side).
Alas, as it turns out, Cloud Tasks tasks need to be created with a
service account's credential (which is trivially done on the server side
because the ADC is a service account). Nomulus command runs with a user
credential, and we need to grant the relevant user permission to
masquerade as a service account, in order to enqueue tasks from the
nomulus tool. It is therefore easier to just revert to the old behavior.
Originally, we thought that User entities were going to have mutable
email addresses, and thus would require a non-changing primary key. This
proved to not be the case. It'll simplify the User loading/saving code
if we just do everything by email address.
Obviously this doesn't change much functionality, but it prepares us for
removing the id field down the line once the changes propagate.
For instance, on sandbox this will allow us to remove our global roles
but keep roles to the CharlestonRoad admin registrar. Then, when we view
the console, it will be as if we were a registrar user.
This is consistent with how other registries are handling RDAP and is also consistent
with overall behavior in WHOIS and domain info flows as implemented in my previous
PRs #2477 and #2490.
* Check FeatureFlag in domain flows before checking contacts
Check if phase 1 of the transition to the minimum registry dataset has begun, and if it has, do not require the presence of contacts in domain flows.
* Add tests
* Small test fixes
* rename flag
* Fix merge conflicts
* Change todo
* Add isActive methods
* Add javadocs
* small fix
As requested, for registrars participating in these tiered pricing
promos that wish to receive this type of response, we make the following
changes:
1. The pre-promotional (i.e. base tier) price is returned as the
standard domain-create fee when running a domain check.
2. The promotional (i.e. correct) price is returned as a special custom
command class with a name of "STANDARD PROMO" when running a domain
check
3. Domain creates will return the non-promotional (i.e. incorrect) price
rather than the actual promotional price.
This PR does only number 3. See PR #2489 for the others.
As requested, for registrars participating in these tiered pricing promos
that wish to receive this type of response, we make the following
changes:
1. The non-promotional (i.e. incorrect) price is returned as the
standard domain-create fee when running a domain check.
2. The promotional (i.e. correct) price is returned as a special custom
command class with a name of "STANDARD PROMO" when running a domain
check.
3. Domain creates will return the non-promotional (i.e. incorrect) price
rather than the actual promotional price. This is not implemented in
this PR.
Our logs are getting gummed up with an indefinitely failing and retrying task to
re-save a prober domain that doesn't exist (likely because it was hard-deleted
by the delete prober data action). This makes the re-save action resilient to
that failure case: it stops assuming that every enqueued re-save corresponds to
an entity that exists, allowing it to fail permanently if the entity doesn't
exist. Failing permanently is the right thing to do: if the entity doesn't exist
now, there's no reason to think it will in the future, and all re-saves are
optimistic rather than guaranteed anyway.
This should fix http://b/350530720
* Add registryTool commands for FeatureFlags
* Fix merge conflicts
* Add required parameters and inject mapper
* Use optionals in cache to negatively cache missing objects
* Fix spelling
* Change back to bulk load in cache
* Add FeatureName enum
* Change variable name
* Use FeatureName in main parameter
This doesn't yet allow them to be absent in EPP flows, but it should make the
code not break if they happen to be null in the database. This is a follow-up to
PR #2477, which ends up being a bit easier because whereas the registrant is
used in more parts of the codebase, the other contact types (admin, technical,
billing) are really only used in RDE, WHOIS, and RDAP, and because they were
already being used as a collection anyway, the handling for when that collection
contains fewer elements or is empty happened to already be mostly correct.
When using this token (which must be tied to a particular domain), the
first year price (and only the first year price, i.e. the creation
price) will be the standard price for this TLD. Future years (i.e.
renewals) will continue at the normal premium price.
This isn't the worst thing in the world but it does result in a bad
request to the server otherwise, and log/error spam. So, only load the
domains list if we have a registrar selected.
This is the first step in migrating to the minimum registration data set. Note
that our database model already permits null domain registrants, so this just
makes the code accept it as well. Note that I haven't changed any requirements
in EPP flows yet; a later step will be to check the migration schedule and then
not require the registrant to be present if in a suitable state.
This does potentially affect the output of WHOIS/RDAP, but that's a NOOP so long
as EPP commands and other tools continue to enforce the requirement of a
registrant.
This is an optional field (will be required when the renewal price
behavior is SPECIFIED). This will allow us to set arbitrary renewal
prices for domains as part of one-off negotiations.
https://b.corp.google.com/issues/332928676
This is the last remaining GAE API that we depend on. By removing it, we are able to remove all common GAE dependencies as well.
To merge this PR, we need to create console User objects that have the same email address as the RegistrarPoc objects' login_email_address and copy over the existing registry lock hashes and salts.
We are also able to simplify the code base by removing some redundant logic like AuthMethod (API is now the only supported one) and UserAuthInfo (console user is now the only supported one).
There are several behavioral changes that are worth noting:
The XsrfTokenManager now uses the console user's email address to mint and verify the token. Previously, only email addresses returned by the GAE Users service were used, and a blank email address was used if the user was logged in as a console user. I believe this was an oversight that is now corrected.
The legacy console will return 401 when no user is logged in, instead of redirecting to the Users service login flow.
The logout URL in the legacy console is changed to use the IAP logout flow. It will clear the cookie and redirect the users to IAP login page (tested on QA).
The screenshot changes are mostly due to the console users lacking a display name and therefore showing the email address instead. Some changes are due to using the console user's email address as the registry lock email address, which is being fixed in Add DB column for separate rlock email address #2413 and its follow-up PRs.
GKE now displays log messages correctly. There is no need for an extra
leading newline, which now results in a useless blank line for each log
entry in Log Explorer.
* Create a load testing EPP client.
This code is mostly based off of what was used for a past EPP load testing client that can be found in Google3 at https://source.corp.google.com/piper///depot/google3/experimental/users/jianglai/proxy/java/google/registry/proxy/client/
I modified the old client to be open-source friendly and use Gradle.
For now, this only performs a login and logout command; I will further expand on this in later PRs to add other EPP commands so that we can truly load test the system.
* Small changes
* Remove unnecessary build dep
* Add gradle build tasks
* Small fixes
* Add an instances setUp and cleanUp script
* More modifications to instance setup scripts
* change to ubuntu instance
* Add comment to make ssh work
For reasons unclear to me, if the stack trace is appended directly to
the message, the log entry will be lumped together with following logs
on GKE.
Also updated the GKE service account for Nomulus in the manifest so we
can use workload identity just for Nomulus, not other pods on the same
cluster.
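A minimal sketch of the logging pattern implied above, assuming the Flogger API (FluentLogger) used in the codebase: pass the throwable as a cause instead of concatenating its stack trace into the message string.
```java
import com.google.common.flogger.FluentLogger;

public final class LoggingCauseExample {
  private static final FluentLogger logger = FluentLogger.forEnclosingClass();

  static void reportFailure(Exception e) {
    // Avoid building the message as "Failed: " + stackTraceAsString(e); instead
    // attach the exception as a cause so the handler can emit it separately.
    logger.atSevere().withCause(e).log("Failed to process request");
  }

  public static void main(String[] args) {
    reportFailure(new IllegalStateException("example failure"));
  }
}
```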
There are a bunch of cases where we want common exception handling, and it's
annoying to repeat the "set failed response and make sure to return" boilerplate
each time.
This allows us to break up request methods more easily, since we can now
often throw exceptions that will break all the way back up to
ConsoleApiAction. Previously, any error handling had to exist in the
primary handler method so it could return.
The user, on the front end, should not be required to specify whether
they're trying to verify a lock or an unlock. They should only need
the verification code. We can inspect the lock object itself (and the
domain in question) to see whether or not we're verifying a lock or an
unlock.
We've added the field in the database in a previous PR. This is only
used in the old console for now because the new console does not have
registry lock functionality yet
* Remove the createBillingCost field from Tld
* fix spacing
* Change field name of map
* Rename getter
* Fix formatting
* Fix todo
* unchange column name
* Add log traces to Nomulus service on GKE
Add request-scoped log traces to Nomulus on GKE, which, unlike
AppEngine, Cloud Run, etc., does not generate traces for hosted
applications. This change only affects the GKE image. It does not affect
the AppEngine services.
Log traces are added to Nomulus-generated logs in request-processing
threads. Forked threads are not covered yet. The single relevant use
case (TimeLimiter) will be addressed in a followup PR.
The main change is in the logging configuration:
* Use gcp-cloud-logging's LoggingHandler
* Add gcp-cloud-logging's TraceLoggingEnhancer to the handler.
* Set a thread-local trace id through the TraceLoggingEnhancer in
ServletBase on request entry and clear it on completion (sketched below).
Also removed an unused class (`RequestLogId`).
* CR
* CR
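A minimal sketch of the trace-id handling described above, using google-cloud-logging's TraceLoggingEnhancer; the header parsing and project-id wiring here are illustrative assumptions, not the exact ServletBase code.
```java
import com.google.cloud.logging.TraceLoggingEnhancer;

public final class TraceIdSketch {
  /** Runs one request with a request-scoped trace id that the logging handler picks up. */
  static void runWithTrace(String cloudTraceHeader, String projectId, Runnable serviceRequest) {
    // Header format is roughly "TRACE_ID/SPAN_ID;o=1"; Cloud Logging wants the qualified name.
    String traceId =
        cloudTraceHeader == null
            ? null
            : "projects/" + projectId + "/traces/" + cloudTraceHeader.split("/")[0];
    TraceLoggingEnhancer.setCurrentTraceId(traceId); // stored in a thread-local
    try {
      serviceRequest.run();
    } finally {
      TraceLoggingEnhancer.setCurrentTraceId(null); // clear on request completion
    }
  }
}
```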
This handles both GET and POST requests. For POST requests it doesn't
actually change anything about the domains because we will need to add a
verification action (this will be done in a future PR).
* Avoid contention over the RefreshDnsRequest table
This table can be small at times, in which case PSQL may use a table scan even
for queries by key. At the SERIALIZABLE isolation level, updates to
unrelated rows may then conflict due to this `optimization`.
Lower the isolation level to repeatable read.
* Code review
This includes using the new switch format (though IntelliJ does not yet
understand patterns including default so those aren't used), multiline strings,
replacing some unnecessary type declarations with <>, converting some classes to
records, replacing some Guava predicates with native Java code, and some other
miscellaneous Code Inspection fixes.
This adds an index on transfer_billing_cancellation_id to Domain and superordinate_domain to Host. When tested on crash with the action limited to only delete 10,000 domains, before these indexes were added the action took about 2 hours to delete 10,000 domains. Once these indexes were added, the action was able to delete the 10,000 domains in a little under 2 minutes.
This also creates base classes for the objects contained within the
history classes, e.g. RegistrarBase. This is the same way that objects
stored in the HistoryEntry subclasses have base classes, e.g.
DomainBase.
We cannot rely on the user checking their login email, so we'll want to
send the emails to the other address if configured. This is already the
case in RegistrarPoc.
Use REPEATABLE READ for lock acquire/release operations to avoid conflicts
between locks.
PostgreSQL uses table scans on small tables, causing false sharing at
the SERIALIZABLE isolation level.
See b/333537928 for details.
Console users need IAP to inject the necessary OIDC tokens into their
request headers and therefore need to be bound to appropriate roles. Note
that in environments managed by latchkey, the bindings will need to be
present in latchkey config files as well, otherwise the changes made by
the nomulus tool will be reverted.
TESTED=ran the nomulus command against alpha and verified that the
bindings are created/removed upon console user creation/deletion.
The existing behavior was to ignore bad header names, in a way that was
counter-intuitive as a user of the Google Sheet. If a header name was bad (which
could just be someone accidentally changing it not realizing it needs to
correspond exactly to the name of the field on the Java object), then all of the
data in that column was just silently left as-is and never updated. This led to
gradually worsening sync and offset shift errors over time.
Now, it will write out an error message into every single cell in the bad
column, so it's clear that the column name is wrong and does not correspond to any
actual data in the DB.
BUG=http://b/332336068
As a read-only action that tolerates staleness, locking is unnecessary.
This should help with the lock contention we are observing.
Also reduces the number of VM instances provisioned for BSA and increases
the idle timeout. This should reduce invocation delay. Longer delay may
cause AppEngine to return `Timeout` status to Cloud Scheduler even
though the cron job succeeds.
This also involved breaking out an improperly done assertThat() helper overload
method for JsonObjects into a proper Subject that doesn't further overload
assertThat().
* Check for missing BSA unblockable domains
All unblockable domains created before the last refresh run should be
reported as unblockable (registered).
All reserved domains that are not registered should be reported as
unblockable (reserved). Note that transient errors may be reported for
newly added reserved domains since we do not maintain update time for
when a reserved label is associated with a TLD. However, this scenario
is extremely rare in operations.
* Addressing review
* Add batching to DeleteProberDataAction
* Only get time once
* Add separate query for dry run
* Update queries to actually properly delete all the data
* Fix merge conflicts
* Add test for foreign key constraints
* Make transaction repeatable read
* Make queries to subtables native
* Add native query for GracePeriodHistory
* Kill job after 20 hours
* remove extra time check from read query
* Add index for domainRepoId to PollMessage and DomainHistoryHost
* Add flyway fix for Concurrent
* fix gradle.properties
* Modify lockfiles
* Update the release tool and add IF NOT EXISTS
* Test removing transactional lock from deploy script
* Add transactional lock flag to actual flyway commands in script
* Remove flag from info command
* Add configuration for integration test
* Verify unblockables are truly unblockable
Unblockable domains may become blockable due to deregistration or
removal from the reserved list. The BSA refresh job is responsible
for removing them from the database. This PR verifies that the refreshes
are correct.
Note that recent changes since the last refresh are not reflected in the
result, and inconsistencies due to recent deregistrations are ignored.
Changes in reserved status or IDN validity are not timestamped,
therefore we cannot ignore recent inconsistencies. However, these
changes are rare.
* Addressing code review
* Addressing code review
Upgrade to using Jakarta EE 10 from Java EE 8 by mostly following the upgrade instructions. Only the servlet package is upgraded. Other Jakarta EE components (like the persistence package that Hibernate depends on) need to be upgraded separately.
TESTED=deployed and successfully communicated with the pubapi endpoint for web WHOIS.
Note that this currently requires packaging the App Engine runtime per instructions here due to GoogleCloudPlatform/appengine-java-standard#98. This PR will only be merged once the fix is deployed to production (https://rapid.corp.google.com/#/release/serverless_runtimes_run_java/java21_20240310_21_0).
We don't use the upload results feature (kokoro picks the results
artifacts directly and uploads them).
Keeping it around is a maintenance burden.
Also fixed a deprecation warning.
Labels that are not in any supported IDN are not added to the database.
Remove such labels from those loaded from the block list files before
comparing with DB.
There is one remaining instance in JpaTransactionManagerImpl that cannot
be removed because DetachingTypedQuery implements TypedQuery, which has
a method that expects java.util.Date.
Console tests fail for the files that are affected by the redesign. There's no point in fixing them here. I will re-enable the task after the console redesign PR is merged.
Note that Dagger currently doesn't work with the Jakarta namespace and
we have to cap the jakarta inject package version below 2.0 so that it
still provides classes in the old namespace.
Removed the deprecation mark as it is natural to expose methods related
to a transaction like getting the entity manager or checking if one is
in a transaction through the transaction manager interface.
Also only caches/resets the original TM when in unit tests (TBH I'm not so sure
that even this is necessary, as we don't seem to call the tool from tests
that often. Only ShellCommandTest calls the run() function
in RegistryCli, and we could just put those tests in fragileTest and make
them run sequentially, forking every time, to get around the interference
issue).
The issue with caching is that it tries to first create the to-be-cached
TM, and when the environment given is prod/sandbox/..., it will try to
retrieve SQL credentials from the prod/sandbox/... secret manager. This
works fine locally as we all have access to prod/sandbox/..., but fails
in Cloud Build jobs such as sync-db-objects, where Cloud Build provides its own
credential that has direct SQL access, but not access to the
prod/sandbox/... secret manager.
TESTED=ran `./gradlew devTool --args="-e localhost generate_sql_er_diagram -o ../db/src/main/resources/sql/er_diagram"`
* Add a GitHub Action workflow
This allows us to create Gradle dependency graphs for Dependabot analysis (like the ones we already get for JavaScript dependencies).
* Update Java version
* Add build scan
* codeql 3
* run with gradle
* exclude jIFC
* build scan
* Finalize
Chose the default transaction manager based on RegistryEnvironment. This
makes it possible to run nomulus on Jetty locally. Tested with the
following:
```bash
./gradlew :jetty:run -Penvironment=alpha
curl http://localhost:8080/beta.app
```
The docker image is also updated to take an argument that specifies the
environment. It runs locally as well but the container doesn't get
access to locally stored credentials, so it fails to initialize the
transaction manager.
* Add BSA validation job
Add the BsaValidateAction class with a first check (for inconsistency
between downloaded and persisted labels).
* Addressing comments
* Addressing reviews
The staging job runs at 9 AM on the 2nd day of each month, so we should set
the cursor to be after that time, otherwise we attempt to upload reports
on the 1st day of each month before they are ready, causing an error
email to be sent to us.
Remove email classes that depend on AppEngine API. They have been
replaced by the gmail-based client.
Remove `EmailMessage.from` method, which is no longer used.
There is a fixed sender address for the entire domain, which is
set by the Gmail client.
The configs remain to be cleaned up. There is a bug (b/279671974) that
tracks it.
This PR makes the runtime of most of our workload Java 21.
1. App Engine. Java 21 is in GA and it supports Java EE 8. I had to add
an environment variable so that we don't get
AppEngineCredentials by default (we have been using
ComputeEngineCredentials for a couple of years). The upgrade to the Java 21
runtime changed a system property that controls how Jetty logging
works, which also controls whether AppEngineCredentials is returned. Tested by
deploying to alpha.
2. Proxy base image upgraded to Java 21 (distroless still doesn't
support Java 21 and it looks like Temurin is the way to go,
b/306728455). Tested by deploying to alpha.
3. Nomulus tool image upgraded to Temurin 21 as well. Tested locally.
4. Beam pipeline base image upgraded to Java 21. The JAVA21 flag is not
supported by gcloud yet, but specifying the image URL directly works
(and is supported). Tested by running in alpha.
5. Jetty base image upgraded to Java 21. Tested locally.
Make the necessary changes for the code base to compile with JDK 21.
Other changes:
1. Upgraded testcontainer version and the SQL image version (to be the
same as what we use in Cloud SQL). This led to some schema changes and
also changed the order of results in some test queries (for the
better I think, as the new order appears to be alphabetical).
2. Removed dependency on Truth8, which is deprecated.
3. Enabled parallel Gradle task execution and greatly increased the
number of parallel tests in standardTest. Removed outcastTest.
This PR creates a unified RegistryServlet that will serve all
non-console traffic. It also creates a jetty subproject that allows one
to run Nomulus on top of a standard Jetty 12 runtime.
`./gradlew :jetty:stage` will create a jetty base folder at
`jetty/build/jetty-base` where one is able to spin up a local Nomulus server
by running the following command inside the folder:
```bash
java -jar ${JETTY_HOME}/start.jar
```
`JETTY_HOME` is a folder where the [Jetty runtime](https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-home/12.0.6/jetty-home-12.0.6.zip) is located.
This PR also adds a Gradle task to create a Nomulus image based on the
official Jetty image:
```bash
./gradlew :jetty:buildNomulusImage
```
This reverts commit 64f5971275.
The catch block is too broad, and most of the time the errors caught are
because `command.run()` failed and have nothing to do with getting
the transaction manager. The `runCommand` method is already wrapped in a try
block that checks for `LoginRequiredException` and gives the appropriate
error message.
We need to re-assess the situation the next time we encounter a
login issue that does not trigger `LoginRequiredException`. A blanket try-catch
block is not the solution and only makes the situation more
confusing.
* Add a check so newly saved createCostTransitions get recognized and saved to the database
* Fix equals check
* Rename equals method
* Add comment explaining need for createBillingCostTransitionEqualCheck
* Remove use of shouldPublishField from ReservedList
* Remove from tests
* Update test comment
* Fix indentation
* fix test comment
* Fix test
* fix test
* Make shouldPublish column nullable
We introduced Scrypt as the default password hashing algorithm in
November 2023 and have been auto-converting saved hashes whenever a
successful EPP login or registry lock/unlock request is processed.
We will send comms to registrars to inform them of the upcoming removal of
SHA256 support and urge them to log in at least once before the change.
Otherwise, they will need to contact support to reset the password out of
band after the change.
This PR will NOT be submitted until comms are out and the effective date
is immediate.
Co-authored-by: Weimin Yu <weiminyu@google.com>
This makes the callsites look neater, as the work to execute itself is often a
many-line lambda, whereas the transaction isolation level is no more than a
couple dozen characters.
The Distinct transform removes duplicates based on the serialized format
of the elements. By providing a deterministic coder, we can guarantee
that no duplicates exist.
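As an illustration (the element type and values here are placeholders, not the pipeline's actual data), a deterministic coder can be pinned on the input before applying Distinct, since Distinct de-duplicates by comparing encoded bytes:
```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public final class DistinctWithDeterministicCoder {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<KV<String, String>> pairs =
        p.apply(
            Create.of(KV.of("example.tld", "roid-1"), KV.of("example.tld", "roid-1"))
                // Distinct compares serialized bytes, so equal elements must encode identically.
                .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
    pairs.apply(Distinct.create()); // with a deterministic coder, the duplicates collapse to one
    p.run().waitUntilFinish();
  }
}
```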
This is the start of a long and low priority migration, but for now I wanted to do a couple of them just to see what it looks like.
This also demonstrates the pattern for use of an @AutoBuilder to replace an @AutoValue.Builder. See https://github.com/google/auto/blob/main/value/userguide/records.md#builders for full details on that.
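For reference, the pattern from that guide looks roughly like this (a hypothetical record, not one of the migrated classes): a builder interface annotated with @AutoBuilder, for which the processor generates an AutoBuilder_..._Builder implementation.
```java
import com.google.auto.value.AutoBuilder;

public record ContactInfo(String name, String email) {

  public static Builder builder() {
    return new AutoBuilder_ContactInfo_Builder(); // generated by the AutoBuilder processor
  }

  @AutoBuilder
  public interface Builder {
    Builder setName(String name);
    Builder setEmail(String email);
    ContactInfo build();
  }
}
```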
This removes a dependency on the App Engine SDK. It also looks like
(from the logs at least) that shutdown hooks registered the old way stopped
working after the runtime was upgraded to Java 17.
Also removed some random leftover dependencies on the App Engine SDK
that are not needed any more.
* Add java changes for createBillingCostTransitions
* Add negative cost test
* Remove default value
* remove unused variable
* Add check that create cost and transitions map are the same
* inject clock, only use key set when checking for missing fields
* Add test for removing map
* Add ALLOW_BSA allocation type
Add a new type to allow creation of domains blocked by BSA.
Except for the BSA semantics, the new type behaves exactly
like SINGLE_USE.
* Addressing reviews
* Addressing review
* Change tld-update to db-object-updater
* rename sync_tlds.sh to sync_db_objects.sh
* Change to configured command name
* Change environment to sandbox explicitly for testing on alpha
* Add remaining object steps and change cloudbuild-tld-sync to cloudbuild-sync-db-objects
* Add build_environment flag
* Change order of command and directory
* Uncomment out reserved list part
Query size is borderline too large for the replica.
At 50000, about 2 out of 30 took more than 30 seconds and were retried.
Lower it to 40000, and we will keep monitoring the executions.
The fallback should only apply on creates, not on updates, otherwise it can
override an existing value for the email address when only the referral email
should be what's updated.
This fixes a bug introduced back in commit 0ead4f8d9d.
BUG= http://b/322026165
* Stop saving BSA empty refresh changes
We thought that, as a way to verify that the refresh job is running, browsing
the GCS bucket with empty files would be easier than querying the DB or going to
the GCP logging dashboard, but there are too many of them to be useful.
* Add some updates to UpdateReservedListCommand to facilitate internal config presubmits and syncing
Added a dry-run tag for presubmit tests
Added early exit behavior when there are no new changes to the list
Added a new --build_environment tag to indicate that the command is being run from build tools. This tag was also added to UpdatePremiumListCommand. Once this new tag is deployed, and break glass behavior is added, these commands will be modified to prevent runs on the command line in the production environment unless the --build_environment or --break_glass flag is used.
* Fix capitalization
* Added in commented out production environment check for buildEnv flag
The blocking issue is fixed in
https://github.com/google/nomulus/pull/2224.
Java 8 support is being deprecated on 2024-01-31 and no further deployment is
possible afterwards without exception:
https://cloud.google.com/appengine/docs/legacy/standard/java/deprecations
We have been using Java 17 on alpha/crash/qa for several months and have
not observed any blocking issue other than possibly missing email
attachments, which is being mitigated by including a link to the
attachments saved in GCS.
* Fix BsaRefreshAction bugs
Added functional tests for BsaRefreshAction, which check for changes in
domain registration and reservation and apply them to the Unblockable
domain list.
Fixed a few bugs exposed by the tests.
Also refactored a few other tests.
This also moves it back to the replica transaction manager now that it shouldn't be timing
out its queries.
And this adds a test as well (more to come!).
* Define the --end_breakglass and --build_environment flags
It is necessary to define these flags in a deployment before merging go/r3pr/2273 in order to prevent breaking the existing TLD syncing and entity presubmit testing that has already been enabled.
* make break glass 2 words
* Change break_glass flag to take a Boolean and use false value to end break glass mode
* small fixes
* Fix spacing
* Add missing G
* Add clarifying comment
* Refactor a few BSA unit tests
Added a few helpers for managing reserved list in tests and updated the
tests to use them.
Also fixed a bug: when querying for newly created domains, the query
should be restricted to BSA-enrolled TLDs.
* Remove create_tld and update_tld commands
These commands are no longer necessary now that configure_tld command is available. However, the configure_tld command should only be used for crash, QA, and alpha environments. TLDs in production and sandbox must be modified using modifications to their config files in Gerrit unless using the configure_tld command in breakglass mode. Check the "How to configure TLDs" procedure doc for more info.
* re-delete file
- Registered names are not affected.
- Reserved names are not affected.
- Names that are none of the above and match some BSA labels are
reported as blocked.
Also fixes the action taken in the case where zero unavailable domains are
found, and temporarily changes over to using the primary DB (because the replica
transaction was timing out at 30 seconds on large databases). I'll switch this
over to use batching and move it back to replica afterwards, but this should
unblock us temporarily.
This PR makes it possible to build the Nomulus code base using Java 17.
Building with Java 11 continues to be possible and the resulting bytecode is
still at the Java 8 level. Also upgraded Gradle to 8.5.
There are several necessary changes to make this happen:
1. Some Gradle plugins need to be upgraded to support Java 17, notably
errorprone. As a result, a lot more "errors" were caught and corrected.
2. All test code is now built and run at the Java 8 level. Previously it was left
undefined (which defaults to the version of the compiler) and had led to
situations where we inadvertently called Java 8+ features in production that
are not caught by tests. The change also made the java8compatibility subproject
obsolete, which is therefore removed.
3. Removed the docs subproject. Its main use is to generate flows.md, but it
relies heavily on Java internal APIs that have changed significantly with each
version. Upgrading to Java 11 required extensive refactoring of the code there,
and Java 17 again removed many APIs that were used. I don't think it is worth
the maintenance effort just to have a tool to generate flows.md which no one
actually reads.
4. Capped a few GCP dependencies because the latest version depends on
grpc-java >= 1.59.0, which includes a runtime incompatibility
(https://github.com/grpc/grpc-java/releases/tag/v1.59.0).
* Check BSA block status in CheckApi
Checks for and reports BSA block status if the name is not registered or
reserved.
Also moves CheckApiActionTest to standardTest. Whatever problem forcing
it to another suite has apparently disappeared.
Supports the full blocklist download cycle (download, diffing, diff-apply, and order-status reporting) and the refreshing of unblockable domains.
Submitted due to tight deadline. We will conduct post-submit review and refactoring.
Scrypt is much more computationally expensive than SHA256 (by design), which
resulted in test run time doubling due to most tests initializing canned
data that uses hashing.
Since our tests are not verifying the correctness of a specific hashing
algorithm anyway, this PR makes it so that simple concatenation is used
in tests.
Also moved RegistryEnvironment to the util subproject so it can be called by
PasswordUtils, which makes sense as it is a utility class.
* Add BigInt conversion to TimedTransitionProperty<Money> deserializer to handle JPY currency
* Remove unnecessary lines in test
* Add eap schedule check
* Don't use raw LinkedHashMap type
* add timezone
From our investigation, the Monday night WHOIS storm does not cause any
strain to the backend system. The backend latency metrics are all well within
the limits. The latency measured from the proxy matches observed latency
by the prober, and we see that the "used" CPU is 1.5x of "requested" CPU
during the time when the latency is above the threshold.
Making this change hopefully removes the proxy as the bottleneck and
ameliorates the pages.
Add the BsaDomainRefresh class which tracks the refresh actions.
The refresh action checks for changes in the set of registered and
reserved domains, which BSA calls unblockables.
Currently, a verify action is enqueued every time the upload method
succeeds. Because the upload job is wrapped in a transaction, the
same task will be enqueued again if the transaction retries.
We cannot move the upload method outside the transaction because the
read-upload-write logic needs to be atomic, and the upload part itself
is idempotent (therefore retriable). We can, however, move the
enqueuing part outside the transaction as we only need to enqueue the
verify task once the transaction succeeds. This should fix the issue
where multiple verify jobs try to hit the same marksdb endpoints,
resulting in 429 (Too Many Requests) errors.
* Add a dryrun tag to UpdatePremiumListCommand and early exit command if no new changes to the list
* Change prompt string when no change to list to reflect that there is no actual prompted user input
* Add camelCase and correct flag name
This might be the cause of the SQL performance degradation that we are
observing during the recent launch. The change went in a month ago but
there hasn't been enough increase in mutating traffic to make it
problematic until the launch.
Note that presubmits should run faster too with this change, which
serves as evidence that excessive logging is the culprit.
Add the BsaDomainRefresh table which tracks the refresh actions.
The refresh action checks for changes in the set of registered and
reserved domains, which BSA calls unblockables.
This doesn't fix any issues with dead/livelocks when deleting or
updating allocation tokens, but it at least will significantly reduce
the time to load the tokens that we'll want to update/delete.
The code as previously written assumed that creation fees would be the
same as renewal fees -- this is not the case for anchor tenants, where
the renewal fee is always the standard cost for the TLD (instead of any
premium cost). This was already handled properly in the actual billing
implementation, but we didn't tell the user the right renewal cost in
domain checks.
This also removes some warning logs related to nested transactions
We shouldn't have to parse through every single entry to see what
changed
Note: we don't do this for premium lists because those can be HUGE and
we don't want/need to load and display every entry. This was an explicit
choice made in https://github.com/google/nomulus/pull/1482
Since the replica SQL instance is read-only, any transaction performed
on it should be explicitly read-only, which would allow PostgreSQL to
optimize away (some) use of predicate locks.
Also changed the EPP cache to read from the replica. The foreign key
cache already behaves this way.
See: https://www.postgresql.org/docs/current/transaction-iso.html
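At the JDBC level, the effect being relied on looks roughly like this; the connection details and query are placeholders (Nomulus goes through JPA rather than raw JDBC):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public final class ReadOnlyReplicaSketch {
  public static void main(String[] args) throws Exception {
    // Requires the PostgreSQL JDBC driver on the classpath; host/user/password are placeholders.
    try (Connection conn =
        DriverManager.getConnection("jdbc:postgresql://replica-host/registry", "reader", "pw")) {
      conn.setAutoCommit(false);
      // Marks subsequent transactions READ ONLY, letting PostgreSQL skip (some) predicate locks.
      conn.setReadOnly(true);
      try (Statement stmt = conn.createStatement();
          ResultSet rs = stmt.executeQuery("SELECT 1")) {
        while (rs.next()) {
          System.out.println(rs.getInt(1));
        }
      }
      conn.commit();
    }
  }
}
```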
For reasons unclear at this point, Java 17's servlet implementation on
GAE injects IP addresses (including unroutable private IPs) into the
standard X-Forwarded-For header, which we currently use to embed
registrar IP addresses to check against the allow list. This results in
the server not properly parsing the header and rejecting legitimate
connections.
This PR sets a custom header that should not be interfered with by any
JVM implementation to store the IP address, while maintaining the old
header as a fallback. The proxy will set both headers to allow the
server to gracefully migrate between Java 8 and Java 17 (and potentially
roll back).
Also removed some headers and logic that are not used.
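A simplified sketch of the fallback logic; the custom header name and parsing details are assumptions, not the exact values used by the proxy and server:
```java
import java.util.Optional;

public final class ClientIpFallbackSketch {
  /** Prefers the proxy-set custom header and falls back to the first X-Forwarded-For hop. */
  static Optional<String> clientIp(Optional<String> customHeader, Optional<String> xForwardedFor) {
    if (customHeader.isPresent() && !customHeader.get().isBlank()) {
      return Optional.of(customHeader.get().trim());
    }
    // X-Forwarded-For may now contain extra (private) addresses injected by the runtime,
    // so only the first entry is treated as the client address.
    return xForwardedFor.filter(v -> !v.isBlank()).map(v -> v.split(",")[0].trim());
  }
}
```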
This does not include any styling for now, just wanted to make sure
we're all good with regards to the basic approach. I'm open to suggestion on
which columns to include.
Note: filter searching is not implemented yet because the backend does
not allow for it (yet)
Java 17 injects unexpected headers into X-Forwarded-For, which causes
issues with validating incoming IP addresses.
This is a partial reversion of #2201. We are still keeping Java 17 in other environments, but sandbox and production need to be able to parse the header to accept incoming EPP connections from registrars. Once we fix it we will re-enable Java 17 in these environments.
This allows us to switch the proxy to a different client ID without
disrupting the service. This is a temporary measure and will be removed
once the switch is complete.
Also adds a placeholder getter in the Tld class, so that it can be
mocked/spied in tests. This way more BSA related code can be submitted
before the schema is deployed to prod.
This also adds a targeted QPS as a parameter in case we need to manually bump it
up (or down) for some reason without having to make code changes and re-deploy.
This PR makes a few changes to make it possible to turn on
per-transaction isolation level with minimal disruption:
1) Changed the signatures of the transact() and reTransact() methods to allow
passing in lambdas that throw checked exceptions (see the sketch after this
list). Previously one always had to wrap such lambdas in try-and-rethrow
blocks, which wasn't a big issue when one could liberally open nested
transactions around small lambdas and keep the "throwing" part outside the
lambda. This becomes a much bigger hassle when the goal is to eliminate nested
transactions and put as much code as possible within the top-level lambda. As a
result, the transactNoRetry() method now handles checked exceptions by
re-throwing them as runtime exceptions.
2) Changed the name and meaning of the config file field that used to
indicate if per-transaction isolation level is enabled or not. Now it
decides, if transact() is called within a transaction, whether to
throw or to log, regardless of whether the transaction could have
succeeded based on the isolation level override (if provided). The
flag will initially be set to false and will help us identify all
instances of nested calls and either refactor them or use reTransact()
instead. Once we are fairly certain that no nested calls to transact()
exist, we flip the flag to true and start enforcing this logic.
Eventually the flag will go away and nested calls to transact() will
always throw.
3) Per-transaction isolation level will now always be applied, if an
override is provided. Because currently there should be no actual
use of this feature (except for places where we explicitly use an
override and have ensured no nested transactions exist, like in
RefreshDnsForAllDomainsAction), we do not expect any issues with
conflicting isolation levels, which would result in failure.
4) transactNoRetry() is made package private and removed from the
exposed API of JpaTransactionManager. This saves a lot of redundant
methods that do not have a practical use. The only instance where this
method was called outside the package was in the reader of
RegistryJpaIO, which should have no problem with retrying.
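A hand-rolled sketch of the shape described in point 1; the names and signatures here are illustrative stand-ins, not the actual JpaTransactionManager API:
```java
public final class ThrowingTransactSketch {

  @FunctionalInterface
  interface ThrowingCallable<T> {
    T call() throws Exception;
  }

  /** Runs the work (transaction begin/commit elided), rethrowing checked exceptions unchecked. */
  static <T> T transact(ThrowingCallable<T> work) {
    try {
      return work.call();
    } catch (RuntimeException e) {
      throw e;
    } catch (Exception e) {
      throw new RuntimeException(e); // checked exceptions surface as runtime exceptions
    }
  }

  public static void main(String[] args) {
    // The caller can pass a lambda that declares checked exceptions without try/rethrow noise.
    java.net.URI uri = transact(() -> new java.net.URI("https://registry.example"));
    System.out.println(uri);
  }
}
```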
We have been using SHA256 to hash passwords (for both EPP and registry lock),
which is now considered too weak.
This PR switches to using Scrypt, a memory-hard slow hash function, with
recommended parameters per go/crypto-password-hash.
To ease the transition, when a password is being verified, both Scrypt
and SHA256 are tried. If SHA256 verification is successful, we re-hash
the verified password with Scrypt and replace the stored SHA256 hash
with the new one. This way, as long as a user uses the password once
before the transition period ends (when Scrypt becomes the only valid
algorithm), there would be no need for manual intervention from them.
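A pseudocode-level sketch of that dual verification, with hypothetical hash/verify helpers and a stand-in credential store (none of these names come from the actual code):
```java
public final class HashUpgradeSketch {

  record Credential(String storedHash, String salt) {}

  interface CredentialStore {
    Credential load(String registrarId);
    void save(String registrarId, Credential updated);
  }

  static boolean verifyAndMaybeUpgrade(
      String registrarId, String password, CredentialStore store) {
    Credential cred = store.load(registrarId);
    if (scryptVerify(password, cred.salt(), cred.storedHash())) {
      return true; // already stored with the new algorithm
    }
    if (sha256Verify(password, cred.salt(), cred.storedHash())) {
      // Legacy hash verified: transparently replace it with a Scrypt hash of the same password.
      store.save(registrarId, new Credential(scryptHash(password, cred.salt()), cred.salt()));
      return true;
    }
    return false;
  }

  // Placeholder primitives; a real implementation would call a vetted crypto library.
  private static boolean scryptVerify(String pw, String salt, String hash) { return false; }
  private static boolean sha256Verify(String pw, String salt, String hash) { return false; }
  private static String scryptHash(String pw, String salt) { return "scrypt-hash"; }
}
```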
We will send out notifications to users to remind them of the transition
and urge them to use the password (which should not be a problem with
EPP, but less so with the registry lock). After the transition,
out-of-band reset for EPP password, or remove-and-add on the console for
registry lock password, would be required for the hashes that have not
been re-saved.
Note that the re-save logic is not present for console user's registry
lock password, as there is no production data for console users yet.
Only legacy GAE user's password requires re-save.
The domains (and their associated billing recurrences) were already being loaded
to check restores, but they also now need to be loaded for renews and transfers
as well, as the billing renewal behavior on the recurrence could be modifying
the relevant renew price that should be shown. (The renew price is used for
transfers as well.)
See https://buganizer.corp.google.com/issues/306212810
Both actions have not been used for a while (the wipe out action
actually caused problems when it ran unintentionally and wiped out QA).
Keeping them around is a burden when refactoring efforts have to take
them into consideration.
It is always possible to resurrect them from git history should the need
arise.
We finally fixed Spinnaker (I hope) to deploy bundled services with Java
17 runtime. Note that the bytecodes are still targeting Java 8. The only
change this PR introduces is to switch the runtime environment to Java
17.
TESTED=deployed to crash.
This was already set to true in all environments except prod last week. Now that the release has gone out and we have not seen any issues, we should feel safe turning this on in production as well.
We previously needed to use URLFetch in some instances where TLS 1.3 is
required (mostly when connecting to ICANN servers), and the JDK-bundled SSL
engine that came with the App Engine runtime did not support TLS 1.3.
It appears now that the Java 8 runtime on App Engine supports TLS 1.3
out of the box, which allows us to get rid of URLFetch, which depends on
App Engine APIs.
Also removed some redundant retry and logging logic, now that we know
the HTTP client behaves correctly.
TESTED=modified the CannedScriptExecutionAction and deployed to alpha, used the
new HTTP client to connect to the three URL endpoints that were
problematic before and confirmed that TLS connections can be established. HTTP
sessions were rejected in some cases when authentication failed, but
that was expected.
* Add a cloudbuild-tld-sync job
This job checks the Tld config files in the internal repo and syncs them with the actual Tld objects in the database using the configure_tld nomulus command.
* Add the dockerfile and shell script
* Force the command
* Add comments
* add newline
* Create a separate copy of the job for each environment
* fix file name
* Fix indentation
* Lower the isolation level for RefreshDnsForAllDomainsAction
This lowers the isolation level to TRANSACTION_REPEATABLE_READ, which will hopefully allow the entire action to run without timing out on our larger TLDs.
* Unchange default config
Cache loads will likely always be inner transactions, if they use a transaction at all. Cache loads do not always open a transaction, since they are only necessary if the cache is not fresh at the time it is accessed. Since the cache itself needs to decide whether or not a DB transaction is necessary, it should use the reTransact method to safely indicate that the isolation level of the outer transaction is what should be used.
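A simplified sketch of that cache-load pattern; the transaction manager interface here is a stand-in, not the real Nomulus API:
```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public final class ReTransactCacheSketch {

  /** Stand-in transaction manager: reTransact joins an enclosing transaction if one exists. */
  interface Tm {
    <T> T reTransact(Supplier<T> work);
  }

  private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
  private final Tm tm;

  ReTransactCacheSketch(Tm tm) {
    this.tm = tm;
  }

  String get(String key) {
    // The DB is only hit on a cache miss; since this may run inside a caller's transaction,
    // the load uses reTransact so it never conflicts with the outer isolation level.
    return cache.computeIfAbsent(key, k -> tm.reTransact(() -> loadFromDb(k)));
  }

  private String loadFromDb(String key) {
    return "value-for-" + key; // placeholder for a real query
  }
}
```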
If the cert(s) are invalid or expired that's a problem, but that
shouldn't necessarily prevent us from changing other things. If we're
not changing the certs, leave them alone.
If we don't explicitly handle random unexpected exceptions, the error
that the front end receives includes a big ole stacktrace, which is
unhelpful for regular users and possibly bad to expose. Instead, we
should provide a vague "something went wrong" message.
Separately, we can create a default set of SnackBar options and use that (we
want it to last longer than 1.5 seconds because that's pretty short).
Also made some refactoring to various Auth related classes to clean up things a bit and make the logic less convoluted:
1. In Auth, remove AUTH_API_PUBLIC as it is only used by the WHOIS and EPP endpoints accessed by the proxy. Previously, the proxy relied on OAuth and its service account was not given the admin role (in OAuth parlance), so we made those endpoints accessible by a public user, deferring authorization to the actions themselves. In practice, OAuth checks for allowlisted client IDs and only the proxy client ID was allowlisted, which effectively limited access to only the proxy anyway.
2. In AuthResult, expose the service account email if it is at APP level. RequestAuthenticator will print out the auth result and therefore log the email, making it easy to identify which account was used. This field is mutually exclusive to the user auth info field. As a result, the factory methods are refactored to explicitly create either APP or USER level auth result.
3. Completely re-wrote RequestAuthenticatorTest. Previously, the test mingled testing functionalities of the target class with testing how various authentication mechanisms work. Now they are cleanly decoupled, and each method in RequestAuthenticator is tested individually.
4. Removed nomulus-config-production-sample.yaml as it is vastly out of date.
Surround the dot in unsafe domain names with square brackets. This
is suggested by Gmail abuse detection and allows outgoing messages
to pass Gmail's check. This should also help with recipients' checks.
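A tiny illustration of that de-fanging, under the assumption that "surround with square brackets" means the common "[.]" form:
```java
public final class DefangSketch {
  static String defang(String domainName) {
    return domainName.replace(".", "[.]");
  }

  public static void main(String[] args) {
    System.out.println(defang("abuse.example")); // prints abuse[.]example
  }
}
```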
Adds a delay between emails sent in a tight loop. This helps avoid
triggering Gmail abuse detections.
Also updated the recipient address for billing alerts.
* Add custom YAML serializer for Duration
This addresses b/301119144. This changes the YAML representation of a TLD to show Duration fields as a String representation using the Java Duration object's toString() format. This eliminates the previous ambiguity over the time unit that is being used for each duration.
* change standardSeconds to standardMinutes in test
* Add custom serializer to the entire mapper
Unless an --oauth flag is used, the nomulus tool will only send the OIDC
header. The server still accepts both headers and the user should use
`create_user` command to create an admin User (with the --oauth flag on), which
will then allow one to use the nomulus tool without the --oauth flag.
The --oauth flag and the server's ability to support OAuth-based
authentication will be removed soon. Users are urged to create the User
object in time to avoid service interruption.
TESTED=verified on alpha.
We want to do this because it takes the user to an external site, which
could potentially lead to confusion if they tried to use the back button
without a new tab.
Add documentation that describes the current Cloud Build status notification
to Google Chat, as well as how to update the configuration and the
notifier service.
This isn't the prettiest thing, but it replicates the type of view /
edit functionality that we had in the original console.
Of note: this doesn't include input field validation, which would
probably be a good idea to add at some point.
It will call equalsImmutableObject(), which seems the right thing to do.
We only care if the two Tld objects have the same fields, not if they
are the same object. ErrorProne complained about comparison by identity.
* Defend against deserialization-based attacks
Added the `SafeObjectInputStream` class that defends against attacks using
malformed serialized data, including remote code execution and
denial-of-service attacks.
Started using the new class to handle EPP resource VKeys and
PendingDeposits, which are passed across credential-boundaries: between
TaskQueue and AppEngine server, and between AppEngine server and the RDE
pipeline on GCE. Note that the wire format of VKeys does not change,
therefore existing tasks sitting in the TaskQueue are not affected.
Also removed an unused class: JaxbFragment.
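The general technique is an ObjectInputStream that refuses unexpected classes; this is an illustrative sketch of that defense, not the actual SafeObjectInputStream implementation:
```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;
import java.util.Set;

public class AllowlistingObjectInputStream extends ObjectInputStream {
  private final Set<String> allowedClassNames;

  public AllowlistingObjectInputStream(InputStream in, Set<String> allowedClassNames)
      throws IOException {
    super(in);
    this.allowedClassNames = allowedClassNames;
  }

  @Override
  protected Class<?> resolveClass(ObjectStreamClass desc)
      throws IOException, ClassNotFoundException {
    // Rejecting anything outside the expected classes blocks gadget-chain payloads
    // that rely on deserializing arbitrary classes found on the classpath.
    if (!allowedClassNames.contains(desc.getName())) {
      throw new InvalidClassException(desc.getName(), "class not allowed for deserialization");
    }
    return super.resolveClass(desc);
  }
}
```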
This token is only ever used for logging. The GAE OAuth service will
parse the header directly when called to retrieve the current user and
user id. Logging it in prod could be a security risk if the logs are
leaked.
Previously this didn't properly deal with nested routings, e.g.
"settings/whois". It tried to just pass "whois" as the next url which
doesn't work with the router because it's nested under the settings.
Using all parts of the URL allows us to handle the nesting.
* Add a configureTld command that uses YAML
* Add more tests and edge case handling
* Add out of order test and fix wrong inject
* small changes
* Add check for ascii
* Add check for ROID suffix
Hibernate logs certain information at the ERROR level, which for the
purpose of troubleshooting is misleading, since most affected operations
succeed after retry. ERROR-level logging should only be added by Nomulus
code.
This PR does two things:
1. Disables all logging in two Hibernate classes: we cannot disable
logging at a finer granularity, and we cannot preserve lower-level
logging while disabling ERROR.
2. Adds a DatabaseException which captures all error details that may
escape the typical loggers' attention: SQLException instances can be
chained in a different way from Throwable's `getCause()` method.
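For context, the SQLException-specific chain referred to here is the one walked via getNextException(), which getCause() does not traverse; a small sketch:
```java
import java.sql.SQLException;

public final class SqlExceptionChainSketch {
  static String describe(SQLException root) {
    StringBuilder sb = new StringBuilder();
    for (SQLException e = root; e != null; e = e.getNextException()) {
      sb.append(
          String.format(
              "SQLState=%s errorCode=%d message=%s%n",
              e.getSQLState(), e.getErrorCode(), e.getMessage()));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    SQLException first = new SQLException("first failure", "40001", 0);
    first.setNextException(new SQLException("second failure", "23505", 0));
    System.out.print(describe(first));
  }
}
```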
The old semantics for TransactionalFlow meant "anything that needs to mutate the
database", but then FlowRunner was not creating transactions for
non-transactional flows even though nearly every flow needs a transaction (as
nearly every flow needs to hit the database for some purpose). So now
TransactionalFlow simply means "any flow that needs the database", and
MutatingFlow means "a flow that mutates the database". In the future we will
have FlowRunner use a read-only transaction for TransactionalFlow and then a
normal writes-allowed transaction for MutatingFlow. That is a TODO.
This also fixes up some transact() calls inside caches to be reTransact(), as we
rightly can't move the transaction outside them as from some callsites we
legitimately do not know whether a transaction will be needed at all (depending
on whether said data is already in memory). And it removes the replicaTm() calls
which weren't actually doing anything as they were always nested inside of normal
tm()s, thus causing confusion.
In the future, reTransact() will be the only way to initiate a transaction that
doesn't fail when called inside an outer wrapping transaction (when wrapped,
it's a no-op). It should be used sparingly, with a preference towards
refactoring the code to move transactions outwards (which this PR also
contains).
Note that this PR includes some potential efficiency gains caused by existing
poor use of transactions. E.g. in the file RefreshDnsAction, the existing code
was using two separate transactions to refresh the DNS for domains and hosts
(one is hidden in loadAndVerifyExistence()), whereas now as of this PR it has a
single wrapping transaction to do so.
It turns out that disallowing all nested transactions is impractical. So
in this PR we make it possible to run nested transactions (which are not
really nested as far as SQL is concerned, but rather lexically nested
calls to tm().transact() which will NOT open new transactions when
called within a transaction) as long as there is no conflict between the
specified isolation levels between the enclosing and the enclosed
transactions.
Note that this will change the behavior of calling tm().transact() with
no isolation level override, or a null override INSIDE a transaction.
The lack of the override will allow the nested transaction to run at
whatever level the enclosing transaction runs at, instead of at the
default level specified in the config file.
* Fix Cloud Tasks failure to retry
Replace `SC_NOT_MODIFIED` (304) with `SC_SERVICE_UNAVAILABLE` (503) when
data is not available yet. Affected actions are invoice- and spec11-publishing.
It is confirmed that Cloud Tasks currently does not retry with code 304,
despite the public documentation stating so. We will use 503 for now,
pending the decision by Cloud Tasks whether to change behavior or
documentation.
The code `TOO_EARLY` (425) is another alternative. It is not meant for
our use case but at least sounds like it is. However, it is not in any
javax.servlet jar. We don't want to define our own constant, and we cannot upgrade
to jakarta.servlet yet.
Also revert previous mitigation.
This includes a bit of refactoring of the GSON creation. There can exist
some objects (e.g. Address) where the JSON representation is not equal to the
representation that we store in the database. For these objects, when
deserializing, we should update the objects so that they reflect the
proper DB structure (indeed, this is already what we do for the XML
parsing of Address).
* Change PackagePromotion to BulkPricingPackage
* More name changes
* Fix some test names
* Change token type "BULK" to "BULK_PRICING"
* Fix missed token_type reference
* Add todo to remove package type
A config file field is added to control if per-transaction isolation
level is actually used. If set to true, nested transactions will throw
a runtime exception as the enclosing transaction may run at a different
isolation level.
In this PR we only add the ability to specify the isolation level,
without enabling it in any environment (including unit tests), or
actually specifying one for any query. This should allow us to set up
the system without impacting anything currently working.
* Mitigate Cloud task retry problem
Increase PublishSpec11Action start delay to avoid the need to retry.
The only other use case is invoice, which typically does not retry:
delay is 10 minutes, pipeline finishes within 7 minutes.
This includes two changes:
1. Creating a base string-type adapter for use parsing to/from JSON
classes that are represented as simple strings
2. Changing the object-provider methods so that the POST bodies should
contain precisely the expected object(s) and nothing else. This way,
it's easier for the frontend and backend to agree that, for instance,
one POST endpoint might accept exactly a Registrar object, or a list
of Contact objects.
Co-Authored-By: gbrodman <gbrodman@google.com>
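A hedged sketch of what such a base string-type adapter can look like with Gson; the constructor shape and usage below are assumptions, not the console's actual adapter:
```java
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;
import java.util.function.Function;

/** Serializes any type that round-trips through a single string. */
public final class StringBasedTypeAdapter<T> extends TypeAdapter<T> {
  private final Function<T, String> toStringFn;
  private final Function<String, T> fromStringFn;

  public StringBasedTypeAdapter(Function<T, String> toStringFn, Function<String, T> fromStringFn) {
    this.toStringFn = toStringFn;
    this.fromStringFn = fromStringFn;
  }

  @Override
  public void write(JsonWriter out, T value) throws IOException {
    out.value(toStringFn.apply(value)); // emit the object as a bare JSON string
  }

  @Override
  public T read(JsonReader in) throws IOException {
    return fromStringFn.apply(in.nextString()); // rebuild the object from that string
  }
}
```
A concrete type would then be registered via GsonBuilder#registerTypeAdapter with its own to-string/from-string functions.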
2.25.0 contains a breaking change that made HttpStorageOptions not
serializable, which breaks RDE as it needs to access GCS from Beam.
2.22.6 was the last version that was used before the Gradle upgrade.
Also had to downgrade google-cloud-nio to pass the tests.
For some inexplicable reason, I had to manually add
guava-listenablefuture as
testRuntimeClasspath/runtimeClasspath/deploy_jar dependencies to the
networking, docs and prober subprojects' lock files, as running
`gradle test --write-locks` would NOT add them and succeed; but without
`--write-locks`, running the corresponding tests would fail.
See: b/294378137.
Use a system property to specify whether this check should be executed.
We will update the presubmit test script to run this check only during
foss-pr.
Add placeholder configs for sending emails using Gmail in GSuite.
The names of the new configs are temporary. After migration they
will revert to the names currently in use by the AppEngine email API.
The instance ID used to be uniquely determined by the App Engine SDK. Since
we no longer call the SDK, we need a way to distinguish instances so
that their metrics do not stomp on each other and result in time
inversion errors (as we have seen frequently in the logs since the
removal of the App Engine SDK).
This includes removing (hopefully temporarily) the gradle-lint plugin as
it is incompatible with various Gradle versions (see
https://github.com/nebula-plugins/gradle-lint-plugin/issues/393). This
is somewhat unfortunate since the plugin is useful for removing unused
dependencies, though with the relatively small amount of Gradle code we
write hopefully it will not be missed much. If Nebula changes their
code to be compatible with Gradle 8+, we can re-add it easily.
This upgrade means we can remove the code added in 342051e1.
* Use Jackson to create and read Tld YAML files
* Add getObjectMapper to TldYamlUtils
* revert lockfiles
* Fix optionals
* Add more tests and javadocs
* small fixes
This is part of the spec in RFC 5730 that we hadn't implemented until now. Note
that this requires changing LoginFlow to be transactional, but I don't think
that should cause any issues.
It was only called in one place (in actual production code), and it was just
slightly obscuring the fact that re-saves can be scheduled for multiple points
in the future in a way that wasn't amazingly helpful to understanding the
system logic at the callsite.
This includes two changes, the second necessary for testing the first.
1. We add the rdap-queries field as mandated by the amendment to the
registry agreement,
https://itp.cdn.icann.org/en/files/registry-agreement/proposed-global-amendment-base-gtld-registry-agreement-12-04-2023-en.pdf.
This is fairly similar to the whois-queries field where we just query
the logs, but instead of searching for "whois" we search for "rdap".
2. BigQuery doesn't use MAX to refer to the bigger of two fields; MAX
accepts an array as an argument. In order to do what we want (and to
have the BigQuery statements succeed), we need to use GREATEST.
Tested both versions in alpha and production BigQuery instances.
This is part of b/247839944 as a followup to the large bug from
September 2022. As a result of that, there are domains whose
BillingRecurrence objects were closed but the domain wasn't deleted. In
order to avoid having these domains stick around forever without being
billed, we want to restart billing on them whenever their next billing
cycle would have been.
See b/290228682: there are edge cases in which the net_renew would be negative when
a domain is cancelled by superusers during the renew grace period. The correct thing
to do is attribute the cancellation to the owning registrar, but that would require
changing the owning registrar of the corresponding cancellation DomainHistory,
which has cascading effects that we don't want to deal with. As such we simply
floor the number here to zero to prevent any negative value from appearing, which
should have negligible impact as the edge case happens very rarely, specifically
when a cancellation happens during the grace period by a registrar other than the
owning one. All the numbers here should be positive to pass ICANN validation.
See b/248035435 for more details / reasoning, but basically this will
make it easier if we ever need to restore user actions in the future (or
figure out which user actions went wrong)
It was used by cron job and task queues, which now use OIDC-based auth.
Also renamed and consolidated auth enums to make them easier to
understand. Ultimately we should get rid of the AuthMethod part as OIDC
will be the only auth method used.
Based on the updated routing map:
Backend and tools: the only change is that INTERNAL is removed from allowed
auth methods. Should be a no-op.
Pubapi: INTERNAL is removed from allowed auth. For endpoints that only
allowed INTERNAL before, API and LEGACY become the allowed methods.
However this should not affect anything because regardless of which auth
method is ultimately used, the required auth level is always NONE for
pubapi endpoints. Therefore any auth result is discarded anyway.
Frontend: INTERNAL is removed. RegistryLockVerifyAction has lowered
its required auth level to NONE because it extends HtmlAction, which can
redirect the user to login if necessary. All other endpoints extending
HtmlAction require NONE, so it's better to keep things consistent.
Instead of using REDACTED FOR PRIVACY everywhere we should just include
the empty string (this is what the spec says, what other gTLD registrars
do, and what the RDAP conformance tool at
https://github.com/icann/rdap-conformance-tool says to do).
In the contact VCards, we omit redacted fields entirely unless the spec
requires that they exist (the version number and an empty 'fn' field).
This also applies to the "handle" field.
Eventually we will probably want to add the redaction extension but
that's not RFCed yet and isn't required for the August RDAP conformance
deadline.
It was previously calling toString() on an Optional<PremiumList> which was
unnecessarily verbose. The existing premium list is required to be present
anyway.
This is a follow-on to comments in PR #2037. It makes the main loop cleaner and
also removes ambiguities around database handling when the first query is run
with the cursor still empty because no results have been found yet.
This PR changes the two flavors of OIDC authentication mechanisms to
verify the same audience. This allows the same token to pass both
mechanisms. Previously the regular OIDC flavor used the project ID as
its required audience, which does not work for local user credentials
(such as the ones used by the nomulus tool), which require a valid OAuth
client ID as the audience when minting the token (the project ID is NOT a
valid OAuth client ID).
I considered allowing multiple audiences, but the result is not as clean
as just using the same one everywhere, because the fall-through logic would
have generated a lot of noise from failed attempts.
This PR also changes the client side to solely use OIDC tokens whenever
possible, including the proxy, Cloud Scheduler, and Cloud Tasks. The nomulus
tool still uses an OAuth access token by default because it requires USER-level
authentication, which in turn requires us to fill the User table with objects
corresponding to the email address of everyone needing access to the tool.
TESTED=verified each client is able to make authenticated calls on QA with or
without IAP.
* Add an includeDeleted option to RefreshDnsForAllDomainsAction
* Add batching to the query
* Some refactoring
* Make batch size configurable
* Set status to ok
* Combine into one transaction
* Remove smear minutes parameter
* Only pass in lastInPreviousBatch
Add a --canary option (default to false) to the CurlCommand that allows
connection to the canary endpoints.
During canary analysis, only the DEFAULT-canary receives traffic. This
new flag allows us to test other canary services manually using the
curl command.
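As a rough sketch (not the actual CurlCommand code), a JCommander flag with that default might look like the following; the class and field names are purely illustrative:
```java
import com.beust.jcommander.Parameter;

// Illustrative only: a JCommander boolean flag like --canary, defaulting to false.
public class CanaryFlagSketch {
  @Parameter(
      names = "--canary",
      description = "If set, send the request to the canary endpoint of the service.")
  boolean canary = false;
}
```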
* Add Gmail Client and set up tests
Add a Gmail client and manually triggered email tests in
CannedScriptExecutionAction.
We will test Gmail with Google Workspace in Sandbox, since Alpha and
Crash are not properly set up for Google Workspace, and we have not
figured out why.
* Remove nested transaction from requestDnsRefresh
* Add a bulk version
* Remove transaction time as field
* Only add delay once
* have PublishDnsUpdatesAction use bulk refresh
The Java code will be added in a followup PR.
Also fixed tests failing due to org.json upgrade: decimal whole numbers
no longer have their fractional parts removed, so currency value strings
must end with ".00" instead of ".0".
When RdeReportAction is invoked without a prefix parameter (as in the
case when it is kicked off by cron jobs for potential catch ups), we
need to use the same heuristic that's employed in RdeUploadAction to
find the most recent prefix for the given watermark, otherwise the job
will not find any deposits to upload.
Also renamed RdeUtil to RdeUtils, to be consistent with our naming
conventions.
IAP and regular OIDC auth mechanisms are unified under a base class that
produces either APP or USER level AuthResult based on the principal email
found in the OIDC token.
Also moved some enum classes to better organize code structure.
This encompasses most of the basic information that is viewable in the
existing console: essentially, just viewing the base info of the Registrar
object.
Use the ApplicationDefaultCredential annotation instead.
The new annotation has been verified in sandbox and production using the
'executeCannedScript' endpoint. The verification code is removed in this
PR too.
This adds a possible configuration point "defaultServiceAccount" (which
in GAE will be the standard GAE service account). If this is configured,
CloudTasksUtils can create tasks with standard HTTP requests with an
OIDC token corresponding to that service account, as opposed to using
the AppEngine-specific request methods.
This also works with IAP, in that if IAP is on and we specify the IAP
client ID in the config, CloudTasksUtils will use the IAP client ID as
the token audience and the request will successfully be passed through
the IAP layer.
Tested in QA.
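A minimal sketch of the kind of task this enables, using the Cloud Tasks v2 client library; the URL and variable names are illustrative, not the actual CloudTasksUtils code:
```java
import com.google.cloud.tasks.v2.HttpMethod;
import com.google.cloud.tasks.v2.HttpRequest;
import com.google.cloud.tasks.v2.OidcToken;
import com.google.cloud.tasks.v2.Task;

class OidcTaskSketch {
  // Builds a task whose HTTP request carries an OIDC token for the configured
  // service account; the audience is the IAP client ID when IAP is enabled.
  static Task buildTask(String url, String defaultServiceAccount, String audience) {
    return Task.newBuilder()
        .setHttpRequest(
            HttpRequest.newBuilder()
                .setHttpMethod(HttpMethod.POST)
                .setUrl(url)
                .setOidcToken(
                    OidcToken.newBuilder()
                        .setServiceAccountEmail(defaultServiceAccount)
                        .setAudience(audience)))
        .build();
  }
}
```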
This column is used by the billing team to create invoices. Registrars
have asked that a single invoice be created for a given registrar,
instead of one per registrar-tld pair. This should have no other effect
on the billing pipeline as the invoice grouping key has a description
field that also contains the TLD, so the granularity as a whole does not
change.
We have been using it as a poor man's timed flag that triggers a system
behavior change after a certain time. We have no foreseeable future use
for it now that the DNS pull queue related code is deleted. If in the
future a need for such a flag arises, we are better off implementing a
proper flag system than hijacking this class anyway.
* Prepare switch of credential annotation
Prepare the switch from DefaultCredential to ApplicationCredential.
In nomulus tools, start using the new annotation. This is tested by
successfully using the nomulus curl command, which actually needs a
valid credential to work.
For remaining use cases of the old annotation in the Nomulus server, add
some code that relies on the new credential to work. Once this code
is tested in sandbox and production, we will switch the annotations.
If the user passes, e.g., `--allowed_nameservers=` (or the same for contact IDs),
that shouldn't mean a list consisting solely of the empty string.
Using this parameter / converter allows us to ensure that lists of
strings look reasonable.
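A minimal sketch of the intended behavior (not the actual converter class; names are illustrative):
```java
import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableList;

class ListParameterSketch {
  // An empty argument such as `--allowed_nameservers=` yields an empty list
  // rather than a list containing a single empty string.
  static ImmutableList<String> convert(String value) {
    return value.isEmpty()
        ? ImmutableList.of()
        : ImmutableList.copyOf(
            Splitter.on(',').trimResults().omitEmptyStrings().split(value));
  }
}
```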
This includes renaming the billing classes to match the SQL table names,
as well as splitting them out into their own separate top-level classes.
The rest of the changes are mostly renaming variables and comments etc.
We now use `BillingBase` as the name of the common billing superclass,
because one-time events are called BillingEvents.
* Allow rotation when updating registrar cert
When updating a registrar's primary cert, add a flag to activate
rotation of previous primary cert to failover.
This functionality is part of the prober ssl cert renewal automation.
The only method that is called from this class is setNumInstances. However we
don't currently use `nomulus set_num_instances` anywhere. If we need to change
the number of instances, it is either done by updating appengine-web.xml, which
is deployed by Spinnaker, or doing it manually as a break-glass fix via gcloud
or on Pantheon.
Because we need to check if a contact history is the most recent for its
underlying contact resource, the query-wipe out-repeat loop no longer works
ideally due to the added overhead with the query.
Instead, we refactor the logic into a Beam pipeline where the query only
needs to be performed once and history entries eligible for wipe out are
handled individually in their own transforms. Because history entries
are otherwise immutable, we can run the pipeline in relatively relaxed
repeatable read isolation level. We also do not worry about batching for
performance, as we do not anticipate this operation to put a lot of
strain on that particular table.
This also adds README files to explain the two different IDN table locations
(which have different purposes). See http://b/278565478 for more information.
This includes changes to make sure that we use the proper per-TLD IDN
tables as well as setting/updating/removing them via the Create/Update
TLD commands.
There can be situations (anchor tenants, test tokens, other ways of
getting a domain to cost $0) where we may want to delete a domain during
the add grace period but the credit applied is 0. We should not fail on
those cases.
See b/277115241 for an example.
This new table has just been approved by ICANN. It is the same as our existing
Extended Latin table, except with the removal of some lesser-used characters
with diacritic marks that are confusable variants.
The filenames for the IDN tables are made explicit to improve code readability.
And this reverses the removal of G with stroke from the existing Extended Latin
table (see PR #1938), so that that table continues to accurately reflect the
state of our previously launched TLDs.
This is the full list of removed characters:
U+00E1 # LATIN SMALL LETTER A WITH ACUTE
U+0101 # LATIN SMALL LETTER A WITH MACRON
U+01CE # LATIN SMALL LETTER A WITH CARON
U+010B # LATIN SMALL LETTER C WITH DOT ABOVE
U+01E7 # LATIN SMALL LETTER G WITH CARON
U+0123 # LATIN SMALL LETTER G WITH CEDILLA
U+01E5 # LATIN SMALL LETTER G WITH STROKE
U+0131 # LATIN SMALL LETTER DOTLESS I
U+00ED # LATIN SMALL LETTER I WITH ACUTE
U+00EF # LATIN SMALL LETTER I WITH DIAERESIS
U+01D0 # LATIN SMALL LETTER I WITH CARON
U+0144 # LATIN SMALL LETTER N WITH ACUTE
U+014B # LATIN SMALL LETTER ENG
U+00F3 # LATIN SMALL LETTER O WITH ACUTE
U+014D # LATIN SMALL LETTER O WITH MACRON
U+01D2 # LATIN SMALL LETTER O WITH CARON
U+0157 # LATIN SMALL LETTER R WITH CEDILLA
U+0163 # LATIN SMALL LETTER T WITH CEDILLA
U+00FA # LATIN SMALL LETTER U WITH ACUTE
U+00FC # LATIN SMALL LETTER U WITH DIAERESIS
U+01D4 # LATIN SMALL LETTER U WITH CARON
U+1E83 # LATIN SMALL LETTER W WITH ACUTE
U+1E81 # LATIN SMALL LETTER W WITH GRAVE
U+1E85 # LATIN SMALL LETTER W WITH DIAERESIS
U+1EF3 # LATIN SMALL LETTER Y WITH GRAVE
U+017C # LATIN SMALL LETTER Z WITH DOT ABOVE
Makes the next run at the first Monday of December, which should give us
plenty of time to fix the issue with it wiping out PII in the most recent
contact history.
* Check allowedEppActions when validating tokens
* Reflect failed tokens in the fee check
* Add tests for domainCheckFlow
* Add hyphens to fee class name
* Add clarifying comment to catch block
* Add specific exception types
* Implement default tokens for the fee extension in domain check
* Add test for expired token
* Add test for alloc token and default token
* Fix formatting
* Always check for default tokens
* Change transaction time to passed in DateTime
Also adds a DnsUtils class to deal with adding, polling, and removing
DNS refresh requests (only adding is implemented for now). The class
also takes care of choosing which mechanism to use (pull queue vs. SQL)
based on the current time and the database migration schedule map.
When submitting tasks to Cloud Tasks, we will use the built-in OIDC
authentication which runs under the default service account (not the
cloud scheduler service account). We want either to work for app-level
auth.
This does nothing for now, but in the future this will allow us to refer
to the RegistryConfig and/or Service objects from the core project. This
will be necessary when changing CloudTasksUtils to not use the AppEngine
built-in connection (it will need to use a standard HTTP request
instead).
It seems like we're hitting App Engine capacity issues resulting in actual pages
now (for whatever reason, but likely one customer), and we obviously don't want
that.
The value of the column would be set to START_OF_TIME for new entries.
Every time a row is read, the value is updated to the current time. This
allows concurrent reads to not repeatedly read the same entry that has the
earliest request time, because they would only look for rows that have a value
of process time that is before current time - some padding time.
This basically fulfills the same function that LEASE_PADDING gives us
when using a pull queue, whereby a task would be leased for a certain
time, during which it would not be leased by anyone else.
See: https://cs.opensource.google/nomulus/nomulus/+/master:core/src/main/java/google/registry/dns/ReadDnsQueueAction.java;l=99?q=readdnsqueue&ss=nomulus%2Fnomulus
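A sketch of the intended query pattern; DnsRefreshRequest and its setProcessTime setter are stand-ins here, not the actual schema or Nomulus code:
```java
import java.util.List;
import javax.persistence.EntityManager;
import org.joda.time.DateTime;
import org.joda.time.Duration;

class LeaseSketch {
  // Reads a batch of requests whose processTime is older than now minus the padding,
  // then bumps processTime to now so concurrent readers skip them for a while.
  static List<DnsRefreshRequest> leaseBatch(
      EntityManager em, DateTime now, Duration padding, int batchSize) {
    List<DnsRefreshRequest> batch =
        em.createQuery(
                "FROM DnsRefreshRequest WHERE processTime < :cutoff ORDER BY requestTime ASC",
                DnsRefreshRequest.class)
            .setParameter("cutoff", now.minus(padding))
            .setMaxResults(batchSize)
            .getResultList();
    batch.forEach(r -> r.setProcessTime(now)); // persisted by the surrounding transaction
    return batch;
  }
}
```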
This PR adds an alternative method to upload Lordn to Nordn server without
using App Engine pull queue. A new database migration stage is added to control
whether a new task is scheduled with the old or new method. The
NordnUploadAction is configured to process both kinds of tasks. Once the tasks
scheduled with the old method are all processed, we can start using the
new method exclusively.
See: go/registry-pull-queue-redesign
There is no point comparing the old CRL to the new ones when the old one
is invalid. This can happen when the CA cert rotates, after which the
old CRL stops being valid as it fails signature verification against the
new cert.
This change will allow us to keep updating the CRL after a CA rotation without
having to manually delete the old CRL from the database.
See b/270983553.
ICANN doesn't like this character because it's confusable with a normal G (the
stroke tends to get lost in the visual clutter of the descender), and .com's
Extended Latin table doesn't use it either. Best to get rid of it.
For some reason `./gradlew clean build` on master is failing for me on
multiple machines due to a new org.json:json version triggering license
violations, even though the lock files are not changing.
Note that the old versions are still present because if I remove
"The JSON license", which the old versions use, the check also fails...
We might (likely will) modify some of the fiddly bits around this (maybe
the GSON serialization, where we do the actual authorization, etc) but
this should be a decent basic shell structure for endpoints that the new
registrar console can call to retrieve JSON results.
This is only generated and used if "iapClientId" is set in the proxy
config. If so, we use code similar to
https://cloud.google.com/iap/docs/authentication-howto#obtaining_an_oidc_token_for_the_default_service_account
to generate an ID token that is valid for IAP. We set the token on the
Proxy-Authorization header so that we can keep using the pre-existing
access token as well -- IAP allows for us to use either the
Authorization header or the Proxy-Authorization header.
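Roughly, the token generation follows the pattern from the linked documentation; this is a sketch, not the proxy's actual code, and the method name is illustrative:
```java
import com.google.auth.oauth2.GoogleCredentials;
import com.google.auth.oauth2.IdTokenCredentials;
import com.google.auth.oauth2.IdTokenProvider;
import java.io.IOException;

class IapTokenSketch {
  // Mints an OIDC ID token for the default service account with the IAP client ID
  // as the audience; the token is then sent on the Proxy-Authorization header.
  static String getIapIdToken(String iapClientId) throws IOException {
    GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
    IdTokenCredentials idTokenCredentials =
        IdTokenCredentials.newBuilder()
            .setIdTokenProvider((IdTokenProvider) credentials)
            .setTargetAudience(iapClientId)
            .build();
    idTokenCredentials.refreshIfExpired();
    return idTokenCredentials.getAccessToken().getTokenValue();
  }
}
```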
We were under the mistaken impression before that there was a reliable
way to, out-of-band, get a GAIA ID for a particular email address.
Unfortunately, that isn't the case (at least, not in a scalable way or
one that support agents could use). As a result, we have to allow null
GAIA IDs in the database.
When we (or the support team) create new users, we will only specify the
email address and not the GAIA ID. Then, when the user logs in for the
first time, we will have the GAIA ID from the provided ID token, and we
can populate it then.
See b/260945047.
Also refactored the corresponding tests, which should make future updates easier.
This change should be deployed at or around 2023-02-15T16:00:00Z.
* Modify DomainCreateFlow to check for an applicable defaultPromoToken
* Add handling for deleted tokens
* Change cache to allocation token cache
* Abstract away cache methods
* Use AllocationToken.getAll in create flow
* Filter out empty tokens
This is an 'easy' upgrade that requires a minor change in
common/build.gradle and the removal of an unnecessary import in buildSrc.
Gradle 7.4 and above has breaking changes that break the latest nebula lint plugin. We may have to wait a while.
Due to the way the Beam pipeline is designed, it will expand a
recurring billing event when its event time is in scope for expansion,
instead of its billing time. This means that the OneTime will be generated
45 days earlier. This negates the need to check if the expansion is
finished when generating monthly invoices.
We will need to backfill the past 45 days of onetimes before the new
code is deployed. As an illustration, with the old code, a cursor time
of 2023-01-17 means that all auto-renewals whose billing time is before
2023-01-17 were created, which corresponds to an effective cursor time
of 2022-12-03 (45 days before 2023-01-17) for event time. This cursor
will need to be brought to 2023-01-17 to ensure that there is no gap in
generated event times when switching to use the new code.
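For concreteness, the 45-day offset described above, as a trivial Joda-Time snippet:
```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

class CursorMathSketch {
  public static void main(String[] args) {
    // The old cursor tracks billing time; the equivalent event-time cursor is 45 days earlier.
    DateTime billingTimeCursor = new DateTime(2023, 1, 17, 0, 0, DateTimeZone.UTC);
    System.out.println(billingTimeCursor.minusDays(45)); // 2022-12-03T00:00:00.000Z
  }
}
```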
Currently we synthesize an instance ID, which requires the use of the App
Engine Module API. Given that we only have one version of code running
at one time, and HTTP is stateless, there is no point tracking exactly
which GAE "instance" it is. We do lose information on which service (default,
backend, etc.) is writing the metric, but that does not seem very
important.
Using a constant fake instance ID allows us to get rid of another GAE
dependency.
The temporary queue.xml file is not deleted in the afterEach() method,
likely causing some flaky tests that we saw due to overwriting of the
file by concurrent tests.
Most of its usage can be replaced by JpaIntegrationTestExtension. In
places where specific GAE APIs are still needed, namely when the pull queue
or the User service is used, two simplified extensions are used, which
makes them much easier to identify when the APIs are no longer used.
We've been using it in production for three weeks now. Everything seems
to be working fine. Removing the code related to checking the migration
state and using the override.
This will replace the ExpandRecurringBillingEventsAction, which has a
couple of issues:
1) The action starts with too many Recurrings that are later filtered out
because their expanded OneTimes are not actually in scope. This is due
to the Recurrings not recording their latest expanded event time, and
therefore many Recurrings that are not yet due for renewal get included
in the initial query.
2) The action works in sequence, which exacerbated the issue in 1) and
makes it very slow to run if the window of operation is wider than
one day, which in turn makes it impossible to run any catch-up
expansions with any significant gap to fill.
3) The action only expands the recurrence when the billing time becomes
due, but most of its logic works on event time, which is 45 days
before billing time, making the code hard to reason about and
error-prone. This has led to b/258822640, where a premature
optimization intended to fix 1) caused some autorenewals to not be
expanded correctly when subsequent manual renews within the autorenew
grace period closed the original recurrence.
As a result, the new pipeline addresses the above issues in the
following way:
1) Update the recurrenceLastExpansion field on the Recurring when a new
expansion occurs, and narrow down the Recurrings in scope for
expansion by only looking for the ones that have not been expanded for
more than a year.
2) Make it a Beam pipeline so expansions can happen in parallel. The
Recurrings are grouped into batches in order to not overwhelm the
database with writes for each expansion.
3) Create new expansions when the event time, as opposed to billing
time, is within the operation window. This streamlines the logic and
makes it clearer and easier to reason about. This also aligns with
how other (cancellable) operations for which there are accompanying
grace periods are handled, when the corresponding data is always
speculatively created at event time. Lastly, doing this negates the
need to check if the expansion has finished running before generating
the monthly invoices, because the billing events are now created not
just-in-time, but 45 days in advance.
Note that this PR only adds the pipeline. It does not switch the default
behavior to using the pipeline, which is still done by
ExpandRecurringBillingEventsAction. We will first use this pipeline to
generate missing billing events and domain histories caused by
b/258822640. This also allows us to test it in production, as it
backfills data that will not affect ongoing invoice generation. If
anything goes wrong, we can always delete the generated billing events
and domain histories, based on the unique "reason" in them.
This pipeline can only run after we switch to use SQL sequence based ID
allocation, introduced in #1831.
We can use the saved refresh token associated with the nomulus tool to
request an ID token with an audience of the IAP client in order to
satisfy IAP with the Nomulus tool.
Note: this requires that the user of the Nomulus tool, e.g.
"gbrodman@google.com" has a User object stored in SQL.
Tested on QA
This is similar to where we store the SQL files for beam pipelines, and
frankly makes more sense. Also streamlined the use of the API to read
SQL files from a jar.
The AppEngine thread factory is only useful if we can't create our own
(this is no longer the case) or if we need access to AppEngine APIs
(this is no longer the case).
The Concurrent class is only used by the DNS writer and the
CreateGroupsAction.
This parameter is misleading and does not do what it purports to do.
Namely, it does not impact the level of parallelism. Given the input n for this
parameter, and m for the batch size, the elements are divided (keyed) into n
groups, each of which is then spread evenly across all threads and is
eventually batched into batches of size m:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L227
This is also evident in the implementation itself, where the ShardedKey
is determined by the unique number for a worker/thread combo and the
original key:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L268
Using a more concrete example, suppose we have 100 elements and 10
worker threads, with a target batch size of 5. If the "shard" number is set to
1, we first spread the 100 elements across 10 threads, resulting in 10
elements per thread, each thread then batches the elements into 2
batches of size 5.
If the "shard" number is set to 2, the 100 elements are first divided into 2
"shards" of 50 each. Each "shard" is then distributed within the 10
threads, resulting in 5 elements per "shard" per thread. They then get
turned into 1 batch per "shard" per thread. In the end, each thread
still processes 2 batches, even though they are from 2 different "shards".
Therefore this "shard" number does not perform horizontal partitioning
that one normally associates with sharding, and provides no
performance benefits but rather confuses the user.
It is also suggested that using withShardedKey() alone is already
sufficient to achieve auto-sharding within the keyed group. There is no
need to manually divide the input by keying them differently based on
the "shard" number specified:
https://youtu.be/jses0W4Zalc?t=967
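In other words, something like the following (a sketch against the Beam Java SDK, with illustrative names) is sufficient; no manual shard key is needed:
```java
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class BatchingSketch {
  // withShardedKey() lets the runner auto-shard hot keys, so pre-splitting the
  // input into N artificial "shard" keys adds nothing.
  static void batch(PCollection<KV<String, Long>> keyedIds, int batchSize) {
    keyedIds.apply(
        "BatchIds", GroupIntoBatches.<String, Long>ofSize(batchSize).withShardedKey());
  }
}
```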
Some retriers are no longer needed because transactions are
automatically retried by the JPA transaction manager when there's a
transient exception.
We no longer use App Engine Remote API as of #1858.
The pipeline endpoint is only for GAE mapreduce, which we stopped using
a while ago.
TESTED=deployed to alpha and used nomulus tool built from master to
connect to the tools service.
* Add defaultPromoTokens to Registry
* Remove flyway files from this PR
* Fix merge conflicts
* Add back flyway file
* Add more info to error messages
* Change to a list
* Fix javadoc
* Change error message
* Add note to field declaration
These were always properly reflected in the actual creations, but when
running a check flow it would still show a non-zero cost even when using
an ANCHOR_TENANT allocation token. This changes it so that we accurately
show the $0.00 cost.
This is only used for contacting Datastore. With the removal of:
1. All standard usages of Datastore
2. Usage of Datastore for allocation of object IDs
3. Usage of Datastore for GAE user IDs
we can remove the remote API without affecting functionality.
This also allows us to just use SQL for every command since it's lazily
supplied. This simplifies the SQL setup and means that we remove a
possible situation where we forget the SQL setup.
So long, farewell, adios, ciao, sayonara, 再见!
TESTED=deployed to alpha and used `nomulus list_tlds` to confirm that the web app can receive and serve requests.
With newer versions of Java 11, javadoc fails to build due to unknown
tags in package-info.java files. These files are not important so we
exclude them.
Switch to using the login email address instead of GAE user ID to
identify console users. The primary use cases are:
1) When the user logs in to the registrar console, we need to figure out
which registrars they have access to (in
AuthenticatedRegistrarAccessor).
2) When a user tries to apply a registry lock, we need to know if they
can (in RegistryLockGetAction).
Both cases are tested in alpha with a personal email address to ensure
it does not get the permission due to being a GAE admin account.
Also verified that the soy templates include the hidden login email
form field instead of the GAE user ID when registrars are displayed on the
console; consequently, when a contact update is posted to the server,
the login email is part of the JSON payload, even though it does not
look like it is used in any way by RegistrarSettingsAction, which
receives the POST request. Like the GAE user ID, the field is hidden, so it
cannot be changed by the user from the console; it is also not used to
identify the RegistrarPoc entity, whose composite key is the contact
email and the registrar ID associated with it.
The login email address is backfilled for all RegistrarPocs that have a
non-null GAE user ID. The backfilled addresses convert to the same IDs
as those stored in the database.
It doesn't actually use any App Engine libraries or code -- it's just a
generic connection with authentication to a service. This also involves
changing that block of config to be "gcpProject" instead of "appEngine"
since it's more generic.
Note: this will require an internal PR as well to change the
corresponding private config block
* Remove package token on manual transfer approval
* remove extra variables
* Add back white space
* Don't overwrite existingDomain
* Format fixes, use available helper variables
* Use PACKAGE allocation tokens in tests
* Refactor
* Fix merge conflicts
* Don't overwrite existingRecurring
* Remove names from packages on automatic transfers
* Add more tests
* Remove unnecessary local variable
* Eliminate unnecessary api call
* Reformat if blocks
* Don't overwrite existingRecurring
This PR fixes the issue where the onetime billing event for an autorenew
is not correctly created if the recurrence of the autorenew is closed
during the autorenew grace period, such as the case if a manual renew
happens during the same grace period.
The detailed analysis of the issue is captured in b/258822640. Note that
this is a quick and dirty fix to make ongoing billing event expansion work
correctly in the future. It does not fix the missing events in the past,
nor can it be used to reconstruct the missing ones (by providing a
different cursor time), due to timeout when triggering the action from
nomulus curl.
Per Weimin, the recurrences that fit the new condition alone, based on
the current cursor, would increase from 382k to 430k, a 12% increase. I
checked last night's cron job run, which started at 22:00 EST and seemed
to finish at 22:15 EST (when the last log for this request was
recorded), so it should definitely still finish in time for the nightly
runs with the new condition.
This is not a complete removal of ofy as we still have a dependency on it
(GaeUserIdConverter). But this PR removes it from a lot of places where
it's no longer needed.
* Fix Gradle dependency version pinning
In Gradle 7, version labels require '!!' at the end to be free from
any forced upgrade.
Hibernate min version needs to be advanced past 5.6.12, which is buggy.
Upgraded most dependencies to the latest version.
Therefore this PR also removed several classes and related tests that
support the setup and verification of Ofy entities.
In addition, support for creating a VKey from a string is limited to
VKey<? extends EppResource> only because it is the only use case (to
pass a key to an EPP resource in a web safe way to facilitate resave),
and we do not want to keep an extra simple name to class mapping, in
addition to what persistence.xml contains. I looked into using
PersistenceXmlUtility to obtain the mapping, but the xml file contains
classes with the same simple name (namely OneTime from both PollMessage
and BillingEvent). It doesn't seem like a worthwhile investment to write
more code to deal with that, when the fact is that we only need to
consider EppResource.
* Remove Cloud KMS from Nomulus Server
Removed Cloud KMS from Nomulus (:core) since it is no longer used.
Renamed remaining classes to reflect their use of the SecretManager.
Updated the config instructions to use a new codename for the keyring:
KMS to CSM. This PR works with both codenames. Will drop 'KMS' after
the internal repo is updated.
Add an implementation of delegated credentials without using a downloaded private key.
This is a stop-gap implementation while waiting for a solution from the Java auth library.
Also added a verifier action to test the new credential in production. Testing is helpful because:
- Configuration is per-environment; therefore, success in alpha does not fully validate prod.
- The relevant use case is triggered by low-frequency activities, so problems may not show up for hours or longer.
This PR removes all Ofy related cruft around `HistoryEntry` and its three subclasses in order to support dual-write to datastore and SQL. The class structure was refactored to take advantage of inheritance to reduce code duplication and improve clarity.
Note that for the embedded EPP resources, either their columns are all empty (for pre-3.0 entities imported into SQL), including their unique foreign key (domain name, host name, contact id) and the update timestamp; or they are filled as expected (for entities that were written since dual writing was implemented).
Therefore the check for foreign key column nullness in the various `@PostLoad` methods in the original code is a no-op, as the EPP resource would have been loaded as null. In other words, there is no case where the update timestamp is null but other columns are not.
See the following query for the most recent entries in each table where the foreign key column or the update timestamp are null -- they are the same.
```
[I]postgres=> select MAX(history_modification_time) from "DomainHistory" where update_timestamp is null;
max
----------------------------
2021-09-27 15:56:52.502+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "DomainHistory" where domain_name is null;
max
----------------------------
2021-09-27 15:56:52.502+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "ContactHistory" where update_timestamp is null;
max
----------------------------
2021-09-27 15:56:04.311+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "ContactHistory" where contact_id is null;
max
----------------------------
2021-09-27 15:56:04.311+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "HostHistory" where update_timestamp is null;
max
----------------------------
2021-09-27 15:52:16.517+00
(1 row)
[I]postgres=> select MAX(history_modification_time) from "HostHistory" where host_name is null;
max
----------------------------
2021-09-27 15:52:16.517+00
(1 row)
```
These alternative ORMs are introduced as a way to make querying large numbers of
domains and domain histories more efficient through bulk loading from several
to-be-joined tables separately, then in-memory re-assembly of the final entity,
bypassing the need to query multiple tables each time an entity is queried.
Their primary use case was loading these entities for comparison between
Datastore and SQL during the migration, which has been completed. The
code remains unused as of now, and its existence makes refactoring and
general maintenance more complicated than necessary due to the need to keep
it up to date.
Therefore we remove the related code.
Also removed the ability to disable update timestamp auto update as it
was only needed during the migration.
Lastly, rectified the use of raw Coder in RegistryJpaIO.
Passing in an already-existing instance is an antipattern because it can
lead to race conditions where something else modified the object in
between when the pipeline loaded it and when you're saving it. The Write
action should only be writing new entities.
We cannot check IDs for the objects (some IDs are not autogenerated so
they might exist already). We also cannot call `insert` on the objects
because the underlying JPA `persist` call adds the input object to the
persistence context, meaning that any modifications (e.g.
updateTimestamp) are reflected in the input object. Beam doesn't allow
modification of input objects.
* Does not self allocate IDs in Beam by default.
Per b/250948425, it is dangerous to implicitly allow all Beam pipelines
to create buildables by self allocating the IDs. This change makes it so
that one has to explicitly request self allocation in Beam.
A boolean is added to the pipeline option so that it can be passed to
the beam worker initializer that controls the behavior of the JVM on
each worker. Note that we did not add the option in the metadata.json file
because we did not want people to use the override at run time when launching
a pipeline, due to the risk. As shown in RdePipeline.java, we instead
explicitly hard-code the option in the pipeline. There is nothing that
stops one from supplying that option when launching the pipeline, but it's
not advised.
Tested=deployed the pipeline to alpha and ran it.
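The shape of such an option, as a hedged illustration (the names are not the actual Nomulus ones):
```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// A boolean pipeline option defaulting to false, set explicitly in pipeline code
// rather than exposed through metadata.json.
public interface IdAllocationOptionsSketch extends PipelineOptions {
  @Description("Whether Beam workers may self-allocate entity IDs.")
  @Default.Boolean(false)
  boolean getSelfAllocateIds();

  void setSelfAllocateIds(boolean value);
}
```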
* Properly create and use default credential
This PR consists of the following changes:
- Stopped adding scopes to the default credential when using it to access other
non-workspace GCP APIs. Scopes are not needed here.
- Started applying scopes to the default credential when using it to access
Drive and Sheets APIs.
- Upgraded Drive access from the deprecated credential lib to the
up-to-date one
- Switched Sheet access from the exported json credential to the
scoped default credential.
This PR requires that the affected files be writable to the default
service account (project-name@appspot.gserviceaccount.com) of the
project.
- This is already the case for exported files (premium terms, reserved
terms, and domain list).
- The registrar sync sheets in alpha, sandbox, and production have been
updated with the new permissions.
All impacted operations have been tested in alpha.
* Properly create and use default credential
This PR consists of the following changes:
- Added a new method to generate scope-less default credential when using it to
access other non-workspace GCP APIs. Scopes are not needed here.
- Started to use the new credential in the SecretManager.
- Will migrate other usages to this new credential gradually.
- Marked the old DefaultCredential as deprecated.
- Started applying scopes to the default credential when using it to access Drive
and Sheets APIs.
- Upgraded Drive access from the deprecated credentials lib
- Switched Sheet access from the exported json credential to the scoped
default credential.
This PR requires that the affected files be writable to the default service
account (project-name@appspot.gserviceaccount.com) of the project.
- This is already the case for exported files (premium terms, reserved terms,
and domain list).
- The registrar sync sheets in alpha, sandbox, and production have been
updated with the new permissions.
All impacted operations have been tested in alpha.
This reverts commit 1ab077d267.
Apparently the new version of Spinnaker that is compatible with this doesn't
work for our release, so we need to roll this back for now. (Again!)
We started getting failures because some of the tests used October. In
general we should freeze the clock for testing as much as possible.
Same thing with the Get*Commands
Basically, any time we're loading a bunch of linked objects that might
change, we want to have REPEATABLE_READ so that another transaction
doesn't come along and smush whatever we think we're loading.
The following instances of READ_COMMITTED haven't changed:
- RdePipeline (it only loads immutable objects like histories)
- Invoicing pipeline (only immutable objects like BillingEvents)
- Spec11 (doesn't use any linked info from Domain)
This also changes the PersistenceModule to use REPEATABLE_READ by
default on the replica JPA TM, for the standard reasoning.
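For reference, a REPEATABLE_READ default can be expressed with a standard Hibernate property; this is a sketch, not the exact PersistenceModule wiring:
```java
import java.sql.Connection;
import java.util.Properties;

class IsolationSketch {
  static Properties replicaJpaProperties() {
    Properties props = new Properties();
    // JDBC isolation level 4 = REPEATABLE_READ.
    props.put(
        "hibernate.connection.isolation",
        String.valueOf(Connection.TRANSACTION_REPEATABLE_READ));
    return props;
  }
}
```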
This runs over all domains that weren't deleted as of September 5. This
will fix most of b/248112997, which is itself caused by b/245940594 --
creating synthetic history objects means that the RDE pipeline will look
at those instead of the potentially-no-longer-valid data in the old
history objects.
Also fixed a bug introduced in #1785 where identity checks were performed instead of equality checks. This resulted in two sets containing the same elements not being regarded as equal, and subsequent DNS updates being unnecessarily enqueued.
The old class is modeled after Datastore, with some logic jammed in for it to work with SQL as well. As of #1777, the ofy-related logic is deleted; however, the general structure of the class remained Datastore-oriented.
This PR refactors the existing class into a ForeignKeyUtils helper class that does away with the index subclasses and provides static helper methods to do the same, in a SQL-idiomatic fashion.
Some minor changes are made to the EPP resource classes to make it possible to create them in a SQL only environment in tests.
`TransferData` is currently a generic class with a complicated type parameter that designates the `Builder` class of its concrete subclass, in order to facilitate returning the said `Builder` from an instance loosely typed to the superclass (`TransferData`) itself.
While this works, in almost all places where a `TransferData` is used, the raw, un-generic type is declared, resulting in a lot of warnings, not to mention the fact that type safety is not actually checked when the raw type is used.
In this PR, we make it so that the concrete `Builder` is returned through a protected abstract method that is implemented by the subclasses. The type information therefore no longer needs to be embedded in the superclass type signature, and reflection is not necessary to create the `Builder` either. Overall, it makes `TransferData` a much cleaner class without the messiness of generics.
This uses the GoogleIdTokenVerifier to verify ID tokens passed in
(presumably from a front end) via cookies. This isn't used anywhere yet
but it will be used for front-end API calls for the new console.
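A minimal sketch of the verification step; the client ID handling and cookie plumbing are assumptions, not the console's actual wiring:
```java
import com.google.api.client.googleapis.auth.oauth2.GoogleIdToken;
import com.google.api.client.googleapis.auth.oauth2.GoogleIdTokenVerifier;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import java.util.Collections;

class IdTokenVerifySketch {
  // Returns the verified email address, or null if the token fails verification.
  static String verifiedEmail(String idTokenString, String oauthClientId) throws Exception {
    GoogleIdTokenVerifier verifier =
        new GoogleIdTokenVerifier.Builder(
                new NetHttpTransport(), GsonFactory.getDefaultInstance())
            .setAudience(Collections.singletonList(oauthClientId))
            .build();
    GoogleIdToken idToken = verifier.verify(idTokenString);
    return idToken == null ? null : idToken.getPayload().getEmail();
  }
}
```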
First, we create a sequence of User IDs in Postgres and assign it to the
User ID field, meaning that Hibernate can autogenerate IDs.
Next, add an update timestamp.
Next, add a constraint that we can't have multiple Users with the same
email address.
Finally, create a DAO since we'll usually want to query by that email
address (at least for now).
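In JPA terms, the mapping looks roughly like the following; the class, sequence, and constraint names are placeholders, not the actual schema identifiers:
```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;
import javax.persistence.Table;
import javax.persistence.UniqueConstraint;

@Entity
@Table(uniqueConstraints = @UniqueConstraint(columnNames = {"emailAddress"}))
class ConsoleUserSketch {
  @Id
  @SequenceGenerator(name = "user_id_seq", sequenceName = "user_id_seq", allocationSize = 1)
  @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "user_id_seq")
  Long id; // autogenerated by Hibernate from the Postgres sequence

  String emailAddress;
}
```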
FKI used to be persisted in datastore to help speed up loading by foreign key.
Now it is just a helper class to do the same thing in SQL because
indexing is natively supported in SQL.
- Create a sequence to generate IDs for the user (this allows us to have
Long ID types so that Hibernate can autogenerate IDs)
- Add an update timestamp column so we can extend BackupGroupRoot
- Add a restriction that there can't be multiple users with the same
email address
* Rename ContactResource -> Contact
This is a follow-up to PR #1725 and #1733. Now all EPP resource entity class
names have been rationalized to match with their SQL table names.
This means that LegacyAuthenticationMechanism (or a to-be-created
OAuth2AuthenticationMechanism) can return a UserAuthInfo object that
contains either the GAE User or the console User as appropriate. The
goal is that the non-auth flows shouldn't have to know about which user
type it is. Note: the registry lock flow (for now) needs to know about
the separate types of auth because it is a separate level of auth from
the standard AuthenticatedRegistrarAccessor.
The AuthenticatedRegistrarAccessor code is a bit odd because the new
role system doesn't quite fit neatly into the old registrar ->
OWNER,ADMIN system but this is a fine approximation. Basically, any
new registrar role will map to the old OWNER role.
* Add the PackagePromotion table
* Add long id
* Add NOT NULL
* fix formatting
* make package price non null
* Add not nulls to java file
* Fix broken tests from merge conflicts
Also deletes the autorenew poll message history revision id field in
Domain, which is only needed to recreate the ofy key for the poll
message. The column already contains null values in it, making it
impossible to depend on it. The column itself will be deleted from the
schema after this PR is deployed.
The logic to update the autorenew recurrence end time is changed
accordingly: When a poll message already exists, we simply update the
end time, but when it no longer exists, i.e. when it's deleted
speculatively after a transfer request, we recreate one using the
history entry ID that resulted in its creation (e.g. a cancelled or rejected
transfer).
This should fix b/240984498. Though the exact reason for that bug is
still unclear to me. Namely, it throws an NPE at this line during an
explicit domain transfer approval:
https://cs.opensource.google/nomulus/nomulus/+/master:core/src/main/java/google/registry/flows/domain/DomainFlowUtils.java;l=603;bpv=1;bpt=0;drc=ede919d7dcdb7f209b074563b3d449ebee19118a
The domain in question has a null autorenewPollMessageHistoryId, but
that in itself should not have caused an NPE because we are not
operating on the null pointer. On that line the only possible way to
throw an NPE is for the domain itself to be null, but if that were the
case, the NPE would have been thrown at line 599 where we called a
method on the domain object.
Regardless of the cause, with this PR we are using an explicitly
provided history id and checking for its nullness before using it. If a
similar issue arises again, we should have a better idea why.
Lastly, the way the poll message ID is constructed is largely simplified in
PollMessageExternalKeyConverter as a result of the removal of ofy parent
keys in PollMessage. This does present a possibility of failure when,
immediately before deployment, a registrar requests a poll message and
receives the old ID, but by the time the registrar acks the ID, the new
version is deployed and therefore does not recognize the old key. The
likelihood of this happening should be slim, and we could have prevented
it by letting the converter recognize both the old and the new key.
However, we would like to eventually phase out the old key, and in
theory a registrar could ack a poll message at any time after it was
requested. So, there is not a safe time by which all the old ids are
acked, lest we develop some elaborate scheme to keep track of which
messages were sent with an old id when requested and which of these old
ids are acked. Only then can we be truly safe to phase out the old id.
The benefit does not seem to warrant the effort. If a registrar does
encounter a situation like this, they could open a support bug to have
us manually ack the poll message for them.
Basically, what's happening here is that some flow tests are adding
things to the claims list cache which is stored statically, meaning that
some other tests can pick those up when they shouldn't. By adding the
extension in RFTC, it'll clear out the caches after each test.
* Allow anchor tenant creation via allocation token behavior
This also enforces that non-superusers cannot create registrations on
trademarked names prior to the sunrise period, even if they have an
allocation token with ANCHOR_TENANT behavior.
* Create a registry lock permission and corresponding account manager role
This allows us to distinguish between standard account managers and
users that might have the registry lock permission. This will make the
registry lock password-setting flow easier (user can reset their
password iff they have the REGISTRY_LOCK permission, instead of having a
separate boolean) and allows us to easily determine whether or not a
user should have access to registry lock views in the UI.
We cache the ClaimsList Java object for six hours (we don't expect it to
change frequently, and the cron job to update it only runs every twelve
hours). Subsequent calls to ClaimsListDao::get will return the cached
value.
Within the ClaimsList Java object itself, we cache any labels that we
have retrieved. While we already have a form of a cache here in the
"labelsToKeys" map, that only handles situations where we've loaded the
entire map from the database. We want to have a non-guaranteed cache in
order to make repeated calls to getClaimKey fast.
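A sketch of the outer six-hour cache using Caffeine; the loader and key are placeholders for the real ClaimsListDao internals, and ClaimsList refers to the existing entity class:
```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

class ClaimsListCacheSketch {
  // Placeholder for the real database query.
  static ClaimsList loadClaimsListFromDatabase() {
    throw new UnsupportedOperationException("illustrative only");
  }

  // Entries expire six hours after being written, matching the description above.
  static final LoadingCache<String, ClaimsList> CACHE =
      Caffeine.newBuilder()
          .expireAfterWrite(Duration.ofHours(6))
          .build(key -> loadClaimsListFromDatabase());

  static ClaimsList get() {
    return CACHE.get("claims-list"); // single dummy key
  }
}
```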
* Use the new renewal price logic in transfer flow
* Fix build
* Add renewal handling on all transfer flows
* Merge branch 'master' into transfer-retain-renewal-price
* Merge branch 'master' into transfer-retain-renewal-price
* Add more tests
* Add base object classes for new user/role permissioning model
- Adds the permissions themselves
- Adds the six roles that a user may have -- three global, three
per-registrar
- Adds the mapping from role -> set of permissions
- Adds a UserRoles object to encapsulate the answer to the question of
"does this user have this permission?"
- Adds a User class as a base to show how we will use the new UserRoles
object
core.jar is no longer included in the generated WAR, even though the deploy_jar
configuration is specified as a dependency.
I could not figure out a way to tweak the configuration dependency to
have core.jar pulled into the .war, so I decided to just explicitly pick
it from its known location.
TESTED=deployed to alpha and verified that the instances can start.
* Rename DomainContent -> DomainBase
This is a follow-up to PR #1725 which renamed DomainBase to Domain. Now, the
class naming hierarchy has the same structure as ContactBase/HostBase.
* Rename DomainBase -> Domain
This was a long time coming, but we couldn't do it until we left Datastore, as
the Java class name has to match the Datastore entity name.
Subsequent PRs will rename ContactResource to Contact and HostResource to Host,
so that everything matches the SQL table names (and is shorter!).
* Merge branch 'master' into rename-domainbase
The fix is based on b/240627423. I tested locally and was able to build
with the -PenableCrossReferencing=true flag successfully.
TESTED=run the kythe GCB pipeline locally.
This PR turns out to be more massive than I would have liked, but it
deals with all billing event related stuff, which is more or less all
intertwined:
* Remove all billing events as Ofy entities.
* Add a temporary annotation to allow BillingEvent's ID to be
auto-allocated by ofy while not lacking the Ofy @Id annotation.
* Remove Modification, which is only used in ofy.
* Remove BillingVKey, as we do not need to store the ofy key parent
information anymore. The VKey for a billing event now just contains
its primary key, and can be converted by VKeyConverter.
* Remove BigQuery related code in the billing pipeline.
Note that after BillingVKey is removed, several columns in
BillingCancellation are no longer needed. The change to database schema
will be handled in https://github.com/google/nomulus/pull/1721 after
this PR is deployed to production.
* Revert "Upgrade App Engine Standard to Java 17 w/ bundled APIs (#1714)"
This partially reverts commit d8e77e2ab2 (it keeps
intact unrelated version upgrades).
We need to temporarily revert this because Spinnaker isn't quite yet playing
nice with the new <app-engine-apis> configuration option in appengine-web.xml
(it seems like this was added recently and Spinnaker is still stuck on App
Engine SDK version 1.9.82 which predates it). Hopefully we can get that
dependency updated in Spinnaker soon and then we can re-upgrade to Java 17.
* Flyway files for adding allocationToken column to Domain table
* Rename column to current_package_token
* Update er diagram
* Add foreign key to DomainHistory
This shouldn't matter for billing or anything like that because the
actual actions performed that month are still correct, but before this
PR we're including all domains ever created in the total_domains number,
including deleted domains
* Upgrade App Engine Standard to Java 17 w/ bundled APIs
Note that this doesn't yet upgrade our actual Gradle scripts to use a more
recent version of Java (that will happen separately); this solely affects the GAE
instances.
I followed the instructions here:
https://cloud.google.com/appengine/docs/standard/java-gen2/services/access
And note that I removed threadsafe true from appengine's XML config because
that doesn't do anything anymore and was just throwing errors (the new
instances handle multiple requests in parallel by default, no configuration
necessary).
* Convert to gradle 7.
* More fixes, regenerated lockfiles.
* Update lockfiles for dependency update.
* Fix show_upgrade_diff for new lockfile format
* Add property for allowInsecureProtocol
Allow us to override the restriction against use of plain HTTP for
communication to dependency repositories. We need this to be able to use a
local proxy for dependency gathering.
* Checking in missing gradle.lockfile
* Split failing dns update batches and kill after 10 retries
* format fixes
* Add another test
* Switch to CloudTasks
* Change back to app engine header
* Change to immutableList and other changes
* Change to optional header
* Add bug ID to todo
* Switch to constructor injection
* Remove old queue
* Set response status
* Change to Optional<Integer>
* Rename action status
* Switched to use CloudTasksHelper
* Remove spy in test
This is, as of now, unused but we can use it for b/237683906 and
b/237800445 in the future to allow for special behavior dictated by
allocation tokens rather than having to reserve specific domains.
Note that we enforce a tied domain for ANCHOR_TENANT tokens (because
they should be matched to a domain) but not for BYPASS_TLD_STATE tokens.
This deletes LastSqlTransaction and ReplayGap in Datastore, as well as
SqlReplayCheckpoint and TransactionEntity in SQL. These are all related
to replay and are no longer used.
* Remove columns for unused ofy key reconstitution
Remove the columns in Domain, DomainHistory, GracePeriod and
GracePeriodHistory that were only used for Ofy key reconstitution.
All uses of these were removed in #1660.
* Add forgotten flyway file.
* Re-add database migration state commands
These were removed in PR #1661, but we do still need them for the time being
until we complete the ID migration as well.
This uses the Apache commons CSV parsing instead of rolling our own.
Annoyingly, the results that we're given aren't exactly proper CSVs
since they have a non-standard line of data at the top, and the header
is actually the second line.
Fortunately this no longer requires a log-in, we can just send a GET
request and receive a CSV result in return.
This also adds the apache-commons CSV parser to the dependencies
See https://b.corp.google.com/issues/237784559 for more details
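A hypothetical helper showing the parsing approach with Apache Commons CSV (drop the non-standard first line, then treat the next line as the header):
```java
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

class ReportCsvSketch {
  static List<CSVRecord> parseReport(String rawCsv) throws IOException {
    // Skip the non-standard first line; the real header is on the second line.
    String withoutPreamble = rawCsv.substring(rawCsv.indexOf('\n') + 1);
    return CSVFormat.DEFAULT
        .withFirstRecordAsHeader()
        .parse(new StringReader(withoutPreamble))
        .getRecords();
  }
}
```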
* Resolve conflict
* Fix setup for existing test cases in info and check flow
* Revise info flow test cases
* Fix lint
* Merge branch 'master' into handlefeerequest-renew
* Address code review comments myself
* Merge branch 'master' into handlefeerequest-renew
* Get test passing
* Add check flow tests
* Format, consolidate test helpers
* Don't unnecessarily specify XML name
Now that ofy keys aren't necessarily being restored, it seemed prudent to
audit existing uses to verify that we aren't relying on any keys that may now
be null.
This fixes one case that appeared to be potentially problematic (in
ResourceFlowUtils), removes a few methods that call getOfyKey() but are no
longer used, and adds comments to one use of the key that appears to be safe
after visual inspection.
This includes:
- deletion of helper DB methods in tests
- deletion of various old Datastore-only classes and removal of any
endpoints
- removal of the dual-database test concept
- removal of 'ofy' from the AppEngineExtension
This removes the code that converts between ofy fields and SQL fields in DomainContent and a number of related core classes (basically anything that also needed modification to support the removal from DomainContent).
Cursor was originally envisioned to support arbitrary ImmutableObject
scopes. However, in practice only the Registry scope is used. The SQL
representation of Cursor assumes that and the schema uses a composite ID
with a string column for the primary key of the scope object. Without a
schema migration to persist the VKey of the scope, we cannot support any
ImmutableObject other than those with a primitive string primary key.
Given the complexity involved and the limited use case, the scope is now
explicitly limited to Registry only.
Also removed mapreduces that depend on Ofy keys of Cursors, and made
some code quality improvements based on IntelliJ suggestions on modified
files.
These tests use Ofy exclusively and should not run anymore, as any class
they test also uses Ofy and should be deleted.
More importantly, running tests in Ofy mode makes it hard to remove Ofy
from entities, especially Registrar and RegistrarContact, as some of
them are created as canned data when tests are initiated, and the
creation would fail if they are not registered as Ofy entities.
It is therefore a prerequisite to disable these tests before we can
remove Ofy from those entities. We could have deleted them, but I think
that should be done when the corresponding classes tested by them are
deleted.
GPG1 is deprecated and stuck in v1.4 from 2018. GPG2 is recommended. We
only use the GPG binary in tests and when the host system has both
versions it causes problems because we hardcode the GPG import command
in GpgSystemCommandExtension to use the binary named "gpg", which could
be linked to either GPG1 or GPG2, causing tests to fail when
the version of GPG that runs in tests is incompatible with the version of GPG
that imports the keys.
With this PR we only support GPG2 from now on.
Added methods that exist in AppEngineExtension that preload some canned
data. This data is loaded by default and a lot of tests rely on them. As
we migrate away from App Engine, it is helpful to have them in the JPA
test extension which will replace the app engine extension that is
used to set up the database in dual database tests.
I'm not 100% sure that this is strictly necessary, but for now we can
replicate the ability to generate zonefiles for any point in time in the
recent past.
* Migrate ReadDnsQueueAction to use CloudTasksUtils
Also marked TaskQueueUtils as deprecated and fixed a few linter errors.
Note that DNS pull queue still requires the use of the GAE Task Queue API.
* Fix a test failure
* Remove TaskQueueUtils from VKeyTest
* Remove the @error exception that was inadvertently pulled in
One of the more significant changes introduced in this PR is that we use
SQL as the backing database in all tests unless otherwise specified,
e.g. by using the TmOverrideExtension. We change various ofy-related
tests to use this.
This includes various changes:
- Deletion of SqlEntity/DatastoreEntity and related classes. Includes
any necessary changes because of that (e.g. getting a nice SQL key on
error in RegistryJpaIO).
- Deletion of classes that used libraries from the init-sql code
(RefreshDnsOnHostRenameAction)
- Removal of the JpaTransactionManager's backup implementation
- Modification of RegistryJpaWriteTest to not use init-sql code
- Removal of the Transaction class and related classes, however it does
not remove the TransactionEntity class as that would require DB
changes
- Removal of anything related to the actual usage of the database
migration schedule or read-only phases
- Various test changes and fixes to account for the differences in SQL
(like how foreign keys need to exist)
This deliberately doesn't do anything to alter the objects actually
stored in the DB yet, just how we use them
This is the only user of the ofy code that will stick around at least
until we move to the new registrar console. By removing references to
the transaction manager, we will be able to delete all the tm code
without interfering with this.
* Remove version pin for java-diff-utils dependency
Latest version of the lib introduces a small behavior change/bug fix.
It no longer ignores empty lines. This actually makes sense.
Update the test data to reflect this change.
The main purpose of this PR is to help debug b/234189023, where a
registrar reported that in sandbox they observed seemingly successful EPP
update responses to delete NS records, which are not actually deleted after
the commands executed.
To actually load the persisted domain resource after an update would
require us to execute another transaction immediately after the update
transaction, and that can only be achieved outside the flow (i.e., in
FlowRunner or EppController) and we need to test for the type of flows
before logging, which seems unnecessarily complex.
For now we are just adding logs inside the update transaction itself to
validate that:
1. The NS records to delete are as expected.
2. The Current NS records are as expected.
3. The new NS records to persist are as expected.
The EPP success reply is the default reply when no errors are thrown in
a transaction. If we see a success reply (which means that the
transaction finished successfully) and the expected logs from the transaction, the
only explanation could be that somewhere in the ORM layer the Java
representation of the entity differs from what is being
presented to the database. I think that signals a much bigger and
fundamental problem, which is quite unlikely given how isolated the
issue under consideration is.
In any case we would like to add the logging functionality in sandbox and ask
the registrar to report again when they see similar issues.
Also made some typo and linting fixes.
* Fix some small transactional issues in SQL mode
These weren't caught until I switched the default database type in tests
to be SQL (separate PR). Fortunately these don't seem to be catastrophic
This includes:
- removing the actions that do the replay
- removing the tests for the replay
- removing the ReplayExtension and adjusting the various tests that used
it appropriately
- removing functionality relating to "things that happen during replay",
e.g. beforeSqlSaveOnReplay
This does not include:
- removing the InitSqlPipeline or similar tasks
- removing e.g. SqlEntity (it's used in other places)
- removing Transforms/RegistryJpaIO and other SQL-pipeline-creation code
* Remove bracket around varname in CloudBuild script
Due to a Spinnaker restriction: it cannot handle variable references where the variable name is surrounded by brackets.
Added the Spinnaker error message to the comments.
This included removing ofy-specific code from various tests. Also, some
of the other tests (e.g. RdapDomainActionTest) had to be configured to
use only SQL -- otherwise, as it currently stands, they were trying to
use ofy.
We also delete the CreateSyntheticHistoryEntriesAction and pipeline
because they're no longer relevant, and impossible to test (the goal of
the actions was to create objects in ofy, which doesn't happen any
more).
We're running into issues pulling 2.1.3 from maven, possibly due to
vulnerabilities in dependencies, so this updates it to the most recent
version of 2.2.6.
* Add a new recurrenceLastExpansion column to the BillingRecurrence table
This will be used to determine which recurrences are in scope for
expansion. Anything that has already been expanded within the past year is
automatically out of scope.
Note that the default value is set to just over a year ago, so that, initially,
everything is in scope for expansion, and then will gradually be winnowed down
over time so that most recurrences at any given point are out of scope. Newly
created recurrings (after the subsequent code change goes in) will have their
last expansion time set to the same as the event time of when the recurring is
written, such that they'll first be considered in-scope precisely one year
later.
* Summarize schema related tests
Document existing schema-related tests including presubmit tests and
the schema-verify predeployment test newly added to Spinnaker.
* Inject a DomainPricingLogic into ExpandRecurringBillingEventsAction
This will be used in other PRs to set the renewal price correctly based on the
renewal price behavior of the BillingRecurrence event.
Note that, in order for this to work, a not-null constraint has been lifted on
the EPP flow state when the DomainPricingCustomLogic is being constructed, as
the pricing here will occur in a backend action outside the context of any EPP
flow.
* Slightly improve performance of ExpandRecurringBillingEventsAction
We don't need to log every single no-op batch of 50 Recurrences that are
processed (considering we have 1.5M total in our system), and we also don't need
to process Recurrences that already ended prior to the Cursor time (this gets us
down to 420k from 1.5M).
* Disable Ofy tests.
This change just turns off the Ofy tests at the root, by removing processing
for dual tests and disassociating the TestOfyOnly annotation from test
annotations.
This is far less comprehensive than #1631, but it's probably worth entering as
a stopgap solution just because it should speed up our test runs and unblock a
lot of other cleanup work.
* Fix DualDatabaseTestInvocationContextProviderTest
* Optimize RDAP entity event query
For each EPP entity, directly load the latest HistoryEntry per event type
instead of loading all events through the HistoryEntryDao.
Although most entities have a small number of history entries, there are
a few entities with many entries, enough to cause OutOfMemory error.
We have backend max-instances set to 100, which apparently exceeds the default
quota for GAE. Add info on updating the quota or changing this parameter to
the configuration doc.
* Add batching to ExpandRecurringBillingEventsAction
It's OOMing on trying to load every single BillingRecurrence that needs to be
expanded simultaneously (which is to be expected). So this processes them in
transactional batches of 50.
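A minimal sketch of that batching shape, assuming Guava's Iterables.partition and a hypothetical per-batch transactional helper:

```java
import com.google.common.collect.Iterables;
import java.util.List;

final class RecurrenceExpansionBatcher {

  private static final int BATCH_SIZE = 50;

  /** Expands recurrences in bounded transactional batches instead of loading them all at once. */
  static void expandAll(List<Long> recurrenceIdsInScope) {
    for (List<Long> batch : Iterables.partition(recurrenceIdsInScope, BATCH_SIZE)) {
      // Hypothetical helper: loads and expands one batch inside its own transaction, so memory
      // use is bounded by the batch size rather than the full result set.
      expandBatchInTransaction(batch);
    }
  }

  private static void expandBatchInTransaction(List<Long> batch) {
    // Placeholder for the real per-batch work.
  }
}
```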
* Cleanup gpg-agent instances and home directories
The GpgSystemCommandException leaks home directories, but more importantly it
leaks gpg-agent instances. This can cause problems with inotify limits, since
the agent seems to make use of inotify. Do a proper cleanup in afterEach().
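Roughly, the cleanup implied here looks like the following sketch, assuming each test gets its own temporary GNUPGHOME (`gpgconf --kill gpg-agent` is the standard GPG2 way to stop the per-home agent):

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class GpgCleanup {

  /** Kills the gpg-agent bound to the given GNUPGHOME and deletes the temporary home directory. */
  static void cleanUp(Path gnupgHome) throws Exception {
    ProcessBuilder killAgent = new ProcessBuilder("gpgconf", "--kill", "gpg-agent");
    killAgent.environment().put("GNUPGHOME", gnupgHome.toString());
    killAgent.start().waitFor();
    // Delete the home directory bottom-up so non-empty directories can be removed.
    try (Stream<Path> paths = Files.walk(gnupgHome)) {
      paths.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
    }
  }
}
```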
* Don't fail if we can't kill the agent
We'll delete the associated code soon enough too, but it's safer to delete the
cron jobs first and run in that state for a week, so we can still run them
manually if need be.
* Reduce the number of manually scaled instances for default/pubapi
This is in the spirit of "not always running significantly over-provisioned",
which helps to save costs and also expose potential scaling issues when they are
still small rather than all at once when they're a big problem.
This can always be reverted if necessary, and can be instantaneously adjusted by
running the `nomulus set_num_instances` command.
* Don't enforce billing account map check on TEST TLDs
This was affecting monitoring (i.e. prober TLDs). Note that test TLDs are
already excluded from the billing account map check in the Registrar builder()
method (see PR #1601), but we forgot to make that same test TLD exclusion in the
EPP flows check (see PR #1605).
* Add test for Java 8 Compatibility
Add a test to check for Java 8 compatibility of jars deployed to
AppEngine.
It is not enough to run existing tests with Java 8 VM, since many API
jars are not exercised by tests. For example, those for GCP services
like the SecretManager.
We take the conservative approach and verify that every class in every
jar is compiled for Java 8.
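A minimal sketch of such a check (names are illustrative): read the header of every .class entry and assert that the major version is at most 52, which corresponds to Java 8.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

final class Java8CompatibilityCheck {

  private static final int JAVA_8_MAJOR_VERSION = 52;

  /** Returns true if every class file in the jar is compiled for Java 8 or lower. */
  static boolean isJava8Compatible(JarFile jar) throws IOException {
    for (Enumeration<JarEntry> entries = jar.entries(); entries.hasMoreElements(); ) {
      JarEntry entry = entries.nextElement();
      if (!entry.getName().endsWith(".class")) {
        continue;
      }
      try (InputStream in = jar.getInputStream(entry)) {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt(); // Class files start with 0xCAFEBABE.
        data.readUnsignedShort(); // Minor version (ignored).
        int major = data.readUnsignedShort();
        if (magic != 0xCAFEBABE || major > JAVA_8_MAJOR_VERSION) {
          return false;
        }
      }
    }
    return true;
  }
}
```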
We would like to re-use the build cache when building RCs for different
environments. There's not much practical use in doing a "clean" for
every build when Gradle should be able to figure out which artifacts
need to be rebuilt. It also does not make sense to build each
environment in a separate step, which also introduces redundancy because
not all artifacts are cached across steps. The build cache is enabled by
default.
Lastly, the cache needs to be inside the /workspace folder, which is the
default persisted storage location.
TESTED=tried to build the RCs on alpha and saved about 10 min.
* Add missing transaction for whois lookups
Nameserver whois lookups are failing under SQL for hosts with superordinate
domains because the query in this case is not done in a transaction. We
missed this during testing because a) we didn't have a test for lookups of
hosts with superordinate domains and b) we missed converting
NameserverWhoisResponseTest to a DualDatabaseTest.
This PR fixes the problem and adds the requisite testing.
* Use a single transaction to get host registrars
* Replace streaming with Maps.toMap()
* Downgrade dependencies that no longer support Java8
Downgrade two dependencies whose latest versions no longer support
java8.
A follow up PR will add java8 compatibility to presubmit tests.
We have recently started to routinely breach the 1h timeout. Increasing
this value to 2h. We should also look into reusing the artifacts when
building RCs for different environments.
* Use Gradle dependency dynamic versioning
Use dynamic versioning for Gradle dependencies when possible.
Please refer to go/dr-dependency-upgrade for more information about the
automation plan.
This PR calls out all dependencies that must be pinned to specific
versions for various reasons. The remaining ones are converted to
open-ended version ranges ("[version_str,)").
* Check PAK on domain create
* Add unit test
* update docs
* Remove unnecessary setup
* Fix blank line
* Add check and test to all relevant flows
* Change error message
* Ignore version UIDs during txn deserialization
When deserializing transactions for replay to datastore, ignore class version
UIDs that don't match those of the local classes and just use the local class
descriptors instead. This is a simple solution for the problem of persisted
VKeys containing references to classes where the class has been updated and
the serial version UID has changed.
Also add a "replay_txns" command that replays the transactions from a given
start point so we can verify all transactions are deserializable.
TESTED:
Ran replay_txns against all transactions on sandbox beginning with
transaction id 1828385, which includes Recurring billing events containing
both the old and the new serial version UIDs.
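The usual shape of this trick, sketched here under the assumption that deserialization goes through a custom ObjectInputStream (not necessarily the exact class used in the codebase):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

/** ObjectInputStream that tolerates serialVersionUID mismatches by trusting the local class. */
final class LenientObjectInputStream extends ObjectInputStream {

  LenientObjectInputStream(InputStream in) throws IOException {
    super(in);
  }

  @Override
  protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
    ObjectStreamClass streamDescriptor = super.readClassDescriptor();
    Class<?> localClass;
    try {
      localClass = Class.forName(streamDescriptor.getName());
    } catch (ClassNotFoundException e) {
      return streamDescriptor; // Not known locally; let the default machinery deal with it.
    }
    ObjectStreamClass localDescriptor = ObjectStreamClass.lookup(localClass);
    if (localDescriptor != null
        && localDescriptor.getSerialVersionUID() != streamDescriptor.getSerialVersionUID()) {
      // UIDs differ: substitute the local descriptor instead of failing with InvalidClassException.
      return localDescriptor;
    }
    return streamDescriptor;
  }
}
```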
* Change check for root directory during rollback
`rollback_tool` tries to infer the root of the nomulus tree by checking for a
directory named "nomulus". This is potentially problematic (and, indeed, was
for me) since there is no guarantee what that directory will be named.
There are a number of features that characterize the root directory. Check
for the presence of the `rollback_tool` wrapper script, as this is both at
root level and tightly coupled to the python code, so hopefully we won't
move it without testing that the script still works.
This will require edits to a substantial number of registrars on sandbox (nearly
all of them) because almost all of them have access to at least one TLD, but
almost none of them have any billing accounts set. Until this is set, any updates
to the existing registrars that aren't adding the billing accounts will cause
failures.
Unfortunately, there wasn't any less invasive foolproof way to implement this
change, and we already had one attempt to implement it on create registrar
command that wasn't working (because allowed TLDs tend not to be added on
initial registrar creation, but rather, afterwards as an update).
* Downgrade Caffeine to 2.9.3
Apparently Caffeine >=3.* requires Java 11, and we're still stuck on Java 8
because of App Engine Standard. Fortunately this doesn't affect the exposed
interface we're using, so we can simply go back to the newest Caffeine version
once Registry 3.0 Phase 3 (GKE migration) is completed.
Now that SQL is the default, we do not need this side job to run
alongside the main one. Its purpose was to validate the BEAM pipeline
while Datastore was primary.
* Create a Dataflow pipeline to resave EPP resources
This has two modes.
If `fast` is false, then we will just load all EPP resources, project them to the current time, and save them.
If `fast` is true, we will attempt to intelligently load and save only resources that we expect to have changes applied when we project them to the current time. This means resources with pending transfers that have expired, domains with expired grace periods, and non-deleted domains that have expired (we expect that they autorenewed).
For some inexplicable reason, the RDE beam pipeline in both sandbox and
production has been broken for the past week or so. Our investigations
revealed that during the CoGroupByKey stage, some repo ID -> revision ID
pairs were duplicated. This may be a problem with the Dataflow runtime
which somehow introduced the duplicate during reshuffling.
This PR attempts to fix the symptom only by deduping the revision IDs. We
will do some more investigation and possibly follow up with the Dataflow
team if we determine it is an upstream issue.
TESTED=deployed the pipeline and successfully ran sandbox RDE with it.
* Begin migration from Guava Cache to Caffeine
Caffeine is apparently strictly superior to the older Guava Cache (and is even
recommended in lieu of Guava Cache on Guava Cache's own documentation).
This adds the relevant dependencies and switches over just a single call site to
use the new Caffeine cache. It also implements a new pattern, asynchronously
refreshing the cache value starting from half of our configuration time. For
frequently accessed entities this will allow us to NEVER block on a load, as it
will be asynchronously refreshed in the background long before it ever expires
synchronously during a read operation.
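A sketch of that pattern with Caffeine (the ten-minute expiry and key/value types are illustrative; the real values come from configuration):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

final class RefreshingCacheExample {

  private static final Duration EXPIRY = Duration.ofMinutes(10);

  private final LoadingCache<String, String> cache =
      Caffeine.newBuilder()
          .expireAfterWrite(EXPIRY)
          // Refresh asynchronously at half the expiry, so hot keys are reloaded in the
          // background long before they would expire and block a reader.
          .refreshAfterWrite(EXPIRY.dividedBy(2))
          .build(RefreshingCacheExample::loadFromDatabase);

  String get(String key) {
    return cache.get(key);
  }

  private static String loadFromDatabase(String key) {
    return "value-for-" + key; // Placeholder for the real load.
  }
}
```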
* Add new columns to BillingEvent.java
* Improve PR and modifyJodaMoneyType to handle null currency in override
* Add test cases for edge cases of nullSafeGet in JodaMoneyType
* Improve assertions
* Remove dos.xml from the configs
We don't have dos config right now, and applying dos from "gcloud app
deploy" is deprecated and has started causing problems.
If we add dos configs, it should be using "gcloud app firewall-rules".
* Build Java8-compatible release
Use the new options.release Gradle property to make sure builds are
compatible with Java 8, which is the runtime on Appengine.
This new property replaces sourceCompatibility, targetCompatibility, and
bootclasspath (the latter wasn't previously set, which is why we
couldn't detect Java 9 API usage when building).
* Ignore read-only when saving commit logs
Ignore read-only when saving commit logs and commit log mutations so that we
can safely replicate in read-only mode. This should be safe, as we only ever
get to the situation of saving commit logs and mutations when something has
already actually been modified in a transaction, meaning that we should have hit
the "read only" sentinel already.
This also introduces the ability to set the Clock in the
TransactionManagerFactory so that we can test this functionality.
* Changes per review
* Fix issues affecting tests
- Restore clobbered async phase in testNoInMigrationState_doesNothing
- Restore system clock to TransactionManagerFactory to avoid affecting other
tests.
* Change billingIdentifier to BillingAccountMap in invoicing pipeline
* Add a default for billing account map
* Throw error on missing PAK
* Add unit test
Check for a PSQLException referencing a failed connection to "google:5433",
which likely indicates that there is another nomulus tool instance running.
It's worth giving this hint because in cases like this it's not at all obvious
that the other instance of nomulus is problematic.
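The check itself is just an exception-chain walk, along these lines (a sketch; the real code looks at the Postgres driver's PSQLException specifically):

```java
import java.sql.SQLException;

final class NomulusToolHints {

  /** Returns true if the failure looks like the local Cloud SQL proxy port is already taken. */
  static boolean looksLikeAnotherToolInstance(Throwable failure) {
    for (Throwable cause = failure; cause != null; cause = cause.getCause()) {
      if (cause instanceof SQLException
          && cause.getMessage() != null
          && cause.getMessage().contains("google:5433")) {
        return true;
      }
    }
    return false;
  }
}
```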
* Add a no-async actions DB migration phase
This needs to be set several hours prior to entering the READONLY stage. This is
not a read-only stage; all synchronous actions under Datastore (such as domain
creates) will continue to succeed. The only things that will fail are host
deletes, host renames, and contact deletes, as these three actions require a
mapreduce to run before they are complete, and we don't want mapreduces hanging
around and executing during what is supposed to be a short duration READONLY
period.
* Use UrlFetch for RDE and default TLS (1.2) for other URL connections
This removes the TLS 1.3-settings in the module providers and,
essentially, reverts the changes in #1535 only to the RdeReporter and
RdeReportActionTest
We have a cron job that runs the RDE upload action every 4 hours for all
TLDs. Normally this should be a no-op because an RDE upload is scheduled
after RDE staging is completed, and when it fails with a non-2XX status it
will retry. However, if for some reason it failed with a 20X status (like
when waiting for the SFTP cursor), it will not retry but instead relies on
the cron job to catch up.
With the BEAM RDE pipeline every staging job saves all its deposits in a
uniquely named folder to avoid the need to use a lock, which is not
practical in BEAM. However the cron job has no way of knowing what the
prefixes are for each TLD so it will fail in SQL mode.
In this PR we implemented logic to guess what the prefix should be and
use it, if we are in SQL mode and a prefix is not provided.
* Set the initial worker count for the RDE beam pipeline at 24
This likely will speed up the pipeline by skipping the initially slow
process of spinning up instances.
* Fix sporadic SQL Snapshot failure
The Postgresql set-snapshot statement (called in
JpaTransactionManager.setDatabaseSnapshot() method) must be the first
statement in the SQL transaction.
Currently the JpaTransaction.transact() method may insert a query for
DatabaseMigrationStateSchedule before the user query when the cache is
empty or the cached value expires.
This PR proactively preloads the cache in RegistryJpaIO to prevent cache
loading inside the transaction.
This PR also changes some DatabaseSnapshotTest tests to be retrying, in
case they run just after the cache expires. (This has happened before in
CI).
The format check script currently outputs "true" if there were files that need
reformatting and "false" if not, which is useful for gradle but less so for
other applications (notably commit hooks). Terminate with an exit code of 1
if the format check fails.
TESTED: Tried this from both a pre-commit hook and from the gradle build.
* Add a "list_txns" to dump Transaction table
Add the list_txns command which can dump the entire contents of the
Transaction table, either in csv format or as human readable transactions.
The CSV format is useful for storing the transaction table at a specific point
in time for later reference without requiring us to repeatedly hit the
replica.
Creating this without tests because this command has a very short shelf-life
and is really only intended to be run by developers. Tested all features
locally.
* Reformatted
* Ignore trivial differences when comparing DB
Some data differences are due to entity model differences and are also
harmless. We should ignore them when comparing Datastore and SQL.
This PR ignores the following diffs:
- null vs empty collection
- the empty string in the Address.street field, which is a list
* Bump flogger and beam dependency versions
Beam 2.34.0 -> 2.37.0
Flogger 0.7.3 -> 0.7.4
Intellij keeps getting confused about which version of Flogger we're
bringing in. Even though we had previously locked Flogger to 0.7.3, for
some reason it was still bringing in the Beam transitive dependency of
0.6.0, which was causing a bunch of class initialization errors.
Bumping Beam to 2.37.0 bumps the transitive dependency to 0.7.4 so we
can always use that.
1. testRun_withPrefix() in RdeUploadActionTest calls a mock lock
handler and does not actually try to read from the fake GCS
implementation. Therefore there's no point setting it up.
2. Remove an unused field in UploadDatastoreBackupActionTest.
* Remove static methods in back up actions
* Remove BigqueryPollJob helper class
* Add schedule time in task comparison
* Change payload type from byte[] to ByteString
* Fix a subtle issue in BRDA copy caused by Cloud Tasks
After the Cloud Tasks migration and #1508, the BRDA copy job now
routinely fails on the first try because the revision update is not
committed by the time the Cloud Tasks job enqueued in the same
transaction runs for the first time. This is because the enqueueing is
a side effect and not part of the transaction. The job eventually
succeeds because of retries.
This PR attempts to mitigate the initial failure by adding a delay to
the enqueued job, and checking the cursor in the job itself to prevent
it from running before the transaction is committed.
* Fix issues with saving and deleting gap records
Datastore limits us to mutating up to 25 records per transaction. We
sometimes exceed that when deleting expired gap records. In addition, it is
theoretically possible for us to accumulate enough continuous gap records to
exceed this count while replaying the original transaction.
Deal with deletion by breaking up the gap records to be deleted into a batch
size that is small enough to be deleted transactionally (in practice, we don't
much care about the transactionality but it doesn't seem like we can delete
batches without it).
Deal with the possibility of too many additions by always breaking out gap
record storage and last transaction number updates into their own
transaction(s) (separate from the replay of the original SQL transaction).
These are useful for the purposes of filtering by one-time/multi-use tokens, and
for determining which one-time tokens have been used (and if so, for which
domain).
* Track and replay Transaction table gaps
Id gaps in the Transaction table can be the result of transactions committed
out of order. To deal with this, keep track of gaps for up to five minutes
and check to see if they've been back-filled prior to applying the next batch
of transactions during replay.
* Changes for review
* Calculate gap expiration time before gap queries
* Reformat.
* Add 3 more SQL indexes to the Host table
These indexes on creationTime, deletionTime, and currentSponsorRegistrarId are
present on the other two EPP resource tables (Domain and Contact), and are
useful for a wide variety of operations/analytics queries.
* Improve cache loading in Registries.java
The loader for the TLD cache in Registries.java unnecessarily reads from
another cache when running with SQL, potentially triggering additional
database access. This code runs in the whois query path, and contributes
to the high latency in sandbox.
The query analyzer identified this as a missing index on the BillingEvent table,
and I added it for recurrences and cancellations as well as it's likely to be a
problem for them too. "Give me all the billing events associated with a given
domain by its repo ID" seems like a pretty common use case for the DB (and does
appear to be used by our invoicing pipeline).
This is a follow-up to PR #1545.
These indexes were identified as missing by PostgreSQL's query analyzer in our
sandbox environment (where we get enough realistic EPP traffic to identify these
deficiencies).
Note that a lot of the new indexes being named have to use the DB representation
of the column name because they are either embedded or subclassed entities,
whereas most of the existing ones are able to simply refer to Java field names.
This is the Java schema follow-up PR to PR #1541, which is what added the
actual DB changes through Flyway scripts.
* Reorganize new schema changes
Reorganized new schema changes and made each Flyway script update a
single table.
Each flyway script is executed in a single database transaction so that
the script can be rolled back in one shot. It acquires a shared lock on
all tables touched by the script. This is deadlock-prone because in a
busy database, there may be user queries that attempt to lock the same
set of tables, but in different order. By limiting each script to one
table, we avoid the problem.
We should have a presubmit check to enforce this rule.
All changes have been deployed to Sandbox out-of-band. When doing so,
we changed all CREATE INDEX statements to CREATE INDEX IF NOT EXISTS.
Future deployments should be able to proceed normally.
* Revise Host index on inet_addresses
The index on the 'inet_addresses' array column should be of gin or gist
type, which index individual array elements. We use gin for now since
host updates are infrequent, and gin has better accuracy.
Since flyway script V108__... has not been deployed, we edit the file
in place instead of adding a new script.
This will be followed up with a modified query that can take advantage
of the gin index. Until then we don't expect to see performance
improvement.
The suspected bottleneck query in the whois path is:
select * from "Host" where 'non-ip-string' = any(inet_address) and
deletion_time < now();
It needs to be revised into:
select * from "Host" where array['non-ip-string'] <@ inet_address and
deletion_time < now();
The combined change reduces the query time from 90ms to 30ms in Sandbox,
and from 150ms to 40ms in production.
It is unclear if this solves all problems with whois latency.
* Add deletionTime/inetAddresses indexes to Host table to support WHOIS
Weimin identified these as missing, and being the cause of slowdowns in
NameserverLookupByIpCommand that we're seeing in sandbox.
This is the first of two PRs, adding just the Flyway/schema changes. The
second PR adding the Java object model changes is #1547.
* Add domainRepoId indexes to billing events #1544
The query analyzer identified this as a missing index on the BillingEvent table,
and I added it for recurrences and cancellations as well as it's likely to be a
problem for them too. "Give me all the billing events associated with a given
domain by its repo ID" seems like a pretty common use case for the DB (and does
appear to be used by our invoicing pipeline).
This is the first of two PRs that makes just the DB changes. The second PR
(#1544) will add the Java code changes, and will be committed after this one is
deployed.
* Add 9 more indexes to SQL schema
These indexes were identified as missing by PostgreSQL's query analyzer in our
sandbox environment (where we get enough realistic EPP traffic to identify these
deficiencies).
This is the first of two PRs -- the second PR (#1540) will be merged only after
this one is live in production. Note that this is the PR that actually modifies
the database though, so once this one is deployed we will already have the
benefit of the new indexes.
- Use the standard HttpsURLConnection to write/read data
- Rewrite RdeReporter, Nordn*Action, and Marksdb classes and related
tests to conform to the new format
- Remove FakeURLFetchService and ForwardingUrlFetchService as they weren't used
- Refactor UrlFetchException to UrlConnectionException
- Refactor UrlFetchUtils to UrlConnectionUtils
I will need to test this on Alpha. Fortunately the connections that
don't require auth (e.g. TMDB downloading) should be testable.
* Allow replicateToDatastore to skip gaps
As it turns out, gaps in the transaction id sequence number are expected
because rollbacks do not roll back sequence numbers.
To deal with this, stop checking these.
This change is not adequate in and of itself, as it is possible for a gap to
be introduced if two transactions are committed out of order of their sequence
number. We are currently discussing several strategies to mitigate this.
* Remove println, add a record verification
* Fix hanging test
Tests using the TestServerExtension may hang forever if an underlying
component (e.g., testcontainer for psql) fails. This may be the cause
of some Kokoro runs that timed out after three hours.
* Add a tools command to launch SQL validation job
Stop using Pipeline.run().waitUntilFinish() in
ValidateDatastorePipeline. Flex Templates do not support a blocking
wait in the main thread.
This PR adds a new ValidateSqlCommand that launches the pipeline and
maintains the SQL snapshot while the pipeline is running.
This PR also added more parameters to both ValidateSqlCommand and
ValidateDatastoreCommand:
- The -c option to supply an optional incremental comparison start time
- The -r option to supply an optional release tag that is not 'live',
e.g., nomulus-DDDDYYMM-RC00
If the manual launch option (-m) is enabled, the commands will print the
gcloud command that can launch the pipeline.
Tested with sandbox, qa and the dev project.
* Don't reset the update time for TLD updates
It turns out that the reason that the Registrar update timestamp isn't updated
for some of the tests is that the record is updated unchanged. We can
avoid this problem by not trying to update the registrar to the same value.
So in this case, if the registrar already contains the TLD we're adding, don't
try to add it.
* Fix entity delete replication, compare db @ replay
Replay tests currently only verify that the contents of a transaction
can be successfully replicated to the other database. They do not verify that
the contents of both databases are equivalent. As a result, we miss any
changes omitted from the transaction (as was the case with entity deletions).
This change adds a final database comparison to ReplayExtension so we can
safely say that the databases are in the same state.
This comparison is introduced in part as a unit test for the one-line fix for
replication of an "entity delete" operation (where we delete using an entity
object instead of the object's key) which so far has only affected PollMessage
deletion. The fix is also included in this commit within
JpaTransactionManagerImpl.
* Exclude tests and entities with failing comparisons
* Get all tests to pass and fix more timestamps
Fix most of the unit tests that were broken by this change.
- Fix timestamp updates after grace period changes in DomainContent and for
TLD changes in Registrar.
- Reenable full database comparison for most DomainCreateFlowTest's.
- Make some test entities NonReplicated so they don't break when used with
jpaTm().delete()
- Disable checking of a few more entity types that are failing comparisons.
- Add some formatting fixes.
* Remove unnecessary "NoDatabaseCompare"
It turns out that after other fixes/elisions we no longer need these for
any tests in DomainCreateFlowTest.
* Changes for review
* Remove old "compare" flag.
* Reformatted.
* Make a few quality-of-life improvements in CloudTasksUtils
1. Update the method names. There are too many overloaded methods and it
is hard to figure out which one does which without checking the
javadoc.
2. Added a method in the task matcher to specify the delay time in
DateTime, so the caller does not need to convert it to Timestamp.
3. Remove the explicit dependency on a clock when enqueueing a task with
delay; the clock is now injected directly into the util instance
itself.
* Disable prober data deletion cron job in prod & sandbox
This is going to unnecessarily make the database migration more complex, and we
don't need them that badly. We'll re-enable these cron jobs once we've written
the new version of this action that handles Cloud SQL correctly (the current
version only does Datastore anyway).
* Ignore prober data when comparing databases
Completely ignore prober data when comparing Datastore and SQL.
Prober data deletions are not propagated from Datastore to SQL. It is
difficult to distinguish soft-deletes from normal updates, therefore
difficult to avoid false positives when looking for differences.
This is necessary because the Cloud Tasks API is not transactionally enrolled,
so it's possible that multiple tasks might end up being enqueued. We need to be
able to handle them.
* Fix update timestamps for DomainContent types
We expect update timestamps to be updated whenever a containing entity is
modified and persisted, but unfortunately Hibernate doesn't seem to do this --
instead it appears to regard such an entity as unchanged.
To work around this, we explicitly reset the update timestamp whenever a
nested collection is modified in the Builder.
Note that this change only solves the problem for DomainContent. All other
entities containing UpdateAutoTimestamp will need to be audited and
instrumented with a similar change.
* Fix a handful of tests broken by this change
* Reformatted.
* Use CloudTaskUtils to enqueue
* Add CloudTasksUtilsModule to FrontendComponent
* Fix Uri query issue
* Remove header and check service in matcher
* Use a ThreadLocal boolean in TestServer to determine enqueueing
* Extract enqueuing and email sending from tm().transact()
* Add action to DB comparison pipeline
Add a backend Action in the Nomulus server that launches the pipeline for
comparing Datastore (secondary) with Cloud SQL (primary).
* Save progress
* Revert test changes
* Add pipeline launching
* Fix create/update timestamp replay problems
When CreateAutoTimestamp and UpdateAutoTimestamp are inserted into a
Transaction, their values are not populated in the same way as when they are
stored in the course of an SQL commit. This results in different timestamp
values between SQL and datastore during the SQL -> DS replay.
Fix this by providing these values from the JPA transaction time when we're
doing transaction serialization.
This change also removes the initialization of the Ofy clock in
ExpandRecurringBillingEventsActionTest. It's not necessary as the
ReplayExtension already takes care of this and doing it after the
ReplayExtension as we were breaks a test now that the update timestamps are
correct.
* Remove ReportingUtils and use CloudTaskUtil to enqueue
* Use schedule time helper to enqueue and update schedule time comparison
* Fix comment, indentation in gradle file and improve time comparison
* Change from TaskQueueUtils to CloudTasksUtils in LoadTestAction
* Put X_CSRF_TOKEN in task headers
* Fix schedule time and gradle issue
* Remove TaskQueue constant dependency
* Double run seconds
* Add comment for X_CSRF_TOKEN
* Fix flaky RdeStagingActionDatastoreTest
Fixed the most common cause that makes one method flaky (Clock and
timestamp problem). Added a TODO to rethink test case.
Also added notes on tasks potentially enqueued multiple times.
The gcloud command does some weird stuff with sorting when a custom format
is used. Here we instead rely on the Linux sort and head commands to sort the
versions list.
* Add an index on Host.host_name column
This field is queried during host creation and needs an index to speed
up the query.
Since Hibernate does not explicitly refer to indexes, we can change the
code and schema in one PR.
* Use the built-in replicaJpaTm() in RDAP
This includes a test for the replica-simulating transaction manager and
removal of any replica-specific code in RDAP tests, because it's
unnecessary due to the existing tests.
* Add DS validation to match Cloud DNS
* Add checks to flows
* Add some flow tests
* Add tests for DomainCreateFlow
* Add tests for UpdateDomainCommand
* Fix docs test
* Small fixes
* Remove builder from tests
Standard mode will determine the watermarks based on the cursors and
kick off subsequent uploading steps. In order to run both the Beam and
the Mapreduce pipeline in parallel, we need to allow setting the beam
parameter when in standard mode. This change should have been part of
https://github.com/google/nomulus/pull/1500.
The cached methods are only used in situations where we don't really
care about being 100% synchronously up to date (e.g. whois), and they're
not used frequently anyway, so it's safe to use the replica in these
locations.
* Compare migration data with SQL as primary DB
Add a BEAM pipeline that compares the secondary Datastore against SQL.
This is a dumb pipeline to be launched by a driver (in a followup PR).
Manually tested pipeline in sandbox.
Also updated the ValidateSqlPipeline and the snapshot finder class so
that an appropriate Datastore export is found (one that ends before the
replay checkpoint value).
We can use it more places later but this can serve as a template. We
should inject the connection to the read-only replica (only created
once) to the constructor of the action, then use that instead of the
regular transaction manager.
We add a transaction manager that simulates the read-only-replica
behavior for testing purposes as well.
In addition, we set the transaction isolation level to READ COMMITTED
for this transaction manager (this is fine since we're never writing to
it). Postgres requires this for replica SQL access (it fails if we try
to use SERIALIZABLE transactions). We didn't see this with the pipelines
before since those already had transaction isolation level overrides.
* CommitLog handling code should call ofyTm
The tm() call will use JPA transaction manager after the switch-over to
SQL. These calls would lose their transaction semantics.
Both actions are to be invoked after the switchover in case we have to
switch back to Datastore as primary.
* Only compare recent changes in Datastore and SQL
When comparing Datastore and SQL, ignore older History and EPP resource
objects. This cuts the run time in half compared with a full comparison.
The intention is to run a full comparison before the switch-over from
Datastore and SQL, and run this incremental comparison during the down
time.
The incremental comparison takes about 25 minutes in production.
Performance can be improved further by filtering out older billing
events (OneTime and Cancellation). However, we don't think further
optimization is worth the effort (considering that Recurring events
cannot be filtered since they are mutable but lack a lastUpdateTime).
Verified in Sandbox and prod with and without time filter.
We already have ValidateEscrowDepositCommand to check for internal
reference consistency of two deposits, i.e., making sure that all
contacts and hosts referenced by domains exist in the same deposit.
Therefore to compare whether two deposits are equal we only need to make
sure that they contain the same domains and registrars, assuming they
both pass the validation. We don't compare their contents directly
because the MapReduce deposit contains all contacts and domains whereas
the Beam deposit only contains referenced ones, making a direct
comparison impossible.
* Speed up updating of premium lists
There are two parts to this:
1. Don't load the premium entries in the command prompt (this isn't
necessary and we didn't display that information anyway).
2. Set a proper batch size (rather than just 1) when saving all the
premium entries. This means that we generate only one INSERT statement
rather than N statements.
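As a general sketch (not the project's exact change), batched JPA inserts look roughly like this; flushing and clearing per batch lets Hibernate group the INSERTs and keeps the persistence context small:

```java
import com.google.common.collect.Iterables;
import java.util.List;
import javax.persistence.EntityManager;

final class BatchedInsertExample {

  // Illustrative; should line up with hibernate.jdbc.batch_size.
  private static final int BATCH_SIZE = 1000;

  /** Persists all entries in batches inside an already-open transaction. */
  static <T> void saveAll(EntityManager em, List<T> entries) {
    for (List<T> batch : Iterables.partition(entries, BATCH_SIZE)) {
      batch.forEach(em::persist);
      em.flush(); // Send this batch to the database as grouped INSERT statements.
      em.clear(); // Detach what was just written so the context doesn't grow unbounded.
    }
  }
}
```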
* Allow usage of a read-only Postgres replica
This adds the Dagger provider code for both the regular and the BEAM
environments, which are similar but not quite the same.
In addition, this demonstrates usage of the replica DB in the
RdePipeline. I tested this on alpha with a modified version of the
RdePipeline that attempts to write some dummy values to the database and
it failed with the expected message that one cannot write to a replica.
* Resolve ResaveEntityAction related conflicts
* Replace string with existing constants
* Remove solved TODOs related to ofy string to new vkey string
* Add a TODO for clean up
* Fix missing annotation
* Make not logged in errors take precedence over extension errors
This is the right order to do the checks in, because if the registrar isn't
logged in (or their login failed) then they will have an empty set of declared
extensions, so any attempt to use an extension will throw a "Service
extension(s) must be declared at login" error. This is potentially misleading
because the actual error in this situation is that the registrar isn't logged
in at all.
This also fixes some flows that weren't declared final (but should be), or
methods declared final on final classes, which is superfluous.
* Don't throw errors when existing premium list is empty
This state is possible to get into when things go wrong and it shouldn't prevent
saving new revisions of the list. Note that it will continue to throw errors if
you attempt to save a new revision that is blank (which is usually a mistake).
See http://b/211774375
* Make premium list saving run as a single transaction
This fixes the bug where the new revision is saved, but then execution gets
halted for some reason (e.g. request timeout) before the entries finish saving,
which leaves the DB in a bad state with a new top revision containing zero
entries, thus making everything standard.
* Add pending action extension to server update poll messages
This is necessary for the poll messages to contain the necessary context
explaining what domain name the relevant statuses were being added/removed
to/from.
* Always use JPA TM on Beam
Beam does not have access to datastore. Using ofy on Beam always results
in an error. Normally we should use database migration state schedule to
determine which TM to use, but on Beam there's no point in doing so. By
hard-coding the TM on beam to be SQL we can start testing features before
we migrate to SQL mode, for example the new RDE pipeline.
Also made a change to where the manual deposits are stored. It made more
sense to store them under manual/[directory]/[jobname]/ instead of
[jobname]/manual/[directory]/.
TESTED=deployed the pipeline on production and ran a job.
* Small fixes to show_upgrade_diffs
- fix fetch for an existing directory (we can't fetch to local "master"
branch, use "origin/master" instead).
- add a newline after "removed" entries.
This version of Beam does not have an explicit dependency on log4j.
There are a couple of other things that need to change due to the
upgrade.
1) The new version pulls in a dependency that is not on Maven Central
but on packages.confluent.io, so we need to explicitly add this repo.
2) The new version has a dependency on Flogger 0.6 and above, which removed
the LoggerConfig class (see google/flogger#142).
We therefore backported the class. In the long term we should do what
was suggested in the issue and use the normal JDK Logger config
directly.
3) The initSqlPipeline dependency graph also needs to be updated.
* Ignore Prober related entities when comparing db
Deletions of prober entities are not propagated to SQL, resulting in two
types of mismatches: an entity only exists in SQL, or copies of an entity
differ in deleteTime. Neither case should count as an error.
* Make ImmutableObject.toString deterministic
Remove the identity hash from the output. There is no use case
(including debugging) for it.
Removing it allows us to also remove some overriding implementations in
subclasses, and may also simplify tests.
* Improve logging/comments for commit log forks
It looks like the diff file lister is doing a second constructDiffSequence()
when a commit diff file is missing from the final sequence for purely
informational purposes. However, this purpose wasn't clear when investigating
an actual case of this.
This PR adds another warning to hopefully make the log output a bit more
useful, and also promotes the "gap" log message to a warning and adds a
comment indicating the purpose of the second constructDiffSequence().
This adds two new options:
1) An option to run RDE in lenient mode.
2) An option to run RDE with the new Beam pipeline regardless of the datastore setting.
* Copy into PersistentSets in Domains if applicable
This is similar to https://github.com/google/nomulus/pull/1456
It is possible that in some cases we could get an exception:
Caused by: org.hibernate.HibernateException: A collection with cascade="all-delete-orphan" was no longer referenced by the owning entity instance: [parent]
The main cause of this, according to research (StackOverflow :P) is that
when Hibernate is calling the setters for these sets of children it's
losing the connection to the previously-managed child entity (which it
needs, in order to know how to delete orphans). Thus, the solution is to
maintain the same instance of the persistent set and just add/remove
to/from it as necessary.
This is complicated by the fact that sometimes the setter is given the
persistent set (the one we want to keep) and sometimes (?) it isn't.
In replay (and possibly in other cases) we're getting an exception:
Caused by: org.hibernate.HibernateException: A collection with cascade="all-delete-orphan" was no longer referenced by the owning entity instance: google.registry.model.domain.DomainHistory.internalDomainTransactionRecords
The main cause of this, according to research (StackOverflow :P) is that
when Hibernate is calling the setters for these sets of children it's
losing the connection to the previously-managed child entity (which it
needs, in order to know how to delete orphans). Thus, the solution is to
maintain the same instance of the persistent set and just add/remove
to/from it as necessary.
This is complicated by the fact that sometimes the setter is given the
persistent set (the one we want to keep) and sometimes (?) it isn't. We
will need to try this out to be sure.
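The resulting setter shape looks roughly like this (field and types are hypothetical): keep whatever collection instance Hibernate handed us and mutate it in place instead of replacing it.

```java
import java.util.HashSet;
import java.util.Set;

final class PersistentSetHolder {

  private Set<String> children = new HashSet<>();

  /** Replaces the collection's contents without replacing the (possibly persistent) instance. */
  void setChildren(Set<String> newChildren) {
    if (children == null) {
      children = new HashSet<>(newChildren);
      return;
    }
    if (children != newChildren) {
      // Mutate in place so Hibernate's orphan-removal tracking keeps its managed reference.
      children.clear();
      children.addAll(newChildren);
    }
  }
}
```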
see https://b.corp.google.com/issues/208629747 for details; this brings
in an old Gradle version as a transitive dependency
Version 2.x of the task-tree plugin uses Gradle 6.8 (or higher)
The cardinality for the paths is unbounded, and could generate a huge
number of metrics if someone is scanning our web WHOIS endpoint.
See b/209488119 for an example of such a sudden increase in metric volume.
* Add replicateToDatastore to non-prod cron files
This shouldn't do anything yet (since ReplicateToDatastoreAction checks the
migration state before doing anything) but we'll want to have this in
place.
* Filter out empty dsData objects, not just null ones
Hibernate/SQL will get mad if the digest is null or empty, and
previously we only checked for null. We should filter out empty digests as
well.
* Properly handle Joda Money in JPA
Joda Money has BigDecimal as amount, which is mapped to a numeric(19,2)
column in the database. As a result, the Money amount loaded from the DB has
scale 2. This becomes a problem with currencies such as JPY, which
requires scale to be 0. To properly load a currency, we must adjust the
scale post-load.
The current approach, which uses Hibernate component mapping, puts the
burden of post-load cleanup on each entity type that uses Money. It is
easy to forget this, as we just discovered.
This PR uses a CompositeUserType to map Money. It adjusts the scale
properly when loading Money instances. Although CompositeUserType appears
to be deprecated in Hibernate 6, it is the only proper solution right
now for mapping non-owned classes.
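The post-load scale fix amounts to something like the following (a sketch of the idea, not the actual CompositeUserType implementation):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import org.joda.money.CurrencyUnit;
import org.joda.money.Money;

final class MoneyColumnLoading {

  /**
   * Rebuilds a Money from raw column values, adjusting the numeric(19,2) scale to whatever the
   * currency requires (e.g. 0 for JPY) so that Money.of() does not reject the amount.
   */
  static Money load(String currencyCode, BigDecimal amount) {
    CurrencyUnit currency = CurrencyUnit.of(currencyCode);
    BigDecimal rescaled = amount.setScale(currency.getDecimalPlaces(), RoundingMode.UNNECESSARY);
    return Money.of(currency, rescaled);
  }
}
```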
This is what's causing https://b.corp.google.com/issues/208274109, where
there are DTR rows with null foreign key values.
We should probably wait to make the columns officially non-null until we
get this in and verify that we can do so.
* Write commit logs during SQL->DS replay
Previously, we had no way to ignore read-only mode while still writing
commit log backups. Now, we added this so we can write commit logs in
the SQL->DS replay.
Note:
- When moving to either of the DATASTORE_PRIMARY stages, one must
manually set the SqlReplayCheckpoint first. We don't write to SQL with
backup in this stage because we already wrote the transaction in
question to Datastore. The fact that we manually set the replay
checkpoint means that we'll ignore the extra commit logs that might
otherwise cause problems if we switched back and forth from
DATASTORE_PRIMARY to SQL_PRIMARY.
- The commit logs written during the SQL_PRIMARY phase will, ideally, be
unused. We write them here only so that in the event of a rollback to
Datastore, we will have them for RDE purposes.
This is a result of bad data (we should never allow a null digest) and
we'll need to fix that separately, but this allows us to not fail on
this during replay
* Add NotLoggedInException tests to flows and flow docs
This wasn't included in flows.md before because the test existed in
ResourceFlowTestCase. So even though the exception could be thrown and
even though this was tested, it wasn't picked up in the documentation
because the documentation is picked up from the corresponding concrete
test class.
* Validate SQL with Datastore being primary
Validates the data asynchronously replicated from Datastore to SQL.
This is a short term tool optimized for the current production database.
Tested in production.
We want to keep the read-only-mode-exception as an unchecked exception,
so we introduce a temporary check in the EppController that provides a
specific error message for this situation (rather than letting it fall
through to the generic "command failed" messaging).
* Replace with stringify() and VKey.create(string)
* Convert implicit cases of VKey.fromWebsafeKey(string)
* Convert from Key to VKey to use stringify()
* Modify existing code to show correct string representation of a key
* Use VKey.create(websafeKey) to get ofy key in ResaveEntitiesCommand
* Add TODO note in CommitLogMutation and determine if key string should be modified
* Revert from stringify() to getOfyKey().getString()
* Add bug ids to TODOs
* Ignore read-only mode in SQL->DS replication process
We need to be able to save indices and save data about the replication
even when we're in read-only mode.
We can handle it the same way that we handle UpdateAutoTimestamp, where
we simply populate it in SQL if it doesn't exist. This has the following
benefits:
1. The converter is unnecessary code
2. We get non-null column definitions for free (overridden in
EppResource to allow null creation times so that legacy *History objects
can contain null in that field)
3. More importantly, this allows us for proper SQL->DS replay. If the
field is filled out using a converter (as before this PR) then the field
is only actually filled out on transaction commit (rather than when the
write occurs within the transaction). This means that when we serialize
the Transaction object during the transaction (the data that gets
replayed to Datastore), we are crucially missing the creation time.
If the creation time is written on commit, we have to start a new
transaction to write the Transaction object, and it's an absolute
necessity that the record of the transaction be included in the
transaction itself so as to avoid situations where the transaction
succeeds but the record fails.
If the field is filled out in a @PrePersist method, crucially that
occurs on the object write itself (before transaction commit).
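A generic sketch of the @PrePersist approach (the field type and clock handling here are illustrative, not the project's actual CreateAutoTimestamp plumbing):

```java
import java.time.Instant;
import javax.persistence.Column;
import javax.persistence.MappedSuperclass;
import javax.persistence.PrePersist;

@MappedSuperclass
abstract class CreationTimedEntity {

  @Column(nullable = false)
  Instant creationTime;

  /** Runs on the entity write itself, i.e. before the transaction commits. */
  @PrePersist
  void fillCreationTimeOnPersist() {
    if (creationTime == null) {
      creationTime = Instant.now(); // The real code would use the transaction's clock instead.
    }
  }
}
```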
The original RDE pipeline was a direct translation of the App Engine
MapReduce logic. It turned out to be too slow (taking more than a day to
run) due to the way it finds the most recent history entry.
This PR overhauled the pipeline by using embedded EPP resource entities
inside history entries (only available in SQL) and finding the most
recent entries using the SQL engine. It cuts the time down to ~2h.
Note that there are quota limits on the CPU cores and external IP
addresses for a given GCP region inside a project, which will need to
accommodate the resource requirements for the pipeline. More details are
provided in comments.
Also merged the update cursor stage and enqueue next action stage in
RdeIO so that they can be done within a transaction, same as how
MapReduce handles them.
* Change TaskOptions to Task in CommitLogFanoutAction
* Add a createTask method that takes clock and jitterSeconds
* Change CreateTask parameter type and improve test cases
* Improve comments and test cases
* Improve test cases that handle jitterSeconds
* Grandfather in old data for one-time billing event requirement
We have data from 2018 and earlier where we didn't consistently set periodYears
for OneTime BillingEvents with certain reasons. This grandfathers in that old
data so that we can successfully move it over to Cloud SQL for now, then we can
later run a query that will backfill it, after which we can then tighten up the
requirement again. Note that the requirement is still being enforced for all
billing events from 2019 onwards.
This also improves the handling of validation, by adding a private field to the
Reason enum rather than creating a throwaway inline ImmutableSet in the
Builder.
BSD sed requires a parameter to -i to indicate the backup suffix. By
adding a blank suffix the sed command works on both Linux and macOS.
* Make TaskMatcher default to POST methods
TaskOptions.Builder.withUrl() defaults to POST methods. Therefore, it seems
reasonable to verify that task queue methods are using the POST method,
especially given that the method must now be identified explicitly when using
CloudTaskUtils. This check would have guarded against the bug fixed by #1413.
* Elaborate on comment
* Further improved the comment
* Remove the ineffective SQL injection check
Remove the ineffective SQL-injection attack check in go/r3pr/954. It is
quite restrictive, causing a long exempt list. It also doesn't protect
queries made through helpers such as QueryComposer etc.
We will start from scratch for a new solution.
* Add the Cloud SQL queries for transaction reports
* Add the remaining queries
* Some query fixes
* Fix comments
* Fix indentation in total_nameservers
* Fix indentation on other Case condition
* Fix InitSqlPipeline regarding synthesized history
There are a few bad domains in Datastore that we hardcoded to ignore
during SQL population. They had no history entries, so we didn't need to
filter when writing history.
Recently we created synthesized history for domains, including the bad
domains. Now we need to filter History entries.
* Support shared database snapshot
Allow multiple workers to share a CONSISTENT database snapshot. The
motivating use case is SQL database snapshot loading, where it is too
slow to depend on one worker to load everything.
This currently is postgresql-specific, but will be improved to be
vendor-independent.
Also made sure AppEngineEnvironment.java clears the cached environment
in all cases when tearing down.
* Update terraform files and instructions
Update proxy terraform files based on current best practices and allow
exclusion of forwarding rules for HTTP endpoints. Specifically:
- Add a "public_web_whois" input to allow disabling the public HTTP
whois forwarding.
- Add "description" fields to all variables.
- Move outputs of the top-level module into "outputs.tf".
- Auto-reformat using hclfmt.
* Make entities serializable for DB validation
Make entities that are asynchronously replicated between Datastore and
Cloud SQL serializable so that they may be used in BEAM pipeline based
comparison tool.
Introduced an UnsafeSerializable interface (extending Serializable) and
added to relevant classes. Implementing classes are allowed some
shortcuts as explained in the interface's Javadoc. Post migration we
will decide whether to revert this change or properly implement
serialization.
Verified with production data.
This is used for the replay locks so that Beam pipelines (which will be
used for database comparison) can acquire / release locks as necessary
to avoid database contention. If we're comparing contents of Datastore
and SQL databases, we shouldn't have replay actively running during the
comparison, so the pipeline will grab the locks.
Beam doesn't always play nicely with loading from / saving to Datastore,
so we need to make sure that we store the replay locks in SQL at all
times, even when Datastore is the primary DB.
* Re-enable replay tests for most environments
This enables the replay tests except in environments where
the NOMULUS_DISABLE_REPLAY_TESTS environment variable is set to "true".
* Add a check for null
* Alt entity model for fast JPA bulk query
Defined an alternative JPA entity model that allows fast bulk loading of
multi-level entities, DomainBase and DomainHistory. The idea is to bulk
load the base table and the child tables separately, and assemble them
into the target entity in memory in a pipeline.
For DomainBase:
- Defined a DomainBaseLite class that models the "Domain" table only.
- Defined a DomainHost class that models the "DomainHost" table
(nsHosts field).
- Exposed ID fields in GracePeriod so that they can be mapped to domains
after being loaded into memory.
For DomainHistory:
- Defined a DomainHistoryLite class that models the "DomainHistory"
table only.
- Defined a DomainHistoryHost class that models its namesake table.
- Exposed ID fields in GracePeriodHistory and DomainDsDataHistory
classes so that they can be mapped to DomainHistory after being
loaded into memory.
In PersistenceModule, provisioned a JpaTransactionManager that uses
the alternative entity model.
Also added a pipeline option that specifies which JpaTransactionManager
to use in a pipeline.
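A rough sketch of what the "Lite" mapping looks like (table and column names here are illustrative; the real classes carry many more fields):
```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Maps only the base "Domain" table. Child rows (e.g. "DomainHost") are loaded
// by separate bulk queries and stitched onto the domain in memory by the pipeline.
@Entity
@Table(name = "Domain")
class DomainBaseLiteSketch {

  @Id
  String repoId;

  String domainName;
  String currentSponsorRegistrarId;

  // Deliberately no @OneToMany / @ElementCollection mappings, so bulk loading
  // this entity never triggers per-row queries against the child tables.
}
```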
I observed an instance in which a couple queries from this action were,
for whatever reason, hanging around as idle for >30 minutes. Assuming
the behavior that we saw before where "an open idle serializable
transaction means all pg read-locks stick around forever" still holds,
that's why the number of read locks in use spirals out of
control.
I'm not sure why those queries aren't timing out, but that's a separate
issue.
* Fix problems with the format tasks
The format check is using python2, and if "python" doesn't exist on the path
(or isn't python 2, or there is any other error in the python code or in the
shell script...) the format check just succeeds.
This change:
- Refactors out the gradle code that finds a python3 executable and uses it
to get the python executable to be used for the format check.
- Upgrades google-java-format-diff.py to python3 and removes the #! line.
- Fixes shell script to ensure that failures are propagated.
- Suppresses error output when checking for python commands.
Tested:
- verified that python errors cause the build to fail
- verified that introducing a bad format diff causes check to fail
- verified that javaIncrementalFormatDryRun shows the diffs that would be
introduced.
- verified that javaIncrementalFormatApply reformats a file.
- verified that well formatted code passes the format check.
- verified that an invalid or missing PYTHON env var causes
google-java-format-git-diff.sh to fail with the appropriate error.
* Fix presubmit issues
Omit the format presubmit when not in a git repo and remove unused "string"
import.
* Add a beam pipeline to create synthetic history entries in SQL
The logic is mostly lifted from CreateSyntheticHistoryEntriesAction. We
do not need to test for the existence of an embedded EPP resource in the
history entry before creating a synthetic one, because after
InitSqlPipeline runs it is guaranteed that no embedded resource exists.
* Set payload in success response after sending expiring certificate notification emails
* Modify log message and test cases for run() in sendExpiringCertificateNotificationEmailAction
* Resolve merge conflict
* Include reason and requestedByRegistrar in URS test file
* Modify test cases for new parameters in renew flow
* Add reason and registrar_request to renew domain command
* Update comments for new params in renew flow
* Make changes based on feedback
* Update parameter to Datastore wipe pipeline
Add the newly required RegistryEnvironment parameter to
BulkDeleteDatastorePipeline.
Remove the nullable annotation for this parameter in options
class.
Update metadata files regarding this parameter.
* Implement several fixes affecting test flakiness
- Continued to do transaction manager cleanups on replay failure (lack of this
may be causing cascading failures).
- Fix UpdateDomainCommandTest's output check (the test was checking for error
output in standard error, but the command writes its output to the logs.
Apparently, these may or may not be reflected in standard error depending on
current global state)
- Remove unnecessary locking and incorrect comment in CommandTestCase. The
JUnit tests are not run in parallel in the same JVM and, in general, there
are much bigger obstacles to this than standard output stream locking.
* Fix bad log message check
This was added recently in PR #1341 as an attempted fix for our test flakiness,
but it turns out that it didn't address the root issue (whereas PR #1361
did). So this removes the fallback, as there's no reason this should ever be
called outside of a transactional context.
We're seeing some of these in CreateSyntheticHistoryEntriesAction and I
can't tell why from the logs (it doesn't appear to print the repo ID or
domain/host name)
* Add TmOverrideExtension for more safe TM overrides in tests
This is safer to use than calling setTmForTest() directly because this extension
also handles the corresponding call to removeTmOverrideForTest() automatically,
the forgetting of which has been a source of test flakiness/instability in the
past.
There are now broadly two ways to get tests to run in JPA: either use
DualDatabaseTest, an AppEngineExtension, and the corresponding JPA-specific
@Test annotations, OR use this override alongside a
JpaTransactionManagerExtension.
* Add a reference to RDAP conformance checker
Make a note of the RDAP conformance checker for the next time that we deal
with the RDAP code - would be nice to have this in the test suite.
* Reformat comment
* Add autorenews to URS (#1343)
* Add autorenews to URS
* Add autorenews to existing xml files for test cases
* Harmonize domain.get() in existing code
* Fix typo in test case name
* Modify existing test helper method to allow testing with different domain bases
* Customize LGTM build command
Our presubmit requires a version of python that is more recent than what
lgtm.com's build environments have installed. Instead of trying to upgrade
them or downgrade our python version, just do the steps of the build that LGTM
needs (i.e. just build the main classes and test classes).
* Fix BigQuery data set name handling in activity reporting
This is not a constant (as it depends on runtime state), so it can't be named
using UPPER_SNAKE_CASE. Additionally, it's not good practice to use field
initialization when there's logic depending on runtime state involved. So this
PR changes the class to use constructor injection and moves the logic into the
constructor.
* Add fix for ICANN reporting provider
* Extract out ICANN reporting data set
* Inject TransactionManager
* Make TransactionInfo static (per Mike)
* Use ofyTm() in BackupTestStore
* Revert extraneous formatting
* Use auditedOfy in CommitLogMutationTest
This PR adds the final step in RDE pipeline (enqueueing the next action
to Cloud Tasks) and makes some necessary changes, namely by making all
CloudTasksUtils related classes serializable, so that they can be used
on Beam.
uberjar task and uberjar name are now different (beamPipelineCommon and
beam_pipeline_common, respectively). This is more idiomatic with regard
to naming conventions, but we now need two different variables.
* Fix sandbox cron
"synchronized" can only be used to specify a 24h time range that is
evenly divided by the interval value, e. g. "every 2 hours
synchronized".
* Change to a different time
After #1348 it is no longer necessary to use AppEngineEnvironment in
Beam pipelines. In tests it is taken care of by the
DatastoreEntityExtension whereas on Dataflow the
RegistryPipelineWorkerInitializer does the same initialization for Ofy.
Both `DatastoreEntityExtension.PlaceholderEnvironment` and `AppEngineEnvironment` do the same thing, so there is no point in having both of them exist. Using `AppEngineEnvironment` as an AutoCloseable requires the user to be mindful of where a fake App Engine environment is required. It is better to set this either in the `DatastoreEntityExtension` for tests, or in the worker initializer in Beam. It also makes it easier to remove the fake environment when we are completely Datastore free.
Also made a change to how `IdService` allocates IDs in Beam.
This brings it in line with GetKeyringSecretCommand. We still need to
remove the remaining Cloud KMS-related code in the future.
* Add a cron job to periodically empty out fields on deleted entities that are at least 18 months old
* Process ContactHistory entities via batching
* Improve test cases by not making assertions in a loop
* Make :core:cleanTest depend on FilterTests
The "cleanTest" target doesn't work for our specialized tests derived from
FilterTest. Make them all explicit dependencies of cleanTest so we can reset
the tests from a single target.
* Add/use more DatabaseHelper convenience methods
This also fixes up some existing uses of "put" in test code that should be
inserts or updates (depending on which is intended). Doing an insert/update
makes stronger guarantees about an entity either not existing or existing,
depending on what you're doing.
* Convert more Object -> ImmutableObject
* Merge branch 'master' into tx-manager-sigs
* Revert breaking PremiumListDao change
* Refactor more insertInDb()
* Fight more testing errors
* Merge branch 'master' into tx-manager-sigs
* Merge branch 'master' into tx-manager-sigs
* Merge branch 'master' into tx-manager-sigs
* Merge branch 'master' into tx-manager-sigs
* Add removeTmOverrideForTest() calls
* Merge branch 'master' into tx-manager-sigs
We need these to get created (we are blocked from moving to SQL until 30
days after their creation) so reduce this to 3 in the hopes of avoiding
the SQL overloads while we debug why those are occurring in the first
place.
* Find a suitable version of python.
When running presubmit, we were using /usr/bin/python3, which works fine on
systems that have a reasonably recent python version there. However, our CI
system has a very old version of python there and prefers the use of "pyenv"
to modify the PATH to provide the desired version of python as simply
"python". So add a check to use the first of "python" or "/usr/bin/python3"
that is at least version 3.7.3.
* Add handling for UpdateAutoTimestamp when not in a transaction
It's not clear why this is sometimes causing test flakes, but getting better
logging involved should help clear it up.
This also changes AppEngineExtension to insert without reloading the initial
test data, rather than putting it (potentially involving a merge) and reloading
it in a separate transaction. This should hopefully reduce the chance of weird
conflicts.
It would have been nice if this had failed at compile-time rather than
with an NPE, but we need to make sure to specify that we need to inject this
command to get e.g. the random string generator
In addition, print out only the names of the failed domains (rather than
the entire domain object) for readability.
* Add a presubmit to verify no new JS dependencies
Verify that we have a known set of javascript dependencies. This guards
against the inadvertent introduction of a new dependency with a disallowed
license.
TESTED: Added a new package to packages.json, observed presubmit failure.
* Replaced f-strings, printed python version
For some reason, it looks like we're using a python version older than 3.6 on
our CI machines.
* Remove python version trace.
There are actions for which we want to provide an override for the database
to use, like when launching Spec11 and Invoicing pipelines. It makes sense to
consolidate around the same parameter provided from the same module for
consistency in all cases, instead of defining an override for each action.
* Add locking and a response in ReplicateToDatastoreAction
The response is necessary to get nicer logs in GAE and nicer cron job
behavior.
In addition:
- fix issues where locks would be backed up and replayed to Datastore
(they shouldn't be replayed)
- do ignore-read-only writes when replaying the transactions
* Fix javadoc problems with SoyInfo and subprojects
The *SoyInfo.java files generated by the soy compiler contain deprecation
warnings with links to files that are not imported. This causes a javadoc
warning. Temporarily fix this by replacing the link tags with "LINK". This
also allows us to remove the exclusion of these files, which is a bit nicer.
Also disable javadoc tasks from subprojects. These just break because they
don't have access to the legacy javadoc classes in the root.
The test for this also required a bit of a fix in the Cursor scope
initialization. If you persist a Key<?> in some object in Datastore, it
persists just the standard data you'd expect, basically the parent, the
kind, and the object's ID/name. Then, when you load it back in from
Datastore it uses the app ID of whatever environment you're loading in
(the Key contains this info even though it isn't included in the
toString() of Key)
If you persist the websafe string format of a Key (which is what we do
for Cursors), it includes the app ID so when you load it, it contains
the old app ID not the new one (if the app ID has changed).
In the pipelines, we use the standard default environment which has a
different app ID from the test environment that's set up by the
DatastoreEntityExtension.
As a result, we should check in Cursor to see if the key is *any*
cross-tld-key, rather than the exact one that exists within this app
environment.
* Clean up tx manager insert() signature and add convenience helper method
This is the first of a series of PRs to clean up the type signatures on the
TransactionManager methods (which are way too generic), along with creating some
helper methods for use in tests only that don't require creating transactions
all over the place, thus reducing visual noise at callsites. This first method
is DatabaseHelper.insertInDb(), but there will be plenty of others. Note that
this is only for the Cloud SQL transaction manager -- I'm not bothering to
migrate any Datastore-only code, as that will be going away soon enough.
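A rough before/after sketch of the visual-noise reduction being described (import paths and exact signatures below are assumptions, not the verified API):
```java
import static google.registry.persistence.transaction.TransactionManagerFactory.jpaTm;
import static google.registry.testing.DatabaseHelper.insertInDb;

class InsertInDbSketch {

  void saveForTest(Object entity) {
    // Before: every test save needs its own transaction wrapper.
    jpaTm().transact(() -> jpaTm().insert(entity));

    // After: the test-only helper hides the transaction boilerplate.
    insertInDb(entity);
  }
}
```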
* Skip synthetic history entries for resources that don't need them
The reason for creating synthetic history entries is so that we can
guarantee that each EppResource's most recent *History object contains
that resource at that point in time. If the most recent *History object
in SQL contains that resource already, there is no need to create a
synthetic *History object for that resource.
The build is generating the following lint warnings:
core/src/main/java/google/registry/flows/certs/CertificateChecker.java:246:
warning: [ReferenceEquality] Comparison using reference equality instead of value equality
    && (lastExpiringNotificationSentDate == START_OF_TIME
(see https://errorprone.info/bugpattern/ReferenceEquality)
core/src/test/java/google/registry/backup/ReplayCommitLogsToSqlActionTest.java:350:
warning: [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
    .that(jpaTm().transact((() -> jpaTm().loadByEntity(contactResource))))
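For the first warning, the value-equality form of that comparison would look something like this (the surrounding code is assumed, only the comparison is the point):
```java
import org.joda.time.DateTime;

class ReferenceEqualitySketch {

  // Compare Joda-Time instants by value rather than by reference identity.
  static boolean alreadySentAtStartOfTime(
      DateTime lastExpiringNotificationSentDate, DateTime startOfTime) {
    return lastExpiringNotificationSentDate.isEqual(startOfTime); // not ==
  }
}
```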
* Rename client ID to registrar ID in most places
This is a code-only change, that shouldn't require any sort of data
migration. Correspondingly, there are some existing uses of clientId that are
not migrated (e.g. Datastore fields, task queue payloads, URL parameters for
actions that might be hit from task queues, etc.). And it of course doesn't
modify any fields in EPP XML. Note that the Cloud SQL schema fields are
already named using the registrar_id pattern.
This also doesn't yet touch on the -c parameters in nomulus tools; that will be
coming later (since that is an external manual touch-point, it will require a
lot more in the way of changes to various meta scripts and documentation).
* Change more client IDs
* Merge branch 'master' into clientid-to-registrarid
* Add indexes to DomainHistory sub tables
Add indexes to DomainTransactionRecord and DomainDsDataHistory to speed
up query for DomainHistory. Without these indexes, DomainHistory loading
is extremely slow: 2 QPS with current production data.
* Rename Spec11Pipeline's Subdomain -> DomainNameInfo
"Subdomain" never made any sense as a class name because these are all
second-level domain names, along with a little bit of metadata such as some
registrar info. "DomainNameInfo" is a better fit.
* Update references to RDAP RFCs
There were minor changes to the RDAP RFCs used -- we don't need to
change anything since we already comply with all of the changes, but we
should refer to the newer RFCs in the code.
* Add a temporary fix to Hibernate detach in query
Make all queries in RegistryQuery (exclusively used by BEAM) use
EntityManager.clear() to detach entities. This is a temporary measure
that unblocks work in BEAM. We will revert the work once
JpaTransactionManager can detach entities properly for all types of
queries.
Also fixed regression bugs that broke query result streaming:
- The code that sets query fetch size was not carried over from
QueryComposer.
- The new ReadOnlyCheckingTypedQuery class did not override parent's
getResultStream() method, which calls getList().
* Add the DNS refresh request time field to the Domain tables
This isn't used yet, but it will eventually be the replacement for the dns-pull
task queue once we get further in the migration.
* Merge branch 'master' into domain-dns-dirty
* Remove VKeyTranslatorFactory.createVKey(String)
This method serves the same function as VKey.fromWebsafeKey(), and isn't used
anywhere. Move the test for it into VKeyTest and use it to instead test
fromWebsafeKey() (which didn't previously have a test).
I'm not sure why this test is failing. It fails saying that the
listObjects call doesn't include
"soy_2000-01-01_thin_S1_R1.xml.ghostryde" in the results; however, the
verifyFiles method that we call right beforehand verifies that file and
its contents.
This includes a change to how the JPA transaction manager handles
existence and load checks for entities with compound IDs. Previously, we
relied on the fields all being named the same in the ID entity and the
parent entity. This didn't work for History objects (e.g. DomainHistory)
so existence checks were broken. Now, we use the methods the same way
that Hibernate does (if possible).
Note as well that there's a bit of semi-duplicated logic in
DeleteProberDataAction (between the mapper and the SQL logic). The
mapper code will be deleted once we've shifted to SQL, and for now it's
better to keep it in place for logging purposes.
This involves:
- Altering both transaction managers to check for a read-only mode at
the start of standard write actions (e.g. delete, put).
- Altering both raw layers (entity manager, ofy) to throw exceptions on
write actions as well
- Implementing bypass routes for reading / setting / removing the schedule itself
so that we don't get "stuck"
* Clean up ReplicateToDatastoreAction and tests
1. applyTransaction should throw an error if it fails; this allows us to
have more information in the caller (and it shouldn't usually happen)
2. Set a response code + payload now, since this is an action that is
called by cron
3. Add a method to the test log subject that allows us to check if a
severe log with a particular Throwable cause was logged (since the cause
isn't contained in the log message itself directly)
* Save indexes when replaying EppResources SQL->DS
We implement this similarly to how we implement the
beforeSqlSaveOnReplay callback in the other direction -- a
beforeDatastoreSaveOnReplay method that is called when replaying a
Mutation to Datastore. This means that the asynchronous replay will
create the relevant ForeignKeyIndex and EppResourceIndex objects for
EppResources saved when SQL is primary.
* Implement a util class to manage push queues using Cloud Tasks API
Push queues were part of App Engine when they debuted. As a result, the
Task Queue API was part of the App Engine SDK and can only be used in the
App Engine classic runtime. The new Cloud Tasks API can be used in any
runtime but it only supports push queues. In this PR we implement a util
class (CloudTasksUtils) like TaskQueueUtils to handle enqueuing tasks to
push queues using Cloud Tasks. One action (TldFanoutAction) was
converted to use the new API as a demo. Mass migration of other call sites of
the old API will follow in a separate PR.
TESTED=deployed to alpha and verified that tasks are correctly enqueued
and executed.
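A minimal sketch of enqueueing a push-queue task through the Cloud Tasks API (project, location, queue, and URI below are placeholders, not the real configuration):
```java
import com.google.cloud.tasks.v2.AppEngineHttpRequest;
import com.google.cloud.tasks.v2.CloudTasksClient;
import com.google.cloud.tasks.v2.HttpMethod;
import com.google.cloud.tasks.v2.QueueName;
import com.google.cloud.tasks.v2.Task;
import java.io.IOException;

class CloudTasksEnqueueSketch {

  // Enqueues a POST task targeting an App Engine handler; unlike the old Task
  // Queue API, this works from any runtime, not just App Engine classic.
  static void enqueue() throws IOException {
    try (CloudTasksClient client = CloudTasksClient.create()) {
      Task task =
          Task.newBuilder()
              .setAppEngineHttpRequest(
                  AppEngineHttpRequest.newBuilder()
                      .setHttpMethod(HttpMethod.POST)
                      .setRelativeUri("/_dr/cron/fanout")
                      .build())
              .build();
      client.createTask(QueueName.of("my-project", "us-central1", "my-queue"), task);
    }
  }
}
```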
Add double-replay to the Host*Flow tests to show how this works. The
only change to the double replay itself is that now we store the
Datastore entity in the TransactionEntity object -- this is because we
use Objectify to serialize the objects into bytes and we need it to know
about the entity in question.
* Add DS->SQL replay cron job to production
This won't do anything until we set the migration schedule to
DATASTORE_PRIMARY. Actions in order:
1. Add this cron job (it'll be a no-op)
2. Run the init-sql-pipeline to populate production's SQL DB
3. Set the SqlReplayCheckpoint to a time before the smear backup that
was used in step #1 (maybe 30 minutes)
4. Set the database migration schedule to transition to
DATASTORE_PRIMARY at some point
* Remove ReservedList from Datastore schema
* Remove some Datastore references
* Add a different non-replicated entity to ReplayCommitLogsToSqlActionTest
We shouldn't reference tm() at all before initializing the JPA
transaction manager, since tm() looks at the database migration schedule
when figuring out which transaction manager to use.
* Remove recursive load in DBMSS cache
This occurs because if we do a standard transaction, the JpaTxnManager
checks to see if we should be doing backups, which involves loading the
migration state schedule (causing the recursion). When starting the
transaction to load the schedule, we should explicitly use
transactWithoutBackup so there's no need to check.
This wasn't hit in tests because we previously manually set the
replication to not occur in the JpaTransactionManagerExtension -- we
remove that and related setters.
* Generate string to uniquely identify a SqlEntity
Add a method to SqlEntity that returns a string built from the entity's
primary key(s). This string can be used in logging.
* Add the domain DNS refresh request time field to the DB schema
This isn't used yet, but it will eventually be the replacement for the dns-pull
task queue once we get further in the migration.
* Remove index
* Store DatabaseMigrationSchedule in SQL instead of Datastore
This requires messing around with some of the JPA unit test rule
creation since it requires saving / retrieving the schedule pretty much
always (which itself includes the hstore extension).
This performs a direct load-by-key (the most efficient Datastore operation),
rather than attempting to load all entities by type using an ancestor query. The
existing implementation is possibly more error-prone as well, and might be
responsible for the "cross-group transaction need to be explicitly specified"
error we're seeing.
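For reference, a minimal sketch of the load-by-key pattern referred to above:
```java
import static com.googlecode.objectify.ObjectifyService.ofy;

import com.googlecode.objectify.Key;

class LoadByKeySketch {

  // A direct load-by-key is a single Datastore get, as opposed to an ancestor
  // query that scans every entity of the type.
  static <T> T loadByKey(Key<T> key) {
    return ofy().load().key(key).now();
  }
}
```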
* Remove PremiumList from Datastore schema
* Remove commented out code
* Change lastUpdateTime to creationTimestamp
* Remove extra file
* Remove currency unit from input data to parse
* Revert extra file
* Check currency in parse
* Create all PremiumEntries before saving them in bulk
* small fixes
* Fix merge conflict
There was a subtle issue that we encountered in sandbox when using one
transaction per file that was difficult to replicate. Basically,
1. Save a domain with dsData
2. Save the domain without dsData
3. Save the domain with the same dsData as step 1
4. Delete literally any object
If one performs steps 2-4 in the same transaction, Hibernate will throw
an exception (cascade re-saving a cascade-deleted object). Note that
step 4 is in fact necessary to reproduce the issue, yay Hibernate.
We will test this, and if one transaction per file is too slow,
we'll figure out ways to reduce the number of SQL transactions.
After the migration to the new GCS API it becomes apparent that the
BlobId.of() method needs to take the bucket name (without any trailing
directories) as the first argument. I did a search on all occurrences of
"BlobId.of" in the code base and verified that it is only in the ICANN
reporting job that the API was misused.
This PR re-implements most of the logic in the RdeStagingReducer, with
the exception of the last enqueue operations, due to the fact that the
task queue API is not available outside of App Engine SDK. This part
will come in a separate PR.
Another deviation from the reducer is that we forwent the lock -- it is
difficult to do it across different beam transforms. Instead we write each
report to a different folder according to its unique beam job name. When
enqueueing the publish tasks we will then pass the folder prefix as a
URL parameter.
This is instead of the current configuration parameter.
In addition, this adds some helpers to DatabaseHelper to make the
transitions easier, since we more frequently need to alter + reset the
schedule.
If we use the transaction manager methods, JpaTransactionManagerImpl
will attempt to detach the EppResource in question that we're loading --
this fails because that entity has been saved in the same transaction
already. We don't need detaching during these methods (it's just for
resource population) so we can use the raw loads to get around it.
* Remove KmsSecret model entities
Now that we have been using the SecretManager for almost a month,
remove the KmsSecret and KmsSecretRevision entities from the Java code base.
A follow-up PR will drop the relevant tables in the schema.
Also removed a few unused classes in the beam package.
* Fix runtime issues with commit-log-to-SQL replay
- We now use a more intelligent prefix to narrow the listObjects search
space in GCS. Otherwise, we're returning >30k objects which can take
roughly 50 seconds. This results in a listObjects time of 1-3 seconds.
- We now search hour by hour to efficiently make use of the prefixing.
Basically, we keep searching for new files until we hit the current time
or until we hit the overall replay timeout.
- Dry-run only prints out the first hour's worth of files
* Fix hanging threads in GcsDiffFileLister
Basically, whenever we request threads using the request thread factory,
we must be on the request thread itself. Dagger doesn't guarantee this
for us if we provide the ExecutorService directly in the action (or in
the GcsDiffFileLister), but we can guarantee that we're on the request
thread itself by simply injecting a Lazy, so that the executor is
instantiated inside the request itself.
In addition, add a timeout on the futures just in case.
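A simplified sketch of the Lazy injection described above (the surrounding class is abbreviated; only the Lazy pattern is the point here):
```java
import dagger.Lazy;
import java.util.concurrent.ExecutorService;
import javax.inject.Inject;

class LazyExecutorSketch {

  // Dagger defers creation until get() is called, so the request-scoped thread
  // factory is consulted on the request thread rather than at injection time.
  @Inject Lazy<ExecutorService> lazyExecutor;

  void listDiffFiles() {
    ExecutorService executor = lazyExecutor.get(); // created here, inside the request
    // ... submit the listObjects work to the executor and wait on the futures ...
  }
}
```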
* Use the DatabaseMigrationSchedule to determine which TM to use
We still allow the "manual" specification of a particular transaction
manager, most useful in @DualDatabaseTest classes. If that isn't
specified, we examine the migration schedule to see which to return.
Notes:
- This requires that any test that sets the migration schedule clean up
after itself so that it won't affect future test runs of other classes
(because the migration schedule cache is static)
- One alternative would, instead of having a "test override" for the
transaction manager, be to examine the registry environment and only
override the transaction manager in the UNIT_TEST environment. This
doesn't work because there are many instances in which tests simulate
non-test environments.
The API provided by the GAE SDK will not be available outside GAE
runtime. This presents a problem when we migrate off of GAE. More
pressingly, the RDE pipeline migration to Beam requires that we write to
GCS on GCE. Previously we were able to sidestep the issue by delegating
the writes to FileIO provided by Beam, which knows how to write to GCS.
However the RDE pipeline cannot use FileIO directly as it needs to write
to multiple files in one go and explicit use of GCS API is needed.
An unfortunate side effect of the API migration is that the new testing
library contains a bug which makes serializing GcsUtils impossible. It
is fixed upstream but not released yet. The fix has been backported for
the time being.
* Restore commit logs from other project
Allow non-production projects to restore commit logs from another
project. This feature can be used to duplicate a realistic testing
environment.
An optional parameter is added that can override the default commit log
location.
Tested successfully in QA.
* Remove old DomainList fields from Registry
I also resaved all Registry objects in sandbox and production to make sure that the new field is populated on all entity objects.
* small fixes
* Some more small fixes
* Delete commented out code
* Remove existence check in tests
* Ensure VKey is actually serializable
Tighten field type so that non-serializable object cannot be set as
sqlKey.
This would make it easier to make EppResource entities Serializable in
the future.
* Set payload response in happy path of ReplayCommitLogsToSqlAction
I suspect this may be the reason the logs are missing on the happy path (when it
runs successfully), but are visible on the exception paths (which do set the
payload response). I don't think App Engine likes it when a Web request
terminates without a response.
This also adds more logging and error handling.
This is the first part of the RdeStagingAction SQL migration where the
mapper logic is implemented in Beam.
A few helper methods are added to convert the DomainContent, HostBase
and ContactBase to their respective terminal child classes. This is
necessary and possible because the child classes do not have extra
fields and the base classes exist only to be embedded to other entities
(such as the various HistoryEntry entities). The conversion is necessary
because most of our code expects the terminal classes, such as the
RdeMarshaller's various marshallXXX() methods. The alternative would be
to change all the call sites, which seems to be much more disruptive.
Unfortunately there is no better way to do this conversion than just
creating a builder and setting every field there is.
* Support testing SQL -> DS replication in ReplayExt
Support testing of Postgres -> Datastore replication in the ReplayExtension
when running in SQL mode in a DualDatabaseTest.
This is currently only enabled for one test (HostInfoFlowTest) since this form
of replication is likely to be problematic in many cases.
As part of this change:
- Add a thread-local flag so that we don't attempt to do certain data
transformations when serializing entities for storage in a Transaction
record. (These typically need to be called in a datastore transaction).
- Replace tm() in datastore translators with ofyTm() (these should only be
called from within an ofy transaction) and also in the replay system itself.
- Add a transactWithoutBackup() method for use within the replay itself.
- Prevent replication of entities that are not intended to be replicated.
- Make some of the ReplicateToDatastoreAction methods public so we can invoke
them from ReplayExtension.
- Change the way that the test type is stored in the extension context in a
DualDatabaseTest so that we can check for it from the ReplayExtension.
* Limit number of tests and show output
Trying to debug why these are failing in kokoro.
* Move HostInfoFlowTest to fragile for now
The test now manipulates a global variable that causes problems for other
tests. There's likely a better fix for this, but for purposes of this PR we
can just move it to "fragile."
* Fix a few more problems
- "replay" flag should have been initialized to false -- as it stands,
replay wasn't happening.
- disable "always save with backup" in the datastore helper, we were
apparently getting some unwanted commit log entries that were causing
timestamp inversions in other tests. Also clear out the replay queue
just for good hygiene.
- Check for a null replicator in replayToOfy before proceeding.
- Use a local inOfyContext flag to track whether we're in ofy context, as
the tm() function is less reliable in dual-database tests.
* Set HistoryEntry modification time in FlowModule
Rather than having to set it individually to now (the current transaction time)
in every transactional flow, just do it once at the beginning when the
HistoryEntry.Builder is first being provided. This is also safer, as just doing
it in one place gives us stronger guarantees that it always corresponds to the
execution time of the flow, rather than leaving the potential open that in one
flow it's unintentionally set to the wrong thing.
* Rename a few soy files for consistency
This prefers the ResourceAction.soy naming convention for .soy files that
contain EPP XMLs so that they match the name of the corresponding EPP flow. E.g.
DomainDelete.soy now matches DomainDeleteFlow.java
* Fix BigDecimal precision of PremiumList.getLabelsToPrices()
Different currencies have different numbers of decimal places (e.g. USD has 2,
JPY has 0, and some even have 3). Thus, when loading the contents of a premium
list, we need to set the precision correctly on all of the BigDecimal prices.
This issue was introduced as part of the Registry 3.0 database migration when we
changed each PremiumEntry from being a Money to a BigDecimal (to remove the
redundancy of storing the same currency value over and over).
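A minimal sketch of the precision fix (method and class names are illustrative):
```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import org.joda.money.CurrencyUnit;

class PremiumPriceScaleSketch {

  // Give the price the currency's own number of decimal places, e.g. 2 for USD,
  // 0 for JPY, 3 for some others, so values round-trip with the right precision.
  static BigDecimal withCurrencyScale(BigDecimal rawPrice, CurrencyUnit currency) {
    return rawPrice.setScale(currency.getDecimalPlaces(), RoundingMode.HALF_EVEN);
  }
}
```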
* Add SQL functionality to DeleteLoadTestDataAction
This isn't directly meant to be run in production so some of the rough
edges (doesn't delete domains, can't delete contacts that are referenced
by an existing domain) are fine. We can handle those in
DeleteProberTestAction when we do the more comprehensive deletions.
* Make SQL queries return scrollable results
With Postgresql, we must override the default fetchSize (0) to enable
scrollable result sets. Previously we only did this in QueryComposer.
In this change we enable scrollable results for all queries by default.
We also provide a helper function
(JpaTransactionManager.setQueryFetchSize) that can override the default.
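A minimal sketch of what enabling scrollable results looks like at the JPA level (the entity name and query here are illustrative; the real code routes this through JpaTransactionManager):
```java
import java.util.stream.Stream;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

class FetchSizeSketch {

  // With PostgreSQL, a non-zero JDBC fetch size (inside a transaction) is what
  // lets the driver stream rows instead of buffering the whole result set.
  static Stream<String> streamDomainNames(EntityManager em, int fetchSize) {
    TypedQuery<String> query =
        em.createQuery("SELECT d.domainName FROM Domain d", String.class);
    query.setHint("org.hibernate.fetchSize", fetchSize); // Hibernate's fetch-size hint
    return query.getResultStream();
  }
}
```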
* Fix appId during cross-project commitlog imports
When importing commit logs from another project, we must override the
appId in every entity key instance.
The fixEntity method in the EntityImports class is a straightforward
translation of the python function of the same name used by the
storage team.
QueryComposer could be used when the transaction manager is not
determined (i. e. it supports both ofy and sql), but this also imposes
limits on what you can do with it. For example it does not support IN
operator in the where clause.
Since QueryComposer itself creates a CriteriaQuery for the JPA TM, it makes
sense to have RegistryJpaIO take a CriteriaQuery directly as it only
uses JPA.
Also add some more helper methods to use native queries and typed
queries, and fix some generic type warnings.
In SQL we do not need to wrap it in a Result. Unfortunately we cannot
overload a function based on its return value so we renamed the existing
one and created a new one with the old name that returns the resource
directly. Once we no longer have use of Datastore we can delete the now
renamed function that returns a Result<? extends EppResource>
* Add Cloud SQL read to Spec11Pipeline
* Add database option
* Add database parameter
* Add a test of the full pipeline
* Use DatabaseHelper in tests
* restore the original tm
* More test fixes
In testSuccess_expandSingleEvent_notIdempotentforDifferentRecurring(),
two Recurring entities are created with the only difference being their IDs. If
we don't order the Recurrings by ID when loading them there is no guarantee
which one is expanded first. In this test the expected OneTime entities are
created with the assumption that the first loaded DomainHistory (parent of a
OneTime) corresponds to expanding the Recurring with the smaller ID (2L).
Since the DomainHistory entities are loaded in order of IDs, and the IDs are
created monotonically in time in tests, we need to load the Recurrings in
order of their IDs to ensure that the first DomainHistory is the result of
expanding the Recurring with ID of 2L. This should impose a minimal performance
penalty as we are ordering by the primary key.
* Convert EppResourceUtils::loadAtPointInTime to SQL+DS
This required the following changes:
- The branching / conversion logic itself, where we load the most recent
history object for the resource in question (or just return the resource
itself)
- For simplicity's sake, adding a method in the *History objects that
returns the generic resource -- this means that it can be called when we
don't know or care which subclass it is.
- Populating the domain's dsData and gracePeriods fields from the
DomainHistory fields, and adding factories in the relevant classes to
allow us to do the conversions nicely (the history classes are almost
the same as the regular ones, but not quite).
- Change the tests to use the clocks properly and to allow comparison of
e.g. DomainContent to DomainBase. The objects aren't the same (one is a
superclass of the other) but the fields are.
Note as well a slight behavioral change: commit logs only allow us
24-hour granularity, so two updates in the same day mean that the
earlier update is ignored and inaccessible. This is not the case for
*History objects in SQL; all versions are accessible.
* Fix Datastore "count" queries
The objectify "count()" method doesn't work for result sets larger than 1000
elements, use the original trick from "count domains" that fetches the keys
and counts them.
* Added an SO link
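A sketch of the keys-only counting trick used in the count-queries fix above:
```java
import static com.googlecode.objectify.ObjectifyService.ofy;

class CountByKeysSketch {

  // Objectify's count() caps out around 1000 results, so fetch just the keys
  // (cheap compared to full entities) and count those instead.
  static <T> int countAll(Class<T> kind) {
    return ofy().load().type(kind).keys().list().size();
  }
}
```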
* Start the DS->SQL replay in non-prod environments
This should be a no-op since we haven't enabled it but this means that
when we set the schedule, we'll start replaying
* Fix a test flake in SetDatabaseMigrationScheduleCommandTest
The cache is static so some odd state may stick around between tests --
we should clear it
* Use DB migration state to determine running async replay SQL->DS
The SQL->DS replay likely could use more work (locking, returning the
right codes, things like that) but that's outside the scope of this PR.
* Make all criteria queries use jpaTm().query()
This causes all criteria queries to detach-on-load.
* Detach results of criteria queries
Wrap the criteria queries in DetachingTypedQuery now that the latter is
merged.
* Fix copy causing premature hash calculation
The creation of a builder to set the DomainContent repo id in DomainHistory
triggers an equality check which causes the hash code of an associated
transfer data object to be calculated prematurely, before the Ofy keys are
reconstituted. Replace this with a simple setter, which is acceptable in this
case because the object is being loaded and is considered to be not fully
constructed yet.
* Do setRepoId() in Contact and Host history
Not essential for these as far as we know, but it's safer and more consistent.
* Fixed typos
There is some complication regarding how the
CancellationMatchingBillingEvent of the generated OneTime can be
reconstructed when loading from SQL. I decided to only address it in
testing as there is no real value to fully reconstruct this VKey in
production where we are either in SQL or Ofy mode, but never in both.
Therefore the VKey in a particular mode only needs to contain the
corresponding key in order to function.
This includes the following changes:
- Convert the single-valued database migration state to a timed
transition property, meaning that we can switch all instances over at
the same time and schedule it in advance
- Use a "cache" (technically an expiring memoized supplier) when
retrieving the database migration state value
- Delete the old DatabaseTransitionSchedule because it is no longer
necessary. We took the idea from that and used it for the new
DatabaseMigrationStateSchedule, though we cannot reuse the entity itself
because the structure is fundamentally different.
- Removed references to the DatabaseTransitionSchedule, mainly in the
getter/setter commands+tests and a few odd references elsewhere.
There are cases where periodYears is not set when creating a OneTime
billing event, for example when performing a registry lock (default cost = $0)
or when performing a server status update, such as applying the
serverUpdateProhibited status (default cost = $20). This is not currently
handled in the billing pipeline because the parseFromRecord
method checks for nullness for all fields. Even if it does not validate
the fields, the null periodYears will still cause problems when the
billing event is converted to CSV files.
This PR alters the BigQuery SQL file to convert a NULL to 0 when
creating the BillingEvent in the invoicing pipeline. It also sets the EndDate
in the invoice CSV to an empty string when periodYears is 0. Note that when the
cost is also 0, the billing event is filtered out in the invoice CSV so only
the non-free OneTime with null periodYear will have an impact on the output.
For detailed reports all billing events are included and the zero
periodYears is printed as is.
Setting the EndDate to empty is the correct behavior per
go/manual-integration-csv#end-date.
* Use SecretManager for nomulus-tool-cloudbuild cred
Store cloudbuild's nomulus-tool credential in SecretManager and make the
deployment pipeline load it from the SecretManager.
The tool-credential.json.enc file in the
gs://domain-registry-dev-deploy/secrets folder is no longer needed.
* Make keyring use SecretManager as sole storage
The Keyring will only use the SecretManager as storage. Accesses to the
Datastore are removed.
Also consolidated KmsKeyringTest into KmsKeyingUpdaterTest. The latter
is left with its original name to facilitate code reviews. It will be
renamed in planned cleanups.
Additional cleanup is left for a future PR. These include:
- Remove KmsConnection and its associated injection modules
- Remove KmsSecretRevision from SQL schema and code
- Rename relevant files to more appropriate names.
* Added TransformingTypedQuery class
Added class to wrap TypedQuery so that we can detach all objects on load.
* Don't detach non-entity results; complete tests
* Changes for review
* Make non-static and call detach directly
* Remove labels from output of list_premium_lists
Remove the ability to show all of the labels associated with a premium list in
the list_premium_lists command. Supporting this requires loading the entire
contents of all premium lists from the database as opposed to just the list
records, and the information can be obtained using get_premium_list.
* Always detach entities during load
The mutations on non-transient fields that we do in some of the PostLoad
methods have been causing the objects to be marked as "dirty", and hibernate
has been quietly persisting them during transaction commit.
By detaching the entities on load, we avoid any possibility of this, which
works in our case because we treat all of our model objects as immutable
during normal use.
There is another mixed blessing to this: lazy loading won't work on these
objects once they are detached from a session, meaning that all lazy-loaded
fields must be loaded up front. This is unfortunate in that we don't always
need those lazy-loaded fields and there is a performance cost to loading them,
but it is also useful in that objects will now be complete when used outside of the
transaction that loaded them (prior to this, an attempt to access a
lazy-loaded field after its transaction closed would have caused an error at
runtime).
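A minimal sketch of the detach-on-load behavior described above, in plain JPA terms (the real implementation lives inside the transaction manager):
```java
import javax.persistence.EntityManager;

class DetachOnLoadSketch {

  // Load, then immediately detach, so later in-place fixups on the returned
  // object can never be flushed back to the database at commit time.
  static <T> T loadDetached(EntityManager em, Class<T> clazz, Object id) {
    T entity = em.find(clazz, id);
    if (entity != null) {
      em.detach(entity);
    }
    return entity;
  }
}
```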
* Changes requested in review
* A few improvements to test logic
* Deal with premature detachment of mutated objects
* Add unit tests, use a more specific exception
* Changes for review
- Deal with DomainDeleteFlow, which appears to be the only case in the
codebase where we're doing a load-after-save.
- Display the object that is being loaded after save in the exception message.
- Add a TODO for figuring out why Eager loads aren't working as expected.
* Move the recurring billing event into a parameter
* Changes for review and rebase error fix
* Remove initialization of list entries
Remove initialization of list entries that we want to be lazy loaded (premium,
reserved, and claims lists).
* Post-rebase cleanups
* Safely lazy load claims and reserved lists
This moves the entries of all of these lists into "insignificant" fields and
manages them explicitly.
* Additional fixes
Fix a few problems that came up in the merge or weren't caught in earlier
local test runs.
* Changes for review
- removed debug code
- added comments
- improved some methods that were loading the entire claims list
unnecessarily.
* Fixed javadoc links
* Reformatted
* Minor fix for review
This is useful when we expect a specific subtype in the return value so
that we can set the parent resource (e. g. DomainContent for
DomainHistory) on it, or when a specific subtype is needed from the call
site.
This PR also fixes some use of generic return values. It is always better to
return <HistoryEntry> than a wildcard <? extends HistoryEntry>, because for
immutable collections, <? extends HistoryEntry> is no different than
<HistoryEntry> as return value -- you can only get a HistoryEntry from it.
The wildcard return value means that even if you are indeed getting a
<DomainHistory> from the query, the call site has no compile time knowledge of
it and can only assume it is a <HistoryEntry>.
* Add an object to store database migration stages
go/registry-3.0-stage-management for more details
This basically boils down to storing an enum in the database so that we
can tell what stage of the migration we're in.
We use a cross-TLD parent so that we can have strong transactional
consistency on retrieval.
HistoryEntry is used to record all histories (contact, domain, host) in
Datastore. In SQL it is now split into three subclasses (and thus
tables): ContactHistory, DomainHistory and HostHistory. Its builder is
genericized as a result which led to a lot of compiler warnings for the
use of a raw HistoryEntry in the existing code base.
This PR cleans things up by replacing all the explicit use of
raw HistoryEntry with the corresponding subclass and also adds some
guardrails to prevent the use of raw HistoryEntry accidentally.
Note that because DomainHistory includes nsHosts and gracePeriodHistory,
both of which are assigned a roid from ofy when built, the assigned roids for
resources after history entries are built are incremented compared to
when only HistoryEntrys are built (before this PR) in
RdapDomainSearchActionTest.
Also added a convenient tm().updateAll() varargs method.
* Make PremiumList.labelsToPrices "insignificant"
Add the ImmutableObject.Insignificant annotation to labelsToPrices and also
mark it as Transient. In order to do lazy-loads on this field, we need to do
so explicitly: doing otherwise breaks the immutability contract and prevents
detaching the object upon load.
Note that this is an expedient solution to this problem, but not the optimal
one. Ideally, the disassociation between PremiumList and its PremiumEntry's
would be more explicit. However, breaking labelsToPrices out would at minimum
require reworking the Create/UpdatePremiumList commands, which currently rely
on passing around a self-contained PremiumList object, both from the parser
interfaces and to the database.
If this approach is acceptable, we can apply it to ReservedList and ClaimsList
as well (though it may be easier to break the association in those cases).
* Fix premium list "delete" to support a test
* Fix a few more tests
* Changes for review (updated javadocs)
* Minor fixes
* Updated getLabelsToPrices() comment
* Format fixes, fixed PremiumEntry interfaces
PremiumEntry can now be SQL only.
* Add loadOnlyOf method to tm()
In addition, there's a bit of a refactor of SqlReplayCheckpoint to make it
more in line with the other singletons. This method is useful for the
singleton classes where we expect at most one entity to exist, e.g.
ServerSecret.
* Convert most poll message queries to QueryComposer
* Add unit test and a better exception for datastore
* Remove datastorePollMessageQuery from PollFlowUtils
* Reformatted.
* Improved test equality checks
* Changes for review
* Converted concatenated string to String.format()
* Update GCL dependency to avoid security alert
This required a few changes in addition to the dependency update.
- a few transitive / required dependency updates as well
- updating soyutils_usegoog.js and adding checks.js because they're
necessary as part of the Soy compilation process
- Using a trustedResourceUri in the buildSrc Soy compilation instead of
a string
- changing the arguments to the Soy-to-Java compiler to comply with the
new version
- Moving all Soy UI files to be in the registrar directory. This was
not the case before due to previous thinking that we'd have separate
admin and registrar consoles -- this is no longer the case so it's no
longer necessary. This necessitated various refactorings and reference
changes.
- The new soy-to-javascript compiler requires this, as it removes the
"deps" param that we were previously using to say "use the general UI
utils as dependencies for the registrar-console files".
- Creating a SQL environment and loading test data in the test server
main method -- previously, the local test server did not work.
- Fix some JS code that was referencing now-deleted library functions
- Removal of the Karma tests, as the karma-closure library hasn't been
updated since 2018 and it no longer works. We never noticed any errors
from the Karma tests, we never change the JS, and we have the
Java+Selenium screenshot differ tests to test the UI anyway.
* Add sanity checks to history entry construction
* Add more missing setClientId() calls and delete scrap tool
* Merge branch 'master' into synthetic-requestedby
* Set more client IDs
* Merge branch 'master' into synthetic-requestedby
* Fix RegistryJpaIO.Read problem with large data
The read connector needs to detach loaded entities. This
is now the default behavior in QueryComposer
Also removed the 'transaction mode' property from the Read connector.
There are no obvious use cases for non-transaction query, and
implementation is not straightforward with the current code base.
Also changed the return type of QueryComposer.list() to ImmutableList.
* Clean up some SqlEntity classes
This started as having a better check for when to run the
ReplayCommitLogsToSqlAction but that'll require a bit more thought, and
this is a fairly simple PR that can be split out.
* Add mapreduce action to create synthetic history entries
RDE and zone file generation require being able to tell what objects
looked like in the past (though not beyond 30 days, or whatever the
Datastore retention period is set to). In Datastore, to answer this we
look at commit logs, and in SQL we will look at the History objects
stored for each EPP resource. This action can be run once while in
Datastore-primary-SQL-secondary to make sure that every EPP resource has
at least one history entry for which the resource-at-this-time field is
filled out in the SQL world.
* Fix the JPA Read connector for large data
Allow result set streaming by setting the fetchSize on JDBC statements.
Many JDBC drivers buffer the entire result set by default, causing
delays in returning the first result and/or out-of-memory errors.
Also fixed an entity instantiation problem exposed in production runs.
Lastly, removed incorrect comments.
* Restore a fix for flaky test
Restore a speculative fix for the flakiness in
DeleteExpiredDomainsActionTest. Although we identified a bug and fixed
it in a previous commit, it may not be the only bug. The removed fix may
still be necessary.
* Combine the two Lock classes into one class
This allows us to remove the DAO and to just treat locks the same as we
would treat any other object -- generically grabbing them from the
transaction manager.
We do not need to be concerned about the changeover between Datastore
and SQL because we assume that any such changeover will require
sufficient downtime that any currently-valid acquired locks will expire
during the downtime. Otherwise, we could get into a situation where an
action has acquired a particular lock in Datastore but not SQL.
* Fix timestamp inversion bug
Set the number of commitLog buckets to 1 in CommitLog replay tests to
expose all timestamp inversion problems due to replay. Fixed
PollAckFlowTest which is related to this problem.
Also fixed a few tests that failed to advance the fake clock when they
should, using the following approaches:
- If DatabaseHelper used but clock is not injected, do it. This
allows us to remove some unnecessary manual clock advances.
- Manually advance the clock where convenient.
- Enable clock autoIncrement mode when calling production classes that
perform multiple transactions.
We should consider making 1-bucket the default setting for tests. This
is left to another PR.
* Add an auditedOfy marker method for allow-listed ofy() calls
This will allow us to make sure that every usage of ofy() has been
hand-examined and specifically allowed.
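For illustration, such a marker can be a simple pass-through; this sketch is
not the exact Nomulus implementation:
```
// Sketch only: identical behavior to ofy(), but the distinct name marks the
// call site as hand-examined and lets tooling flag any remaining raw ofy() use.
public static Ofy auditedOfy() {
  return ofy();
}
```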
In SQL the contact of a domain is an indexed field and therefore we can
find linked domains synchronously, without the need for MapReduce.
The delete logic is mostly lifted from DeleteContactsAndHostsAction, but
because everything happens in a transaction we do not need to recheck a
lot of the preconditions that were necessary to ensure that the async
delete request still meets the conditions that held when the request was
enqueued.
* Populate the contact in ContactHistory objects created in Contact flows
Minimal interesting changes here
- a bit of reconstruction in ContactHistory to get the repo ID from the
key
- making the History revision ID Long instead of long so that it can be
null in non-built intermediate entities
- adding a copyFrom(HistoryEntry.Builder) method in HistoryEntry.Builder
so that we don't need to allocate quite as many unnecessary IDs, i.e.
removing the .build() lines in provideContactHistory and
provideDomainHistory
* Add a BEAM read connector for JPA entities
Added a Read connector to load JPA entities from Cloud SQL.
Also attempted a fix to the null threadfactory problem.
There has been a case where the CI was broken on Friday, no one
noticed or fixed it, and an RC build was built with broken tests.
The tests were disabled due to unknown test failures that have since
been fixed.
Also update the machine type used by GCB to be more powerful. This is
necessary for the tests to pass because N1_HIGHCPU_8 is RAM constrained
and the tests crash. I updated all jobs to use the new type, which
hopefully will make the build faster as well.
This also forces the karma test to use the Gradle-installed version of
node instead of the global version. The global version installed on the
Kokoro machines is too old to function with some of the newer libraries.
Unfortunately, much of the time there's a bit of a circular dependency
in the object creation, e.g. the Domain object stores references to the
billing events which store references to the history object which
contains the Domain object. As a result, we allocate the history
object's ID before creating it, so that it can be referenced in the
other objects that store that reference, e.g. billing events.
In addition, we add a utility copyFrom method in HistoryEntry.Builder to
avoid unnecessary ID allocations.
* Remove unnecessary MockitoExtension from Spec11PipelineTest
This is kind of a shot in the dark here, but this is one of the obvious
differences between this test class (which frequently experiences flakes) and
the other pipeline test classes which do not.
It's also possible we were getting the wrong runner if the test framework was
incorrectly detecting an App Engine runtime environment, so I added an assert
that will make it very clear if this is the cause of any failures.
* Handle bad production data when migrating to SQL
Ignore or fix bad entities when populating SQL with production data in
Datastore. These are mostly inconsistent foreign keys.
See b/185954992 for details.
In tests we use a TestPipelineExtension which does some static
initialization that should not be repeated in the same JVM. In our
XXXPipeline classes we save the pipeline as a field and usually write lambdas
that are passed to the pipeline. Because lambdas are effectively anonymous inner
classes, they are bound to their enclosing instances. When they get serialized
during pipeline execution, their enclosing instances are serialized too. This might result
in undefined behavior when multiple lambdas in the same XXXPipeline are used
on the same JVM (such as in tests), where the static initialization may be done
multiple times if different class loaders are used. This is very
unlikely to happen, but as a best practice we still remove them as
fields.
* Improve usability of WipeOutCloudSqlAction
Replace the "drop owned" statement with ones that drops only tables and
sequences. The former statement also drops default grants for the
nomulus user, which must be restored before the database can be used by
the nomulus server and tools.
* Convert GenerateLordnCommand to tm
This makes use of QueryComposer and adds a `list()` method to it.
Since there was no test for GenerateLordnCommand, this also implements one.
* Changes requested in review
* Add test for list queries
* Stream domains instead of listing them
* Reformatted
* Make nom_build not check for ".git" directory
nom_build tries to verify that it is in the root of the tree prior to doing
anything; however, checking for a .git directory doesn't work in a merged
directory.
* Minor formatting fix to attempt to force rebuild
These tests are flaky due to some kind of contention/collision on the mock task
queue. Retrying seems to fix the vast majority of flakes, is easy to implement,
and is more performant than moving these tests into the fragileTests test suite.
Note that there are many flow tests that aren't
@DualDatabaseTest-annotated yet but those will come later, as they will
require more changes to the flows (other PRs are coming or in progress).
This only includes the remaining EppResource flows that don't create a
history entry.
Nothing super crazy here other than persisting the entity changes in
DomainDeleteFlow at the end of the flow rather than almost at the end.
This means that when we return the results we give the results as they
were originally present, rather than the subsequently-changed values.
* Fix Spec11 domain check
We should be checking to see if there are _any_ active domains for a given
reported domain, not to see if _the_ domain for the name is active.
The last change caused an exception for domains with soft-deleted past domains
of the same name. The original code only checked the first domain returned
from the query, which may have been soft-deleted. This version checks all
domain records to see if any are active.
* filter().count() -> anyMatch()
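For reference, the shape of that change (variable and helper names here are
illustrative, not the actual code):
```
// Before: materializes a full count just to test whether any domain is active.
boolean anyActiveBefore =
    domains.stream().filter(domain -> isActive(domain, now)).count() > 0;

// After: short-circuits as soon as one active domain is found.
boolean anyActiveAfter =
    domains.stream().anyMatch(domain -> isActive(domain, now));
```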
* Begin saving the EppResource parent in *History objects
We use DomainCreateFlow as an example here of how this will work. There
were a few changes necessary:
- various changes around GracePeriod / GracePeriodHistory so that we can
actually store them without throwing NPEs
- Creating one injectable *History.Builder field and using in place of
the HistoryEntry.Builder injected field in DomainCreateFlow
- Saving the EppResource as the parent in the *History.Builder setParent
calls
- Converting to/from HistoryEntry/*History classes in
DatastoreTransactionManager. Basically, we'll want to return the
*History subclasses (and similar in the ofy portions of HistoryEntryDao)
- Converting a few HistoryEntry.Builder usages to DomainHistory.Builder
usages. Eventually we should convert all of them.
This is similar to the migration of the spec11 pipeline in #1073. Also removed
a few Dagger providers that are no longer needed.
TESTED=tested the dataflow job on alpha.
* Defer all foreign keys in SQL
The main difference here is that the constraint violation exceptions
won't be thrown until the transaction is completed, rather than when the
insert is first performed within the transaction. We get the same error
message either way. The primary benefit to this is that when dealing
with large operations inside a single transaction (flows), we don't need
to worry about the order of insertions of removals with regards to
foreign keys.
* Migrate Spec11 pipeline to flex template
Unfortunately this PR has turned out to be much bigger than I initially
conceived. However, there is no good way to separate it out because the
changes are intertwined. This PR includes 3 main changes:
1. Change the spec11 pipeline to use Dataflow Flex Template.
2. Retire the use of the old JPA layer that relies on credentials saved
in KMS.
3. Some extensive refactoring to streamline the logic and improve test
isolation.
* Fix job name and remove projectId from options
* Add parameter logs
* Set RegistryEnvironment
* Remove logging and modify safe browsing API key regex
* Rename a test method and rebase
* Remove unused Junit extension
* Specify job region
* Send an immediate poll message for superuser domain deletes
This poll message is in addition to the normal poll message that is sent when
the domain's deletion is effective (typically 35 days later). It's needed
because, in the event of a superuser deletion, the owning registrar won't
otherwise necessarily know it's happening.
Note that, in the case of a --immediate superuser deletion, the normal poll
message is already being sent immediately, so this additional poll message is
not necessary.
* Update various tests to work with SQL as well
The main weird bit here is adding a method in DatabaseHelper to
retrieve and initialize all objects in either database. The
initialization is necessary since it's used post-command-dry-run to make
sure that no changes were actually made.
* Convert CountDomainsCommand to tm
As part of this, implement "select count(*)" queries in the QueryComposer.
* Replaced kludgy trick for objectify count
* Modify ClaimsList DAO to always use Cloud SQL as primary
* Revert ClaimsList add changes to SignedMarkRevocationList
* Fix flow tests
* Use start of time for empty list
* replace lambda with method reference
* Upload latest version of RDE report to icann
Currently the RdeReportAction is hard coded to load the initial version
of a report. This is wrong when reports have been regenerated.
Changed lines are copied from RdeUploadAction.
* Implement query abstraction
Implement a query abstraction layer ("QueryComposer") that allows us to
construct fluent-style queries that work across both Objectify and JPA.
As a demonstration of the concept, convert Spec11EmailUtils and its test to
use the new API (a usage sketch follows the limitations list below).
Limitations:
- The primary limitations of this system are imposed by datastore, for
example all queryable fields must be indexed, orderBy must coincide with
the order of any inequality queries, inequality filters are limited to one
property...
- JPA queries are limited to a set of where clauses (all of which must match)
and an "order by" clause. Joins, functions, complex where logic and
multi-table queries are simply not allowed.
- Descending sort order is currently unsupported (this is simple enough to
add).
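A hedged usage sketch of the fluent style described above; the method and enum
names approximate the description and may differ from the final API:
```
// Sketch only: builds a query that works against either Objectify or JPA.
ImmutableList<Registrar> activeRegistrars =
    tm().createQueryComposer(Registrar.class)
        .where("state", Comparator.EQ, Registrar.State.ACTIVE)
        .orderBy("registrarName")
        .list();
```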
* Fix bug that was incorrectly assuming Cursor would always exist
In fact, the Cursor entity does not always exist (i.e. if an upload has never
previously been done on this TLD because it's a new TLD), and the code needs to be
resilient to its non-existence.
This bug was introduced in #1044.
* Use lazy injection in SendEscrow command
The injected object in SendEscrowReportToIcannCommand creates Ofy keys
in its static initialization routine. This happens before the RemoteApi
setup. Use lazy injection to prevent failure.
* Specify explicit ofyTm usage in SetDatabaseTransitionScheduleCommand
We cannot use the standard MutatingCommand because the DB schedule is
explicitly always stored in Datastore, and once we transition to
SQL-as-primary, MutatingCommand will stage the entity changes to SQL.
In addition, we remove the raw ofy() call from the test.
* Migrate Keyring secrets to Secret Manager
Implemented dual-read of Keyring secrets with Datastore as primary.
Implemented dual-write of Keyring secrets with Datastore as primary.
Secret Manager write failures are simply thrown. This is fine since all
keyring writes are manual, through the update_kms_keyring command.
Added a one-way migration command that copies all data to secret manager
(unencrypted).
* Upgrade testcontainers to work around a race
testcontainers 1.15.? has a race condition that occasionally causes deadlocks.
This can be worked around by upgrading to 1.15.2 and setting the transport type to
http5.
See https://github.com/testcontainers/testcontainers-java/issues/3531
for more information.
There are two changes that are not lockfiles:
- dependencies.gradle
- java_common.gradle
* Convert TmchCrl and ServerSecret to cleaner tm() impls
When I implemented this originally I knew a lot less than I know now
about how we'll be storing and retrieving these singletons from SQL. The
optimal way here is to use the single SINGLETON_ID as the primary key,
that way we always know how to create the key that we can use in the
tm() retrieval.
This allows us to use generic tm() methods and to remove the handcrafted
SQL queries.
* Enforce consistency in non-cached FKI loads
For the cached code path, we do not require consistency but we do
require the ability to load / operate on large numbers of entities (so,
we must do so without a Datastore transaction). For the non-cached code
path, we require consistency but do not care about large numbers of
entities, so we must remain in the transaction that we're already in.
There is no need for this test now because we've passed the enforcement
date. We should take out the entire enforcement-date logic, but right now
this test is failing because TLS 1.1 is no longer supported by
the latest release of JDK 11.
The other test is a bit tricky to fix, see comment.
Disable these tests for now to unblock development.
* Add a beforeSqlSave callback to ReplaySpecializer
When in the Datastore-primary and SQL-secondary stage, we will want to
save the EppResource-at-this-point-in-time field in the *History
objects so that later on we can examine the *History objects to see what
the resource looked like at that point in time.
Without this PR, the full object at that point in time would be lost
during the asynchronous replay since Datastore doesn't know about it.
In addition, we modify the HistoryEntry weight / priority so that
additions to it come after the additions to the resource off of which it
is based. As a result, we need to DEFER some foreign keys so that we can
write the billing / poll message objects before the history object that
they're referencing.
* Partially convert EppResourceUtils to SQL
Some of the rest will depend on b/184578521.
The primary conversion in this PR is the change in
NameserverLookupByIpCommand as that is the only place where the removed
EppResourceUtils method was called. We also convert to DualDatabaseTest
the tests of the callers of NLBIC. and use a CriteriaQueryBuilder in the
foreign key index SQL lookup (allowing us to avoid the String.format
call).
* Add -r when rsync a release to the live folder
Release folders now are no longer flat. Each of them has a 'beam'
subfolder with pipeline metadata files.
* Remove SQL credentials from Keyring
Remove SQL credentials from Keyring. SQL credentials will be managed by
an automated system (go/dr-sql-security) and the keyring is no longer a
suitable place to hold them.
Also stopped loading SQL credentials from the keyring for comparison
with those from the Secret Manager.
* Convert RefreshDnsOnHostRenameAction to tm
This is not quite complete because it also requires the conversion of a
map-reduce, which is in scope for an entirely separate piece of work. Tests of the
map-reduce functionality are excluded from the SQL run.
This also requires the following additional fixes:
- Convert Lock to tm, as doing so was necessary to get this action to work.
As Lock is being targeted as DatastoreOnly, we convert all calls in it to
use ofyTm()
- Fix a bug in DualDatabaseTest (the check for an AppEngineExtension field is
wrong, and captures fields of type Object as AppEngineExtension's)
- Introduce another VKey.from() method that creates a VKey from a stringified
Ofy Key.
* Rename VKey.from(String) to fromWebsafeKey
* Throw NoSuchElementException instead of NPE
* Correctly get the primary database value in PremiumListDualDao
* Remove extra AppEngineExtension
* get rid of ofy call
* Remove extra duration skip in test
* Convert poll-message-related classes to use SQL as well
Two relatively complex parts. The first is that we needed a small
refactor on the AckPollMessagesCommand because we could theoretically be
acking more poll messages than the Datastore transaction size boundary.
This means that the normal flow of "gather the poll messages from the DB
into one collection, then act on it" needs to be changed to a more
functional flow.
The second is that acking the poll message (deleting it in most cases)
reduces the number of remaining poll messages in SQL but not in
Datastore, since in Datastore the deletion does not take effect until
after the transaction is over.
* Fix some low-hanging code quality issue fruits
These include problems such as: use of raw types, unnecessary throw clauses,
unused variables, and more.
* Convert ofy -> tm for two more classes
Convert ofy -> tm for MutatingCommand and DedupeOneTimeBillingEventIdsCommand.
Note that DedupeOneTimeBillingEventIdsCommand will not be needed after
migration, so this conversion is just to remove the ofy uses from the
codebase. We don't update the test (other than to keep it working) and it
wouldn't currently work in SQL.
* Fixed a test broken by this PR
In addition, we move the deleteTestDomain method to DatabaseHelper since
it'll be useful in other places (e.g. RelockDomainActionTest) and remove
the duplicate definition of ResaveEntityAction.PATH.
We also can ignore deletions of non-persisted entities in the JPA
transaction manager.
* Update RegistrarSettingsAction and RegistrarContact to SQL calls
Relevant potentially-unclear changes:
- Making sure the last update time is always correct and up to date in
the auto timestamp object
- Reloading the domain upon return when updating in a new transaction to
make sure that we use the properly-updated last update time (SQL returns
the correct result if retrieved within the same txn but DS does not)
* Convert DomainTAF and DomainFlowUtils to SQL
The only tricky part to this is that the order of entities that we're
saving during the DomainTransferApproveFlow matters -- some entities
have dependencies on others so we need to save the latter first. We
change `entitiesToSave` to be a list to reinforce this.
Various necessary changes included as part of this:
- Make ForeignKeyIndex completely generic. Previously, only the load()
method that took a DateTime as input could use SQL, and the cached flow
was particular to Objectify Keys. Now, the cached flow and the
non-cached flow can use the same (ish) piece of code to load / create
the relevant index objects before filtering or modifying them as
necessary.
- EntityChanges should use VKeys
- FlowUtils should persist entity changes using tm(), however not all
object types are storable in SQL.
- Filling out PollMessage fields with the proper object type when
loading from SQL
- Changing a few tm() calls to ofyTm() calls when using objectify. This
is because creating a read-only transaction in SQL is quite a footgun at
the moment, because it makes the entire transaction you're in (if you
were already in one) a read-only transaction.
* Convert 3 classes from ofy -> tm
Convert SyncGroupMembersAction, SyncRegistrarsSheet and
IcannReportingUploadAction and their test cases to use TransactionManager and
dual-test them so we know they work in jpa.
* Address comments in review
Address review comments and make the entire IcannReportingUploadAction run
transactional.
* reformatted.
* Remove duplicate loadByKey() method
Remove test method added in a recent PR.
* Embed a ZonedDateTime as the UpdateAutoTimestamp in SQL
This means we can get rid of the converter and more importantly, means
that reading the object from SQL does not affect the last-read time (the
test added to UpdateAutoTimestampTest failed prior to the production
code change).
For now we keep both time fields in UpdateAutoTimestamp; however,
post-migration we can remove the joda-time field if we wish.
Note: I'm not sure why <now> is the time that we started getting
LazyInitializationExceptions in the LegacyHistoryObject and
ReplayExtension tests but we can solve that by just examining /
initializing the object within the transaction.
* Make TransactionManager.loadAllOf() smart w.r.t the cross-TLD entity group
The loadAllOf() method will now automatically append the cross-TLD entity group
ancestor query as necessary, iff the entity class being loaded is tagged with
the new @IsCrossTld annotation.
* Add tests
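For reference, a sketch of what such a marker annotation could look like; the
retention and target details are assumptions, only the name comes from the
description above:
```
// Sketch only: a runtime-retained marker that the Datastore transaction manager
// can check via clazz.isAnnotationPresent(IsCrossTld.class) before deciding
// whether to add the cross-TLD ancestor to the query.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface IsCrossTld {}
```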
* Add SQL wipeout action in QA
Added the WipeOutSqlAction that deletes all data in Cloud SQL.
Wipe out is restricted to the QA environment, which will get production
data during migration testing.
Also added a cron job that invokes wipeout every Saturday morning.
This is part of the privacy requirements for using production data in QA.
Tested in QA.
* Add some load convenience methods to DatabaseHelper
These can only be called by test code, and they automatically wrap the load
in a transaction if one isn't already specified (for convenience).
In production code we don't want to be able to use these, as we have to be
more thoughtful about transactions in production code (e.g. make sure that
we aren't loading and then saving a resource in separate transactions in a
way that makes it prone to contention errors).
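A hedged sketch of the transaction-wrapping convenience described above; the
method name and the tm() calls are approximations of the description:
```
// Sketch only: wraps the load in a transaction when the caller has not already
// started one -- convenient in tests, deliberately avoided in production code.
public static <T> T loadByEntityForTest(T entity) {
  return tm().inTransaction()
      ? tm().loadByEntity(entity)
      : tm().transact(() -> tm().loadByEntity(entity));
}
```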
* Add Gradle tasks to stage BEAM pipelines
Add a Gradle task to stage flex-template based pipelines for alpha and
crash environments.
This is a follow up to go/r3pr/1028, which is also under review.
Add the actual release tag and beam staging project id to the config
file. This allows the Nomulus server to find the right version of the
BEAM pipelines to launch.
* Disallow admin triggering of internal endpoints
Stop simply relying on the presence of the X-AppEngine-QueueName header as an
indicator that an endpoint has been triggered internally, as this allows
admins to trigger a remote execution vulnerability.
We now supplement this check by ensuring that there is no authenticated user.
Since only an admin user can set these headers, this means that the header
must have been set by an internal request.
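A sketch of the combined check; the field and method names here are
illustrative, not the exact auth code:
```
// Sketch only: the header can be attached by App Engine itself or by an admin
// user, so treat a request as internal only when the header is present AND no
// authenticated user is attached to the request.
boolean isInternalRequest =
    request.getHeader("X-AppEngine-QueueName") != null
        && !authResult.userAuthInfo().isPresent();
```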
Tested:
In addition to the new unit test, verified on Crash that:
- Internal requests are still getting authenticated via the internal auth
mechanism.
- Admin requests with the X-AppEngine-QueueName header are rejected as
"unauthorized."
* Reformatted.
* Pass --java-binary to _all_ formatter invocations
When implementing a flag to pass in the java binary to
google-java-format-diff.py, I missed the location in showNoncompliantFiles
which gets run before a check.
This change also refactors the core logic of the script so that
google-java-format-diff.py is only called from one place and (in all but one
case) only one time.
Tested:
Ran check format and show, with and without diffs present in the tree.
This has the same issue as the domain-search action where the database
ordering is not consistent between Objectify and SQL -- as a result,
there is one test that we have to duplicate in order to account for the
two sort orders.
In addition, there isn't a way to query @Convert-ed fields in Postgres
via the standard Hibernate / JPA query language, meaning we have to use
a raw Postgres query for that.
* Add a jpaTm().query(...) convenience method
This replaces the more ungainly jpaTm().getEntityManager().createQuery(...).
Note that this is in JpaTransactionManager, not the parent TransactionManager,
because this is not an operation that Datastore can support. Once we finish
migrating away from Datastore this won't matter anyway because
JpaTransactionManager will be merged into TransactionManager and then deleted.
In the process of writing this PR I discovered several other methods available
on the EntityManager that may merit their own convenience methods if we start
using them enough. The more commonly used ones will be addressed in subsequent
PRs. They are:
jpaTm().getEntityManager().getMetamodel().entity(...).getName()
jpaTm().getEntityManager().getCriteriaBuilder().createQuery(...)
jpaTm().getEntityManager().createNativeQuery(...)
jpaTm().getEntityManager().find(...)
This PR also addresses some existing callsites that were calling
getEntityManager() rather than using extant convenience methods, such as
jpa().insert(...).
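For illustration, the shape of the change at a typical callsite; the query
string and entity used here are hypothetical:
```
// Before: reach through to the EntityManager directly.
List<DomainBase> before =
    jpaTm().getEntityManager()
        .createQuery("FROM DomainBase WHERE tld = :tld", DomainBase.class)
        .setParameter("tld", "example")
        .getResultList();

// After: the convenience method described above (exact signature may differ).
List<DomainBase> after =
    jpaTm().query("FROM DomainBase WHERE tld = :tld", DomainBase.class)
        .setParameter("tld", "example")
        .getResultList();
```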
* Add replay to remaining (non-trivial) flow tests
Convert all remaining flow tests to do replay/compare testing. In the course
of this:
- Move the class specific SetClock extension into its own place.
- Fix another "cyclic" foreign key (there may be another solution in this case
because HostHistory is actually different from HistoryEntry, but that would
require changing the way we establish priority since HostHistory is not
distinguished from HistoryEntry in the current methodology)
* Attempt to fix flakey deleteExpiredDomain test
Though hard to reproduce locally, the test_deletesThreeDomainsInOneRun
test has failed multiple times on Kokoro. The root cause may be the
non-transactional query executed by the Action object, which was by
design. Observing that the other test never fails, this PR follows its behavior
and adds a transactional query before invoking the action.
* Allow nom_build to run in Cloudbuild
Our builder comes with python3.6 and cannot support nom_build out of
the box. nom_build requires dataclasses, which was introduced in Python 3.7.
I haven't found an easy way to get python3.7+ without changing the base
linux image. This PR explicitly installs dataclasses.
The dual DAO takes care of switching between databases, comparing the
results of one to the results of the other, and caching the result. All
calls that retrieve or store the ClaimsList should use the
dual-database DAO.
Previously, calls comparing the lists were somewhat scattered
throughout the codebase. Now, there is one class for retrieval and
comparison (the dual DAO), one class for retrieval from SQL (the SQL
DAO), and one class for retrieval from Datastore (ClaimsListShard
itself, though the retrieval could be moved in to a separate DAO if we
wished).
In addition, we rename the ClaimsListDao to ClaimsListSqlDao
* Update creation script for schema_deployer
Move the create user command for schema_deployer before the
initialization of roles. As the owner of all schema objects, it needs to
be present before grant statements are executed.
Also fixed a bug in credential printing, which fails when the password
contains '%'.
This allows us to get rid of the DAO as well as the sanity-checking
methods since we can be reasonably sure that the fields will be the
same. Future PRs will add conversions from ofy() to tm() calls that will
make sure that we get the same proper data in both Datastore and SQL.
* Convert more flow tests to replay/compare
Add the replay extension to another batch of flow tests. In the course of
this:
- Refactor out domain deletion code into DatabaseHelper so that it can be used
from multiple tests.
- Make null handling uniform for contact phone numbers.
* Convert postLoad method to onLoad.
* Remove "Test" import missed during rebase
* Deal with persistence of billing cancellations
Deal with the persistence of billing cancellations, which were added in the
master branch since before this PR was initially sent for review.
* Adding forgotten flyway file
* Removed debug variable
* Add schema_deployer SQL user to SecretManager
Add the 'schema_deployer' user to the SecretManager so that its
credential can be set up. The schema deployment process will use this
user instead of the 'postgres' user.
Changed the output of the get_sql_credential command for the schema
deployment process.
Added a sql script that documents the privileges granted to
'schema_deployer'.
* Clear autorenew end time when a domain is restored
This allows us to still see in the database which now-deleted domains had
reached expiration, while correctly not re-deleting the domain immediately if
the registrar pays to explicitly restore the domain.
This also resolves some TODOs around data migration for this field on domain so
that it's not null, as said migration has already been completed.
* Allow java-format to use java from the PATH
When invoking java from the google-java-format-git-diff.sh script, if there is
no JAVA_HOME environment variable, attempt to instead run the java binary that
is on the PATH.
This also adds a few checks to verify that a java binary is available in one
of those locations and that the version discovered is Java 11 (which we know
to be compatible with the google-java-format jar).
Tested:
- unset JAVA_HOME, verified that we get the version on the PATH
- Set JAVA_HOME to an invalid directory, verified that we get an error.
- Changed the "which" command to lookup an nonexistent binary, unset JAVA_HOME
and verified that we get a "java not found" error.
- Changed the path to point to an old version of java, verified that we get a
"bad java version" error.
- Verified that the script still runs normally.
* Remove grace period ID @OnLoads now that migration is complete
I verified in BigQuery that all grace period IDs are now allocated (as expected
given that the re-save all EPP resource mapreduce has been run several times
since this migration started last year). The query I used for verification is:
SELECT fullyQualifiedDomainName, gp, ot
FROM `domain-registry.latest_datastore_export.DomainBase`
JOIN UNNEST(gracePeriods.billingEventRecurring) AS gp
JOIN UNNEST(gracePeriods.billingEventOneTime) AS ot
WHERE gp.id IS NULL or ot.id IS NULL
BUG=169873747
* Add daily cron entries for DeleteExpiredDomainsAction
This also requires setting this action to GET instead of POST, as GAE cron makes
GET requests.
* Use shared jar to stage BEAM pipeline if possible
Allow multiple BEAM pipelines with the same classes and dependencies to
share one Uber jar.
Added metadata for BulkDeleteDatastorePipeline.
Updated shell and Cloud Build scripts to stage all pipelines in one
step.
* Add comments to Cloud SQL configs
I believe the similarity of this stack trace to https://github.com/brettwooldridge/HikariCP/issues/1212
is misleading.
The real cause of the exceptions may be that we ran out of connections. At the
time, the production Cloud SQL server could handle 500 connections at the
maximum. That number was within reach of a busy Nomulus server.
The maximum number of connections in production has been increased to 1000. We
haven't encountered this issue for a long time. All connection problems
are due to Cloud SQL maintenance or other GCP related issues.
This issue is tracked by b/154720215, which is being closed with this
PR.
* Convert DomainTransferRequestFlow to tm() calls
Besides the standard ofy-to-tm conversions this includes storing the
billing event cancellation VKey in the DomainTransferData object so that
we know to handle it on process / cancellation.
* Don't use --fork-point to determine merge base
It turns out that the --fork-point option is subtle and error-prone. Its
intent is not to show the nearest common base commit, but rather the commit
on a branch that the HEAD (in this case) was originally forked off of,
_whether it is currently part of the history of the specified branch or not_
(this can happen if the branch is rewritten). The option also relies on the
presence of the fork point in the reflog for the branch, which can be
discarded in the course of a "git gc".
It is fairly easy to construct a case where the use of --fork-point causes an
error and outputs nothing. In fact, I discovered the problem as a result of
this occurring spontaneously on one of my own branches (likely related to a
rebase). Since the fork-point is empty, we end up diffing against the index
instead of the common commit.
This may have been a factor in some of the unrelated reformatting that we've
seen in past PRs.
Change this to a simple "merge-base origin/master HEAD", which outputs the
commit id of the most recent common base revision.
This change also quotes the forkPoint variable, which likely would have
resulted in an error in this case instead of silently producing the wrong
output.
* Stage the init_sql_pipeline in CloudBuild
Defined metadata file and added Gradle uberJar task for the pipeline,
which are needed for staging.
Updated cloud build script to stage this pipeline during the build
process.
* Add TODOs regarding cloud sql database name change
We should choose a different database name for nomulus data since using
'postgres' is bad practice. See b/181693544 for more background.
We have decided to delay the db change to the time when we upgrade
postgresql version. This PR adds TODOs to all occurrences of the jdbcUrl
property, including those in the internal-repo. This property will change
when we upgrade, so the TODOs will be noticed.
* Use ReplaySpecializer to fix DomainBase replays
DomainBase currently has a number of ancillary objects that require a
cascading delete that doesn't get propagated. Implement beforeSqlDelete() in
DomainContent to delete these child entities.
* Remove unnecessary Query variable
* Fix rebase error
* Update more dependencies to newer versions
* Add lockfiles and back out 2 problematic dep updates
* Fix the build (backs out more changes)
* Back out qdox 2.0 too
* Print JAVA_HOME and PATH in the java format checker task
This allows us to identify whether the java version is incompatible with the
format checker jar file.
* Show which Java binary is used
* show java version
* Fix exeInBash arguments
* Print out env variables in java format
Print out JAVA_HOME and PATH variable in the google-java-format-diff.py script
immediately prior to running the underlying java program that does the actual
format checking.
* Use the java binary from JAVA_HOME for java-format
Use "$JAVA_HOME/bin/java" for invoking the java format check instead of
whatever version of java happens to be on the path.
* Removed unused import
* Rewrite the JPA output connector for BEAM
Following BEAM's IO connector style, added a RegistryJpaIO class to hold
IO connectors, and implemented the Write connector as a static inner
class in it. The JpaTransactionManager used by the Write connector
retrieves SQL credentials from the SecretManager.
Cleaned up SQL-related pipeline parameters.
Converted the InitSqlPipeline to use RegistryJpaIO.
* Add SQL queries to RdapDomainSearchAction
Unfortunately, because ORDER BY uses the locale's sorting functionality,
we end up with some weird sort orders in SQL-land (notably, periods are
ignored / omitted). As a result, a few of the tests have to be separated
out into ofy and SQL versions based on the expected sort order.
In addition, there isn't a way to query @Convert-ed fields in Postgres
via the standard Hibernate / JPA query language, meaning we have to use
a raw Postgres query for that.
* Added "show_upgrade_diffs" script
"show_upgrade_diffs" pulls a git directory and a user branch from nomulus and
compares all of the versions of all dependencies specified in all lockfiles in
the master branch with those of the user branch and prints a nice, terse
little colorized report on the differences.
This is useful for reviewing a dependency upgrade.
* Add license header
* Changes requested in review
* Changes for review
- Change format of output so different actions are displayed somewhat
consistently.
- Make specifying a directory optional, if not specified create a temporary
directory and clean it up afterwards.
* Add a "ReplaySpecializer" to fix certain replays
Due to the fact that a given entity in either database type can map to
multiple entities in the other database, there are certain replication
scenarios that don't quite work. Current known examples include:
- propagation of cascading deletes from datastore to SQL
- creation of datastore indexed entities for SQL entities (where indexes are a
first-class concept)
This change introduces a ReplaySpecializer class, which allows us to declare
static method hooks at the entity class level that define any special
operations that need to be performed before or after replaying a mutation for
any given entity type.
Currently, "before SQL delete" is the only supported hook. A change to
DomainContent demonstrating how this facility can be used to fix problems in
cascading delete propagation will be sent as a subsequent PR.
* Throw exception on beforeSqlDelete failures
* Changes for review
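A hedged sketch of the hook convention; the reflective lookup and the method
shape are assumptions based on the description above, not the exact Nomulus
code:
```
// Sketch only: an entity class opts in by declaring a static hook that
// ReplaySpecializer looks up by name and invokes before replaying a SQL delete.
public class DomainContent {
  @SuppressWarnings("unused") // invoked reflectively by ReplaySpecializer
  public static void beforeSqlDelete(VKey<?> key) {
    // Delete child rows (e.g. grace periods) that the SQL side would otherwise
    // orphan, since the Datastore-side delete does not cascade automatically.
  }
}
```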
* Convert DomainTransferRejectFlow to use tm() methods
This change includes a few other necessary dependencies to converting
DomainTransferRejectFlowTest to be a dual-database test. Namely:
- The basic "use tm() instead of ofy()" and "branching database
selection on what were previously raw ofy queries"
- Modification of the PollMessage convertVKey methods to do what they
say they do
- Filling the generic pending / response fields in PollMessage based on what type of
poll message it is (this has to be done because SQL is not very good at
storing ambiguous superclasses)
- Setting the generic pending / response fields in PollMessage upon
build
- Filling out the serverApproveEntities field in DomainTransferData with
all necessary poll messages / billing events that should be cancelled on
rejection
- Scattered changes in DatabaseHelper to make sure that we're saving and
loading entities correctly where we weren't before
* Disable whois caching in nomulus tool
The whois commands previously served output generated from cached EppResource
objects in most cases. While this is entirely appropriate from server-side,
it is less useful when these commands are run from nomulus tool and, in fact,
when run from the "shell" command this results in changes that have been
applied from the shell not being visible from a subsequent "whois". The
command may instead serve information on an earlier, cached version of the
resource instead of the latest version.
This implementation uses dagger for parameterization of cached/non-cached
modes. I did consider the possibility of simply parameterizing the query
commands in all cases as discussed, however, having gone down the
daggerization path and having gotten it to work, I have to say I find this
approach preferable. There's really no case for identifying
cached/non-cached on a per-command basis and doing so would require
propagating the flag throughout all levels of the API and all callsites.
Tested: In addition to the new unit test which explicitly verifies the
caching/noncaching behavior of the new commands, tested the actual failing
sequence from "nomulus -e sandbox shell" and verified that the correct results
are served after a mutation.
* Fixed copyright year
* More copyright date fixes
* Added WhoisCommandFactoryTest to fragile tests
I suspect that this test has some contention with other tests; it's not clear
why.
* Replay Cloud SQL transactions against datastore
Implement the ReplicateToDatastore cron job that will apply all Cloud SQL
transactions to the datastore. The last transaction id is stored in a
LastSqlTransaction entity in datastore.
Note that this will not be activated in production until a) the cron
configuration is updated and b) the cloudSql.replicateTransactions flag is set
to true in the nomulus config file.
* Post-review changes
Fixed immutability issues with LastSqlTransaction, and write a single transaction
at a time to datastore.
* Changes requested in review
* Get a batch of SQL transactions
Read a batch of SQL transactions at a time and then process them
transactionally against datastore.
* Bring this up-to-date with the codebase
* Changes requested in review
* Fixed date in copyright
* Clean up Gradle Flyway tasks in :db
Simplified the command line by revising the semantics of some
properties.
Added examples of Flyway task invocations.
This script still uses the GCS file-based credential. We will migrate it
to the Secret Manager soon.
* Log forbidden HTTP request method at warning
This seems more reasonable. It will make potential issues with how
requests are generated more discoverable in the logs.
* Reject handshakes with bad TLS protocols and ciphers
* Fix protocols
* make cipher suite list static and fix tests
* Delete unnecessary line
* Add start time configuration for enforcement
* small format fix
* Add multiple ciphersuite test
* fix gradle lint
* fix indentation
GAE cron only issues HTTP GET requests to the endpoint in question. This
particular endpoint only allows POSTs. As a result this cron job never succeeded.
This is not a big problem, as this job is meant to catch up on any
unforeseen upload failures; in case it needs to catch up but fails,
every month the staging job (which is enqueued correctly by cron) will
eventually catch everything up to date.
* Update a few plugins for Java 11 compatibility
Guice 5.0.1 is now compatible with Java 11. However, we don't
directly depend on Guice; rather, Soy depends on Guice. So I added a
direct dependency on Guice 5.0 just before Soy in order to frontload Soy
and pull in the newer version.
Mockito 3.7.7 is now compatible with Java 11. The complication is that
we need to use the inline version of Mockito, which among other things
also allows mocking for final classes (hooray!). It will eventually
become the default Mockito mock maker but for now it needs to be
manually activated.
Note that the inline version now introduces another warning:
```
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
```
I think this is WAI due to how the inline mock maker works. Waiting on
the author to confirm.
After these two changes the only illegal reflective access is caused by
App Engine SDK tools, which we will rid ourselves of when we migrate off
of GAE.
* Restore package-lock.json
* Add GetPremiumListCommand
When testing the premium list refactor, it would have been nice and
convenient to have this. Currently we have no way of inspecting premium
list contents that I'm aware of.
- Adds a CriteriaQueryBuilder class that allows us to build
CriteriaQuery objects with sane and modular WHERE and ORDER BY clauses
(a usage sketch follows this list). CriteriaQuery requires that all WHERE
and ORDER BY clauses be specified at the same time (else later ones will
overwrite the earlier ones), so in order to have a proper builder pattern
we need to wait to build the query object until we are done adding clauses.
- In addition, encapsulating the query logic in the CriteriaQueryBuilder
class means that we don't need to deal with the complicated Root/Path
branching, otherwise we'd have to keep track of CriteriaQuery and Root
objects everywhere.
- Added a REPLAYED_ENTITIES TransitionId that will represent all
replayed entities, e.g. EppResources. Also sets this, by default, to
always be CLOUD_SQL if we're using the SQL transaction manager in tests.
- Added branching logic in RdapEntitySearchAction based on that transition
ID that determines whether we do the existing ofy query logic or JPA
logic.
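A hedged usage sketch of the builder described in the first bullet above; the
method names, field names, and registrar ID approximate the description and may
not match the real class exactly:
```
// Sketch only: WHERE and ORDER BY clauses accumulate in the builder and are
// applied together when build() is called, avoiding the overwrite problem.
CriteriaBuilder cb = jpaTm().getEntityManager().getCriteriaBuilder();
CriteriaQuery<DomainBase> criteriaQuery =
    CriteriaQueryBuilder.create(DomainBase.class)
        .where("currentSponsorClientId", cb::equal, "TheRegistrar")
        .orderByAsc("fullyQualifiedDomainName")
        .build();
List<DomainBase> results =
    jpaTm().getEntityManager().createQuery(criteriaQuery).getResultList();
```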
Because we don't store serverApproveEntities specifically as a set in
the SQL world, we need to make sure that the entities are all separated
and stored if they exist. For domain transfers, there exist three
separate poll messages (client losing, client gaining, autorenew), so we
need to store and retrieve each of them.
Found this while converting domain transfer flows to SQL.
* Partially convert RDAP ofy calls to tm calls
This converts the simple retrieval actions but does not touch the more
complicated search actions -- those use some ofy() query searching logic
and will likely end up being significantly more complicated than this
change. Here, though, we are just changing the calls that can be
converted easily to tm() lookups.
To change in future PRs:
- RdapDomainSearchAction
- RdapEntitySearchAction
- RdapNameserverSearchAction
- RdapSearchActionBase
* Update NPM plugin and hardcode versions of Node / NPM to use
The plugin we were using before was a bit old (last updated in March
2019); this one is newer and updates the package-lock file
with the new dependency upgrades.
* Refactor PremiumList storage and retrieval for dual-database setup
Previously, the storage and retrieval code was scattered across various
places haphazardly and there was no good way to set up dual database
access. This reorganizes the code so that retrieval is simpler and it
allows for dual-write and dual-read.
This includes the following changes:
- Move all static / object retrieval code out of PremiumList -- the
class should solely consist of its data and methods on its data and it
shouldn't have to worry about complicated caching or retrieval
- Split all PremiumList retrieval methods into PremiumListDatastoreDao
and PremiumListSqlDao that handle retrieval of the premium list entry
objects from the corresponding databases (since the way the actual data
itself is stored is not the same between the two)
- Create a dual-DAO for PremiumList retrieval that branches between
SQL/Datastore depending on which is appropriate -- it will read from
and write to both but only log errors for the secondary DB
- Cache the mapping from name to premium list in the dual-DAO. This is a
common code path regardless of database so we can cache it at a high
level
- Cache the ways to go from premium list -> premium entries in the
Datastore and SQL DAOs. These caches are specific to the corresponding
DB and should thus be stored in the corresponding DAO.
- Moves the database-choosing code from the actions to the lower-level
dual-DAO. This is because we will often wish to access this premium list
data in flows and all accesses should use the proper DB-selecting code
* Properly set up JPA in BEAM workers
Sets up a singleton JpaTransactionManger on each worker JVM for all
pipeline nodes to share.
Also added/updated relevant dependencies. The BEAM SDK version change
caused the InitSqlPipeline's graph to change.
* Fix ContactTransferData SQL loads
ContactTransferData is currently loaded back from SQL as an unspecialized
TransferData object. Replace it with the ContactTransferData object that it
is used to reconstitute.
It's likely that this could be done more straightforwardly with a schema
change.
* Changes requested in review
* Fix obscure bug when checking restore prices of duplicate domain names
There were instances of "java.lang.IllegalArgumentException: Multiple entries
with same key" in the logs, caused by attempting to construct an ImmutableMap
containing duplicate keys. It turns out this was happening in the domain check
flow when the following conditions were all simultaneously met:
1. The older v06 fee extension is used
2. The same domain name is being queried multiple times in a single check
command (which is valid per the spec but doesn't actually make any sense)
3. Said domain exists
4. The cost of a restore (an uncommon operation) is being checked
When all of those conditions were met, an error was being thrown when the
dupe-containing list of domain names was used as the keys of a new Map. This
fixes that bug by calling .distinct() first.
Give enough registrars enough typewriters ...
BUG=179052195
* Revert BEAM pipeline back to SQL credential file
Stop using the SecretManager for SQL credential in BEAM for now. The
SecretManager cannot be injected into the code on pipeline workers
because RegistryEnvironment is not set.
See b/179839014 for details.
* Add db-compare tests to three more flows
Add database comparison to the replay tests for DomainDeleteFlowTest,
DomainRenewFlowTest and DomainUpdateFlowTest.
* Add databaseTransitionSchedule entity
* add UpdateDatabaseTransitionScheduleCommand
* small fixes
* change entity structure to no longer be singleton
* add get command
* fix getCache
* Change id to TransitionId enum
* more fixes
* Cleanup tests
* Add link to javadoc
* Add lastUpdateTime
* fix datatype of getCached
* Add a presubmit check to require use of templated SQL string literals
This PR proposes a coding style convention that helps prevent
SQL-injection attacks, and is easy to enforce in the presubmit check.
SQL injections can be effectively prevented if all parameterized queries
are generated using the proper param-binding methods. In our project,
which uses Hibernate exclusively, this can be achieved if we all follow
a simple convention: only pass constant SQL templates, assigned to static
final String variables, as the first parameter to create(Native)Query
methods.
This PR adds a presubmit check to enforce the proposed rule, and
modified one class as a demo. If the team agrees with this proposal, we
will change all other use cases.
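A short sketch of the convention, building on the jpaTm().query(...)
convenience method mentioned earlier; the query and helper here are
hypothetical:
```
// Follows the convention: the template is a compile-time constant, and
// user-supplied values are always bound as parameters, never concatenated in.
private static final String DOMAINS_BY_TLD_QUERY =
    "SELECT d FROM DomainBase d WHERE d.tld = :tld";

static List<DomainBase> loadDomainsForTld(String tld) {
  return jpaTm()
      .query(DOMAINS_BY_TLD_QUERY, DomainBase.class)
      .setParameter("tld", tld)
      .getResultList();
}
```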
* Make BigqueryPollJobAction endpoint internal only
This endpoint makes use of java object deserialization, which allows a
malicious actor to craft a request that can initiate overly broad actions on
the server. Since this endpoint is not widely used for operational purposes,
limit its authorization to "internal only" so that no user agents (even with
admin privs) can access it.
* Add start date for cert enforcement in production
* Add TODO to remove start date check after start date
* revert changes to package-lock.json
* Make start time a constant
* Wire up DeleteExpiredDomainsAction so that it can actually be called
For now I'm just going to be calling it manually (and on sandbox for starters),
but in a few weeks, if all looks good, I'll add the cron job to regularly call
it in production, and this feature will thus be done.
* Use END_OF_TIME as sentinel value for domain's autorenewEndTime
Datastore inequality queries don't work correctly for null; null is treated as
the lowest value possible which is definitely the opposite of the intended
meaning here.
This includes an @OnLoad for backfilling purposes using the ResaveAll mapreduce.
* Make cross database comparison recursive
Cross-database comparison was previously just a shallow check: fields marked
with DoNotCompare on nested objects were still compared. This causes problems
in some cases where there are nested immutable objects.
This change introduces recursive comparison. It also provides a
hasCorrectHashCode() method that verifies that an object has not been mutated
since the hash code was calculated, which has been a problem in certain cases.
Finally, this also fixes the problem of objects that are mutated in multiple
transactions: we were previously comparing against the value in datastore, but
this doesn't work in these cases because the object in datastore may have
changed since the transaction that we are verifying. Instead, check against
the value that we would have persisted in the original transaction.
* Changes requested in review
* Converted check method interfaces
Per review discussion, converted the check method interfaces so that they
consistently return a ComparisonResult object which encapsulates a success
indicator and an optional error message.
* Another round of changes on ImmutableObjectSubject
* Final changes for review
Removed unnecessary null check, minor reformatting.
(this also removes an obsolete nullness assertion from an earlier commit that
should have been fixed in the rebase)
* Try removing that nullness check import again....
* Convert certificate strings to certificates
* Format fixes
* Revert "Format fixes"
This reverts commit 26f88bd313.
* Revert "Convert certificate strings to certificates"
This reverts commit 6d47ed2861.
* Convert strings to certs for validation
* Add clarification comments
* Add test to verify encoded cert from proxy
* Add some helper methods
* add tests for PEM with metadata
* small changes
* replace .com with .test
* Add clientCertificate to TlsCredentials.toString()
FlowRunner.run() logs these credentials to the GAE logs by implicitly using the
toString() method, so we need to add it if we want it to appear in the logs.
* Use CertificateChecker on login
* Add actual enforcement of requirements in sandbox
* Add new Exceptions
* add validation command to RegistryToolComponent
* Fix error messages
* Add a test for production behavior
* check logs in test
* move loghandler
* Convert (most) HistoryEntry ofy calls to tm
As part of this change, it was necessary to make changes in the JPATM that
are similar to (but the opposite of) the changes that we made in
DatastoreTM with regards to converting HistoryEntries to and from the
*History classes.
We leave the ofy() calls in the MapReduce ResaveAllHistoryEntriesAction
for now; that can be converted during the Beam pipeline transition.
Some other tests required registrar-name fixes as well -- because
*History objects have a foreign key on the Registrar table, we have to
use a "real" registrar name in tests.
* Add simple HistoryEntryDaoTest
* Require an override flag to allow updating pending delete domains
Needing to update pending delete domains is an uncommon situation, yet currently
we are allowing superusers to do so without any extra validation (which has led
to errors). This adds a new override flag to gate the update of pending delete
domains; without it, the update will fail.
isEmpty() is not available in the version of Java GAE uses and is throwing
runtime errors (!!). I think these got into our codebase because people don't
have the language version set correctly in IntelliJ; they show as outright
errors for me (I'm on language level 8).
* Add object comparison to replay tests
Allow optional object comparison in the replay test extension and enable it
for the DomainCreateFlow test.
To facilitate this, add two new field annotations to ImmutableObject:
DoNotCompare, to be used for fields that are not relevant to either database,
and Insignificant, to be used for fields that are mutated after they have been
accessed and therefore violate immutability (there is currently only one of
these, though we might discover more in the course of adding more comparisons
to the replay test).
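A minimal sketch of what such marker annotations might look like (names match
the description above; the retention and target choices are assumptions):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Marks fields that should be skipped entirely during cross-database comparison. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface DoNotCompare {}

/** Marks fields mutated after access that would otherwise break immutability checks. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Insignificant {}
```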
* Revert commented out premium price error log
* Added static create methods for ReplayExtension
* Update the Datastore to SQL migration pipeline
The pipeline now includes all entity types to be migrated by it, and has
completed successfully using the Sandbox data set. The running time in Sandbox
is about 3 hours, which extrapolates by entity count to a 12-hour run with
production data. However, the actual running time is likely to be longer,
since throughput is lower for domains, which account for a higher percentage
of the total in production. More optimization will be needed.
The migrated data has not been validated.
* Use better null-handling around registrar certificates
Now with Optional it's always very clear whether they do or do not have values.
isNullOrEmpty() shouldn't be necessary anymore (indeed it wasn't necessary prior
to this either, as the relevant setters in the Registrar builder already coerced
empty strings to null). Also, the cert hash is a required HTTP header, so it
will error out in the Dagger component if it is null or empty, long before
getting to any other code.
* Merge branch 'master' into optional-get-certs
* Allow BEAM pipeline to choose JDBC isolation levels
Some BEAM pipelines may only perform READ-ONLY (e.g., reporting) or
blind-write (datastore to sql data migration) operations, which do not
need the default TRANSACTION_SERIALIZABLE isolation level. In such
cases, a less strict level allows better performance.
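As a sketch in plain JDBC terms (the pipeline's actual connection setup may
differ, but the knob being exposed is the standard isolation-level setting):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class IsolationLevelExample {
  /** Opens a connection relaxed from SERIALIZABLE for read-only reporting work. */
  static Connection openReadOnlyConnection(String jdbcUrl, String user, String password)
      throws SQLException {
    Connection connection = DriverManager.getConnection(jdbcUrl, user, password);
    // Read-only pipelines don't need the default TRANSACTION_SERIALIZABLE level.
    connection.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
    connection.setReadOnly(true);
    return connection;
  }
}
```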
* Refactor naming and behavior of bulk load methods in TransactionManager
The contract of loadByKeys(Iterable<VKey>) specifies that the method will
throw a NoSuchElementException if any of the specified keys don't exist.
We didn't do that before this PR, but now we do.
Existing calls (when necessary) were converted to the new load*
methods, which have the same behavior as the previous methods.
Existing methods were also renamed to be more clear -- see b/176239831
for more details and discussion.
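A hedged sketch of the stricter contract (generic key/value types stand in for
VKey and the loaded entities; the helper name is illustrative):

```java
import java.util.Map;
import java.util.NoSuchElementException;

final class BulkLoadContract {
  /** Fails fast if any requested key is missing, instead of returning a partial result. */
  static <K, V> Map<K, V> checkAllKeysPresent(Iterable<K> requestedKeys, Map<K, V> loaded) {
    for (K key : requestedKeys) {
      if (!loaded.containsKey(key)) {
        throw new NoSuchElementException("Missing entity for key: " + key);
      }
    }
    return loaded;
  }
}
```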
* Add unique constraints on domain_hosts
Add unique constraints on DomainHost (child of DomainBase) and
DomainHistoryHost (child of DomainHistory). DomainHost is a non-entity
embedded object, and Hibernate does not define indexes for it automatically.
This should improve read and write performance of the parent entities.
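A sketch of how such a constraint can be declared on an element-collection
table in JPA (the table and column names here are assumptions; the real schema
is managed via SQL migration scripts):

```java
import java.util.Set;
import javax.persistence.CollectionTable;
import javax.persistence.Column;
import javax.persistence.ElementCollection;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.UniqueConstraint;

@Entity
class DomainExample {
  @Id String repoId;

  // Each (domain, host) pair may appear only once; the unique constraint also
  // gives the database an index to use when reading or rewriting child rows.
  @ElementCollection
  @CollectionTable(
      name = "DomainHost",
      joinColumns = @JoinColumn(name = "domain_repo_id"),
      uniqueConstraints =
          @UniqueConstraint(columnNames = {"domain_repo_id", "host_repo_id"}))
  @Column(name = "host_repo_id")
  Set<String> nsHosts;
}
```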
* Add a fast mode to the ResaveAllEppResourcesAction mapreduce
This new mode avoids writing no-op mutations for entities that don't actually
have any changes to write. The cronjobs use fast mode by default, but manual
invocations do not, as manual invocations are often used to trigger @OnLoad
migrations, and fast mode won't pick up on those changes.
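A rough sketch of the fast-mode check (equality-based skipping is an
assumption about the mechanism; note that @OnLoad changes appear on both sides
of the comparison, which is why @OnLoad migrations should not use fast mode):

```java
import java.util.Objects;

final class FastModeCheck {
  /** Returns true if saving the entity would be a no-op mutation that can be skipped. */
  static <E> boolean isNoOpMutation(E loadedEntity, E entityToSave, boolean fastMode) {
    if (!fastMode) {
      return false; // without fast mode, always write (e.g. to force @OnLoad migrations)
    }
    return Objects.equals(loadedEntity, entityToSave);
  }
}
```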
* Convert AllocationToken-related classes to tm()
For the most part this is a fairly simple conversion -- changing Key
references to VKey references, using JPA transactions when necessary,
and using the TransactionManager interface. There's a bit of cleanup too
in related code.
* Use PollMessageVKey to replace VKey<PollMessage> in DomainBase
* Revert changes to DomainContent
* Use BillingVKey in GracePeriodBase to restore symmetric vkey
* Rebase on HEAD
* Validate SQL credentials in Secret Manager
Load SQL credentials from the SecretManager and compare them with the
ones currently in use in Nomulus server, beam pipeline, and the registry
tool. Normal operations are not affected by failures related to the
SecretManager, be it an IOException, insufficient permissions, or a wrong or
missing credential.
The appengine and compute engine default service accounts must be
granted the permission to access the secret data. In the short term, we
will grant the secretmanager.secretAccessor role to these accounts. In
the long term, with the proposed privilege service, access will be granted
on a per-secret basis.
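A sketch of the comparison step using the Cloud Secret Manager client
(project and secret IDs are placeholders, and the swallow-all-failures
behavior mirrors the description above):

```java
import com.google.cloud.secretmanager.v1.AccessSecretVersionResponse;
import com.google.cloud.secretmanager.v1.SecretManagerServiceClient;
import com.google.cloud.secretmanager.v1.SecretVersionName;

final class SqlCredentialCheck {
  /** Compares the password in use with the latest value stored in Secret Manager. */
  static boolean credentialMatches(String projectId, String secretId, String passwordInUse) {
    try (SecretManagerServiceClient client = SecretManagerServiceClient.create()) {
      SecretVersionName name = SecretVersionName.of(projectId, secretId, "latest");
      AccessSecretVersionResponse response = client.accessSecretVersion(name);
      String storedPassword = response.getPayload().getData().toStringUtf8();
      return storedPassword.equals(passwordInUse);
    } catch (Exception e) {
      // Failures related to the SecretManager must not affect normal operations,
      // so treat them as a (logged) match rather than propagating.
      return true;
    }
  }
}
```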
For some reason, our docker build image has started using a non-utf8 default
encoding. Specify the encoding explicitly on python "open()" to override.
Note that this might not entirely fix the build: it's possible that this
problem may affect other portions of the build.
* Modify SignedMarkRevocationList to not swallow CloudSQL failures in unittests
* restore package-lock.json
* Added suppressExceptionUnlessInTest()
* Add a DatabaseMigrationUtils class
* small changes
This parses through all pre-existing Spec11 files in GCS (starting at
2019-01-01, which is basically when the new format started) and maps them
to the new Spec11ThreatMatch objects.
Because the old format stored domain names only and the new format stores
names + repo IDs, we need to retrieve the DomainBase objects from the
point in time of the scan (failing if they don't exist). Because the
same domains appear multiple times (we estimate a total of 100k+ entries
but only 1-2k unique domains), we cache the DomainBase objects that we
retrieve from Datastore.
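A minimal sketch of that cache, assuming a loader keyed by domain name (the
real code presumably also keys on the point in time of the scan):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class DomainLookupCache<D> {
  private final Map<String, D> cache = new HashMap<>();
  private final Function<String, D> loader;

  DomainLookupCache(Function<String, D> loader) {
    this.loader = loader;
  }

  /** Loads each unique domain name at most once; repeated entries reuse the cached value. */
  D get(String domainName) {
    return cache.computeIfAbsent(domainName, loader);
  }
}
```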
* Allow disabling UpdateAutoTimestamp updates
Allow us to disable timestamp updates within a try-with-resources block for a
given thread. This functionality will be needed for transaction replays both
to and from datastore.
As part of this, also upgrade the UpdateAutoTimestampTest to a
DualDatabaseTest so we can verify that the functionality works both on
Datastore and Cloud SQL.
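A sketch of how such a per-thread, try-with-resources toggle might work (the
class and method names here are hypothetical):

```java
final class AutoTimestampToggle implements AutoCloseable {
  // Hypothetical per-thread flag consulted by the timestamp-updating code.
  private static final ThreadLocal<Boolean> updatesDisabled =
      ThreadLocal.withInitial(() -> false);

  static boolean autoUpdateEnabled() {
    return !updatesDisabled.get();
  }

  static AutoTimestampToggle disableAutoUpdates() {
    updatesDisabled.set(true);
    return new AutoTimestampToggle();
  }

  @Override
  public void close() {
    updatesDisabled.set(false);
  }
}
```

Usage would then look roughly like
`try (AutoTimestampToggle ignored = AutoTimestampToggle.disableAutoUpdates()) { ... }`,
with timestamps left untouched for anything saved inside the block.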
* Use 10 workers instead of the default 100 for re-save all EPP resources
The intended/desired effect is to have a larger number of GCS commit log diffs
spread out over a longer period of time, with each diff itself being
significantly smaller. This should retain roughly the same amount of total work
for the async Cloud SQL replication action to have to deal with, but spread
across 10X as much time.
* Add a credential store backed by Secret Manager
Added a SqlCredentialStore that stores user credentials with one level
of indirection: for each credential, an additional secret is used to
identify the 'live' version of the credential. This is a work in
progress and the overall design is explained in
go/dr-sql-security.
Also added two nomulus commands for credential management. They are
stop-gap measures that will be deprecated by the planned privilege
management system.
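A sketch of the one-level indirection under these assumptions: a small pointer
secret names the 'live' version of the credential secret (the identifiers and
the SecretReader abstraction are hypothetical):

```java
final class LiveCredentialResolver {
  /** Minimal abstraction over whatever secret backend is in use. */
  interface SecretReader {
    String read(String secretId, String version);
  }

  static String resolveLiveCredential(SecretReader reader, String credentialName) {
    // e.g. "sql-cred-live-version" might contain "7"...
    String liveVersion = reader.read(credentialName + "-live-version", "latest");
    // ...which points at the credential payload that is currently live.
    return reader.read(credentialName, liveVersion);
  }
}
```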
* Parameterize the serialization of objects being written to SQL
We shouldn't require that objects written to SQL during a Beam pipeline
be VersionedEntity objects -- they may be non-Objectify entities. As a
result, we should allow the user to specify what the objects are that
should be written to SQL.
Note: we will need to clean up the Spec11PipelineTest more but that can
be out of the scope of this PR.
* Overload the method and add a bit of javadoc
* Actually use the overloaded function
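A sketch of the parameterization described above, using Beam's
SerializableFunction so the caller decides how elements become SQL entities
(the class and field names here are illustrative):

```java
import org.apache.beam.sdk.transforms.SerializableFunction;

final class SqlWriterConfig<T> {
  // Caller-supplied conversion, so the writer no longer assumes VersionedEntity inputs.
  private final SerializableFunction<T, Object> toSqlEntity;

  SqlWriterConfig(SerializableFunction<T, Object> toSqlEntity) {
    this.toSqlEntity = toSqlEntity;
  }

  Object convert(T element) {
    return toSqlEntity.apply(element);
  }
}
```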
* Resurrect symmetric vkeys when possible.
AbstractVKeyConverter had never been updated to generate symmetric VKeys for
simple (i.e. non-composite) VKey types. Update it to default to generating a
VKey with a simple Ofy key.
With this change, composite VKeys must specify the "compositeKey = true" field
in the With*VKey annotations. Doing so creates an asymmetric (SQL only) VKey
that triggers higher-level logic to generate a corrected VKey containing
the composite Key.
* Add an action to replay commit logs to SQL
- We import from the commit-log GCS files so that we are sure that we
have a consistent snapshot.
- We use the object weighting (moved to ObjectWeights) to verify that we
are replaying objects in the correct order (see the sketch after this list).
- We add a config setting (default false) for whether or not to replay
the logs.
- The action is triggered after the export if the aforementioned config
setting is on.
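A rough sketch of weight-based ordering (the real class is ObjectWeights,
renamed to EntityWritePriorities in a later review response; the kind names
and priority values here are purely illustrative):

```java
import com.google.common.collect.ImmutableMap;
import java.util.Comparator;
import java.util.List;

final class ReplayOrdering {
  // Lower values are written first so that referenced parents exist before
  // the children that point at them.
  private static final ImmutableMap<String, Integer> WRITE_PRIORITY =
      ImmutableMap.of(
          "Registrar", 0, "ContactResource", 10, "HostResource", 10, "DomainBase", 20);

  static void sortForReplay(List<Object> mutations) {
    mutations.sort(
        Comparator.comparingInt(
            m -> WRITE_PRIORITY.getOrDefault(m.getClass().getSimpleName(), 100)));
  }
}
```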
* Responses to CR
- Remove triggering of replay from the export action and remove the test
changes
- Add a method to load commit log diffs by transaction
- Replay one Datastore transaction at a time, per SQL transaction
- Minor logging / comment changes
- Change ObjectWeights to EntityWritePriorities and flesh out javadoc
* More CR responses
- Use one transaction per GCS diff file
- Fix up comments minutiae
* Add a class-level javadoc
* Add a log message and some periods
* bit of formatting
* Merge remote-tracking branch 'origin/master' into replayAction
* Handle toSqlEntity rather than toSqlEntities