From 715118ddda1eb900bdac27eb2f8e3e1dd39bd269 Mon Sep 17 00:00:00 2001
From: mcilwain
Date: Thu, 21 Jul 2016 13:35:25 -0700
Subject: [PATCH] Add documentation on our App Engine services and task queues

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=128098514
---
 docs/app-engine-architecture.md | 164 +++++++++++++++++++++++++++++++-
 1 file changed, 163 insertions(+), 1 deletion(-)

diff --git a/docs/app-engine-architecture.md b/docs/app-engine-architecture.md
index 7ca73cf0d..ddc7faed3 100644
--- a/docs/app-engine-architecture.md
+++ b/docs/app-engine-architecture.md
@@ -3,12 +3,174 @@
 This document contains information on the overall architecture of the Domain
 Registry project as it is implemented in App Engine.
 
-## Modules
+## Services
+
+The Domain Registry contains three
+[services](https://cloud.google.com/appengine/docs/python/an-overview-of-app-engine),
+which were called modules in earlier versions of App Engine. The
+services are: default (also called front-end), backend, and tools. Each service
+runs largely independently of the others: it can be upgraded individually, its
+log output is separate, and its servers and scaling configuration are separate
+as well.
+
+### Default service
+
+The default service is responsible for all registrar-facing
+[EPP](https://en.wikipedia.org/wiki/Extensible_Provisioning_Protocol) command
+traffic, all user-facing WHOIS and RDAP traffic, and the admin and registrar web
+consoles, and is thus the most important service. If the service has any
+problems and goes down or stops servicing requests in a timely manner, it will
+begin to impact users immediately. Requests to the default service are handled
+by the `FrontendServlet`, which provides all of the endpoints exposed in
+`FrontendRequestComponent`.
+
+### Backend service
+
+The backend service is responsible for executing all regularly scheduled
+background tasks (using cron) as well as all asynchronous tasks. Requests to
+the backend service are handled by the `BackendServlet`, which provides all of
+the endpoints exposed in `BackendRequestComponent`. These include tasks for
+generating/exporting RDE, syncing the trademark list from TMDB, exporting
+backups, writing out DNS updates, handling asynchronous contact and host
+deletions, writing out commit logs, exporting metrics to BigQuery, and many
+more. Issues in the backend service will not immediately be apparent to end
+users, but the longer it is down, the more obvious it will become that
+user-visible tasks such as DNS and deletion are not being handled in a timely
+manner.
+
+The backend service is also where all MapReduces run, which includes some of the
+aforementioned tasks such as RDE and asynchronous resource deletion, as well as
+any one-off data migration MapReduces. Consequently, the backend service
+should be sized to support not just the normal ongoing DNS load but also the
+load incurred by MapReduces, both scheduled (such as RDE) and on-demand
+(asynchronous contact/host deletion).
+
+### Tools service
+
+The tools service is responsible for servicing requests from the `registry_tool`
+command line tool, which provides administrative-level functionality for
+developers and tech support employees of the registry. It is thus the least
+critical of the three services. Requests to the tools service are handled by
+the `ToolsServlet`, which provides all of the endpoints exposed in
+`ToolsRequestComponent`. Some example functionality that this service provides
+includes the server-side code to update premium lists, run EPP commands from the
+tool, and manually modify contacts, hosts, domains, and other resources.
+Problems with the tools service are not visible to users.
 
 ## Task queues
 
+[Task queues](https://cloud.google.com/appengine/docs/java/taskqueue/) in App
+Engine provide an asynchronous way to enqueue tasks and then execute them on
+some kind of schedule. There are two types of queues: push queues and pull
+queues. Tasks in push queues execute automatically, up to a throttle limit.
+Tasks in pull queues remain there indefinitely until the queue is polled by code
+that is running for some other reason. Essentially, push queues run their own
+tasks while pull queues just enqueue data that is used by something else. Many
+other parts of App Engine are implemented using task queues. For example,
+[App Engine cron](https://cloud.google.com/appengine/docs/java/config/cron) adds
+tasks to push queues at regularly scheduled intervals, and the
+[MapReduce framework](https://cloud.google.com/appengine/docs/java/dataprocessing/)
+adds tasks for each phase of the MapReduce algorithm.
+
+The Domain Registry project uses a particular pattern of paired push/pull queues
+that is worth explaining in detail. Push queues are essential because App
+Engine's architecture does not support long-running background processes, so
+push queues are the fundamental building block that allows asynchronous and
+background execution of code that is not in response to incoming web requests.
+However, they also have limitations in that they do not allow batch processing
+or grouping. That's where the pull queue comes in. Regularly scheduled tasks
+in the push queue will, upon execution, poll the corresponding pull queue for a
+specified number of tasks and execute them in a batch. This allows the code to
+execute in the background while taking advantage of batch processing.
+
+Particulars on the task queues in use by the Domain Registry project are
+specified in the `queue.xml` file. Note that many push queues have a direct
+one-to-one correspondence with entries in `cron.xml` because they need to be
+fanned out on a per-TLD or other basis (see the Cron section below for more
+explanation). The exact queue that a given cron task will use is passed as the
+query string parameter "queue" in the URL specification for the cron task.
+
+Here are the task queues in use by the system. All are push queues unless
+explicitly marked as otherwise.
+
+* `bigquery-streaming-metrics` -- Queue for metrics that are asynchronously
+  streamed to BigQuery in the `Metrics` class. Tasks are enqueued during EPP
+  flows in `EppController`. This means that there is a lag of a few seconds to
+  a few minutes between when metrics are generated and when they are queryable
+  in BigQuery, but this is preferable to slowing all EPP flows down and blocking
+  them on BigQuery streaming.
+* `brda` -- Queue for tasks to upload weekly Bulk Registration Data Access
+  (BRDA) files to a location where they are available to ICANN. The
+  `RdeStagingReducer` (part of the RDE MapReduce) creates these tasks at the end
+  of generating an RDE dump.
+* `delete-commits` -- Cron queue for tasks to regularly delete commit logs that
+  are more than thirty days stale. These tasks execute the
+  `DeleteOldCommitLogsAction`.
+* `dns-cron` (cron queue) and `dns-pull` (pull queue) -- A push/pull pair of
+  queues. Cron enqueues tasks in `dns-cron` each minute, which are then
+  executed by `ReadDnsQueueAction`, which leases a batch of tasks from the pull
+  queue, groups them by TLD, and writes them as a single task to `dns-publish`
+  to be published to the configured DNS writer for the TLD.
+* `dns-publish` -- Queue for batches of DNS updates to be pushed to DNS writers.
+* `export-bigquery-poll` -- Queue for tasks to query the success/failure of a
+  given BigQuery export job. Tasks are enqueued by `BigqueryPollJobAction`.
+* `export-commits` -- Queue for tasks to export commit log checkpoints. Tasks
+  are enqueued by `CommitLogCheckpointAction` (which is run every minute by
+  cron) and executed by `ExportCommitLogDiffAction`.
+* `export-reserved-terms` -- Cron queue for tasks to export the list of reserved
+  terms for each TLD. The tasks are executed by `ExportReservedTermsAction`.
+* `export-snapshot` -- Cron and push queue for tasks to load a Datastore
+  snapshot that was stored in Google Cloud Storage and export it to BigQuery.
+  Tasks are enqueued by both cron and `CheckSnapshotServlet` and are executed by
+  both `ExportSnapshotServlet` and `LoadSnapshotAction`.
+* `export-snapshot-poll` -- Queue for tasks to check that a Datastore snapshot
+  has been successfully uploaded to Google Cloud Storage (this is an
+  asynchronous background operation that can take an indeterminate amount of
+  time). Once the snapshot is successfully uploaded, it is imported into
+  BigQuery. Tasks are enqueued by `ExportSnapshotServlet` and executed by
+  `CheckSnapshotServlet`.
+* `export-snapshot-update-view` -- Queue for tasks to update the BigQuery views
+  to point to the most recently uploaded snapshot. Tasks are enqueued by
+  `LoadSnapshotAction` and executed by `UpdateSnapshotViewAction`.
+* `flows-async` -- Queue for asynchronous tasks that are enqueued during EPP
+  command flows. Currently, all of these tasks correspond to invocations of one
+  of the following three MapReduces: `DnsRefreshForHostRenameAction`,
+  `DeleteHostResourceAction`, or `DeleteContactResourceAction`.
+* `group-members-sync` -- Cron queue for tasks to sync registrar contacts (not
+  domain contacts!) to Google Groups. Tasks are executed by
+  `SyncGroupMembersAction`.
+* `load[0-9]` -- Queues used to load-test the system by `LoadTestAction`. These
+  queues don't need to exist except when actively running load tests (which is
+  not recommended in production environments). There are ten of these queues to
+  provide simple sharding, because the Domain Registry system is capable of
+  handling significantly more queries per second than the highest throttle limit
+  available on task queues (which is 500 qps).
+* `lordn-claims` and `lordn-sunrise` -- Pull queues for handling LORDN exports.
+  Tasks are enqueued synchronously during EPP commands depending on whether the
+  domain name in question has a claims notice ID.
+* `marksdb` -- Queue for tasks to confirm that an upload to NORDN was
+  successfully received and verified. These tasks are enqueued by
+  `NordnUploadAction` following an upload and are executed by
+  `NordnVerifyAction`.
+* `nordn` -- Cron queue used for NORDN exporting. Tasks are executed by
+  `NordnUploadAction`, which pulls LORDN data from the `lordn-claims` and
+  `lordn-sunrise` pull queues (above).
+* `rde-report` -- Queue for tasks to upload RDE reports to ICANN following
+  successful upload of full RDE files to the escrow provider. Tasks are
+  enqueued by `RdeUploadAction` and executed by `RdeReportAction`.
+* `rde-upload` -- Cron queue for tasks to upload already-generated RDE files
+  from Cloud Storage to the escrow provider. Tasks are executed by
+  `RdeUploadAction`.
+* `sheet` -- Queue for tasks to sync registrar updates to a Google Sheets
+  spreadsheet. Tasks are enqueued by `RegistrarServlet` when changes are made
+  to registrar fields and are executed by `SyncRegistrarsSheetAction`.
+
 ## Cron tasks
 
 ## Datastore entities
 
 ## Cloud Storage buckets
+
+## Web.xml
+
+## Cursors