Commit Graph

2075 Commits

Author SHA1 Message Date
Tomasz Grabiec
d9f0c1f097 tests: cache: Fix invalidate() not being waited for
Probably responsible for occasional failures of subsequent assertion.
Didn't mange to reproduce.

Message-Id: <1520330967-584-1-git-send-email-tgrabiec@scylladb.com>
2018-03-06 12:14:04 +02:00
Duarte Nunes
0c05fc0bff tests/flush_queue_test: Don't assume continuations run immediately
This patch fixes an issue with test_propagation(), where the test
assumed that after the future returned from wait_for_pending(0)
resolved, the continuations set for the post operation had already
run, which is not true.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180305131908.7667-1-duarte@scylladb.com>
2018-03-05 15:22:33 +02:00
Avi Kivity
1dae29b48d test: mutation_reader_test: fix no-timeout case in reader_wrapper
reader_wrapper's _timeout defaults to now(), which means to time
out immediately rather than no timeout.

Fix by switching to a time_point, defaulting to no_timeout, and
provide a compatible constructor (with a duration parameter) for
callers that do want a duration-based timeout.

Tests: mutation_reader_test (debug, release)
Message-Id: <20180305111739.31972-1-avi@scylladb.com>
2018-03-05 12:40:07 +01:00
Vlad Zolotarov
e3ca390333 tests: gce_snitch_test: drop the property file related message
The message in question is printed with printf() which is bad by itself.
And most importantly this test uses a single .property file so this message
doesn't add any interesting information to begin with. Therefore it makes
more sense to drop it than to fix it.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1519661059-13325-1-git-send-email-vladz@scylladb.com>
2018-03-04 16:16:37 +02:00
Jesse Haber-Kucharsky
90af3d889a tests: Rename test for consistency
Now we have `cql_auth_query_test` and `cql_auth_syntax_test`.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
b84e22acdd cql: Elaborate error for quoted user names
Since quoted names are allowed for role names, we add a more descriptive
error message when a quoted name is (erroneously) used for a user name.

This behavior is consistent with Apache Cassandra.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
b5264d8bf7 cql: Allow role names to be string literals
This behavior matches that of Apache Cassandra. When a role name is
specified as a string literal (single quotes), the case is preserved.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
d7f2035dea cql: Make role syntax more consistent
This patch changes the syntax for CQL statements related to roles to
favor a form like

    CREATE ROLE sam WITH PASSWORD = 'shire' AND LOGIN = false;

instead of

    CREATE ROLE sam WITH PASSWORD 'shire' NOLOGIN;

This new syntax has the benefit of not imposing any ordering constraints
on the modifiers for roles and being consistent with other parts of the
CQL grammar. It is also consistent with syntax in Apache Cassandra.

The old USER-based statements (CREATE USER and ALTER USER) still have
the old forms for backwards compatibility.

A previous change modified the USER-related statements to allow for the
OPTIONS option. However, this was a mistake; only the PASSWORD option
should have been allowed. This patch also corrects this mistake.
2018-03-01 12:04:40 -05:00
Jesse Haber-Kucharsky
62bfc3939c tests: Add CQL syntax tests for access-control
These are quick-running tests for verifying the accepted forms of CQL
statements (and fragments) related to access-control: users, roles, and
permissions.

Establishing the allowed forms of statements is helpful for reference,
but also makes syntax changes (like those expected in later patches)
clearer and more safe.
2018-03-01 11:46:37 -05:00
Avi Kivity
d973445a94 Merge "sstable/schema extensions" from Calle
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)

Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object on boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).

Table/view tables already contain an extension element, so
we utilize this to persist config.

schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes.

Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and by extension extensions. DB, upon instansiation, calls
a thread local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.

Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"

* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
  extensions: Small unit test
  sstables: Process extensions on file open
  sstables::types: Add optional extensions attribute to scylla metadata
  sstables::disk_types: Add hash and comparator(sstring) to disk_string
  schema_tables: Load/save extensions table
  cql: Add schema extensions processing to properties
  schema_tables: Require context object in schema load path
  schema_tables: Add opaque context object
  config_file_impl: Remove ostream operators
  main/init: Formalize configurables + add extensions to init call
  db::config: Add extensions as a config sub-object
  db::extensions: Configuration object to store various extensions
  cql3::statements::property_definitions: Use std::variant instead of any
  sstables: Add extension type for wrapping file io
  schema: Add opaque type to represent extensions
  sstables::compress/compress: Make compression a virtual object
2018-02-26 17:15:29 +02:00
Calle Wilund
e75d3dc997 extensions: Small unit test
Test basic operation of schema and sstable extensions
2018-02-26 10:43:37 +00:00
Jesse Haber-Kucharsky
82c8104c72 cql_test_env: Ignore error if user already exists
When a `cql_test_env` points to a data directory that was previously
populated with `cql_test_env`, then the "tester" user will already
exist. This is not an error, so we can just ignore the exception.

Fixes #3224.

Tests: unit (debug)
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <7729e5a98d8020a7ed1b6d12d8726559f0850f9d.1519315698.git.jhaberku@scylladb.com>
2018-02-22 19:30:50 +01:00
Pekka Enberg
f1f691b555 Merge "Add the GoogleCloudSnitch" from Vlad
"This series adds the GoogleCloudSnitch.

 Fixes #1619"

* 'google-cloud-snitch-v4' of https://github.com/vladzcloudius/scylla:
  config: uncomment/add the supported snitches description
  tests: added gce_snitch_test
  locator::gce_snitch: implementation of the GoogleCloudSnitch
  locator::snitch_base: properly log the failure during the snitch startup
2018-02-19 15:58:56 +02:00
Paweł Dziepak
d97eebe82d tests/cql3: increase TTL to avoid spurious failures
The test inserts some values with a TTL of 1 second and then
reads them back expecting them not to be expired yet. That may not
always be the case if the machine is slow and we are running in the
debug mode. Increasising the TTLs by x100 should help avoid these
false positives.

Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com>
2018-02-19 15:40:19 +02:00
Duarte Nunes
d394b30882 tests/flush_queue_test: Ensure queue is closed before being destroyed
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217172008.27551-1-duarte@scylladb.com>
2018-02-19 13:10:28 +00:00
Duarte Nunes
294326b5b1 tests/commitlog_test: Close file
Operations on a append_challenged_posix_file_impl schedule asynchronous
operations when they are executed, which capture the file object. To
synchronize with them and prevent use-after-free, we need to call
close() before destroying the file.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217170556.27330-1-duarte@scylladb.com>
2018-02-19 13:10:14 +00:00
Duarte Nunes
ac55210677 tests/logalloc_test: Ensure regions are reclaimed in order
This test relied on task execution order to work correctly. Namely, it
relied on parent regions being reclaimed before child regions
(reclaiming is an asynchronous process started by a call to
start_reclaiming()). This order is necessary because child regions
don't know about parent regions when calculating the biggest region
that should be reclaimed.

We fix this by forcing the reclaim order.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217121655.26057-1-duarte@scylladb.com>
2018-02-19 13:09:59 +00:00
Duarte Nunes
4fdcd6c92f tests/serialized_action_test: Don't rely on task execution order
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216191050.21902-1-duarte@scylladb.com>
2018-02-19 13:08:58 +00:00
Duarte Nunes
03608d269e Merge 'On the road to roles' from Jesse
This series takes Scylla most of the way to supporting roles, and
eliminates old user-based code. All the old user-based CQL statements
and functionality should exist as they did before, except now they are
backed internally by roles.

While all the functionality for supporting roles should be present,
role-specific features like granting a role to another role still warn
as "unimplemented". This will continue until the next series addresses
the final touches. These remaining items are:

- A slightly revised CQL syntax consistent with Apache Cassandra's
  revised role syntax.

- A user is automatically granted permissions on resources they create.

Users running a previous version of Scylla should be able to seamlessly
upgrade to a version of Scylla with this series merged. When a newly
upgraded node starts, it detects the presence of old metadata and copies
it to the new metadata tables if no nondefault new metadata yet exists.
A new gossiper feature flag, ROLES, also ensures that access-control
data is not modified while a cluster is in a partially-upgraded state.
If, when the cluster is in a partially upgraded state, a client connects
to an un-upgraded node then likely the change will not be propogated to
the new metadata table. We will document that changes to access-control
are not supported while upgrading in order to account for both cases
(a client connecting to an upgraded and a non-upgraded node).

All unit tests pass (except those which also fail on `master`).

I've run auth-related dtests and they all pass, except for tests which
depend on the old security model and which are therefore invalid.
Upstream dtests have been updated to account for this new security model,
and I will open an appropriate pull request to to similarly update our
own version.

I have also done a test-run cluster upgrade procedure with ccm
consisting of a 3 node cluster. I began by creating the cluster from
`master` and increasing the replication factor of the `system_auth`
keyspace to 3 and repairing the nodes. I then created several users and
granted them permissions on some resources. I then stopped a node,
updated its hardlinked executable to Scylla built from this patch series
, and restarted the node. I observed the migration of legacy data
starting and finishing. Connecting to the node, I observed all the new
roles functionality was working correctly. I verified that attempting to
change access-control information failed with a message about an
upgrading cluster. I repeated the process, node by node, with the
remaining two nodes and finally observed that the entire cluster had
upgraded and that I could modify access-control information freely.  I
will encapsulate this test into a dtest if possible.

Fixes #1941.

* 'jhk/switch_to_roles/v6' of https://github.com/hakuch/scylla: (83 commits)
  cql3: Remove some unimplemented warnings
  cql3: Prevent unhandled exception for anonymous user
  auth: Add alias for set of role names
  auth: Revoke permissions on dropped role resources
  auth: Move definition to corresponding .cc file
  cql3: Fix life-time of `user` from `client_state`
  auth: Migrate legacy data on boot
  auth: Check protected resources of the role-manager
  auth: Protect authenticator resources
  service/client_state: Correct erroneous comment
  client_state: Fix error message
  cql3: Fix error handling for GRANT and REVOKE
  auth: Remove unnecessary `sstring` allocation
  cql3: Rename variables to reflect roles
  auth: Decouple authorization and role management
  auth: Add code to expand a resource family
  cql: Also add `username` col. for LIST PERMISSIONS
  cql3: Fix error handling in LIST PERMISSIONS
  auth: Change error messages to pass dtests
  cql3: Handle errors more precisely for roles
  ...
2018-02-16 13:57:29 +00:00
Tomasz Grabiec
9c3e56fb16 tests: row_cache: Improve test for snapshot consistency on eviction
Reproduces https://github.com/scylladb/scylla/issues/3215.
Message-Id: <1518710592-21925-1-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:23 +00:00
Tomasz Grabiec
b0b57b8143 mvcc: Do not move unevictable snapshots to cache
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted realtive to the merged view, not
within a version, so evicting from some of the versions may mark
reanges as continuous when before they were discontinuous. Also, range
tombtsones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.

The fix is to revert back to full eviciton, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.

Fixes #3215.

Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:07 +00:00
Tomasz Grabiec
b3415880b2 tests: row_cache: Add test for exception safety of updates from memtable 2018-02-15 10:13:02 +01:00
Jesse Haber-Kucharsky
5be16247cc auth: Decouple authorization and role management
auth: Decouple authorization and role management

Access control in Scylla consists of three main modules: authentication,
authorization, and role-management.

Each of these modules is intended to be interchangeable with alternative
implementations. The `auth::service` class composes these modules
together to perform all access-control functionality, including caching.

This architecture implies two main properties of the individual
access-control modules:

- Independence of modules. An implementation of authentication should
  have no dependence or knowledge of authorization or role-management,
  for example.

- Simplicity of implementing the interface. Functionality that is common
  to all implementations should not have to be duplicated in each
  implementation. The abstract interface for a module should capture
  only the differences between particular implementations.

Previously, the authorization interface depended on an instance of
`auth::service` for certain operations, since it required aggregation
over all the roles granted to a particular role or required checking if
a given role had superuser.

This change decouples authorization entirely from role-management: the
authorizer now manages only permissions granted directly to a role, and
not those inherited through other roles.

When a query needs to be authorized, `auth::service::get_permissions`
first uses the role manager to check if the role has superuser. Then, it
aggregates calls to `auth::authorizer::authorize` for each role granted
to the role (again, from the role-manager) to determine the sum-total
permission set. This information is cached for future queries.

This structure allows for easier error handling and
management (something I hope to improve in the future for both the
authorizer and authenticator interfaces), easier system testing, easier
implementation of the abstract interfaces, and clearer system
boundaries (so the code is easier to grok).

Some authorizers, like the "TransitionalAuthorizer", grant permissions
to anonymous users. Therefore, we could not unconditionally authorize an
empty permission set in `auth::service` for anonymous users. To account
for this, the interface of the authorizer has changed to accept an
optional name in `authorize`.

One additional notable change to the authorizer is the
`auth::authorizer::list`: previously, the filtering happened at the CQL
query layer and depended on the roles granted to the role in question.
I've changed the function to simply query for all roles and I do the
filtering in `auth::system` in-memory with the STL. This was necessary
to allow the authorizer to be decoupled from role-management. This
function is only called for LIST PERMISSIONS (so performance is not a
concern), and it significantly reduces demand on the implementation.

Finally, we unconditionally create a user in `cql_test_env` since
authorization requires its existence.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
0ac7d9922d auth: Add code to expand a resource family
This will be useful for the next change, where it is used for
refactoring LIST PERMISSIONS.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
f4fc12fbf0 enum_set: Add iterator
Sometimes it is useful to be able to query for all the members of an
`enum_set`, rather than just add, remove, and query for membership. (The
patch following this one makes use of this in the auth. sub-system).

We use the bitset iterator in Seastar to help with the implementation.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
bbe09a4793 enum_set: Throw on bad mask
`super_enum::valid_is_valid_sequence` determines if the numeric index
corresponding to an enumeration value is valid. This is important,
because it is undefined behavior to cast an invalid index into an
enumeration value.

This function is used to check the validity of the `enum_set` mask when
an `enum_set` is constructed in `enum_set::from_mask`. If the mask has
set bits that correspond to invalid enumeration indicies, then we throw
`bad_enum_set_mask`.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
1cf6dd85fb tests: Add basic tests for enum_set
This is motivated by a small addition to `enum_set` and `super_enum`
that follows this patch.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
c1504cd4ff auth: Pass resource by const ref.
This has the dual benefit of not enforcing copying on implementations of
the abstract interface and also limiting unnecessary copies.

As usual with Seastar, we follow the convention that a reference
parameter to a function is assumed valid for the duration of the
`future` that is returned. `do_with` helps here.

By adding some constants for root resources, we can avoid using
`seastar::do_with` at some call-sites involving `resource` instances.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
a3eaf9e697 auth: Remove unused "performer" argument
This argument used to be used for access-control checks, but this has
all moved to the CQL layer.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
e6363e15de auth/resource: Construct from ctor
The motivation behind this change is the idea that constructing a new
instance of an object is the job of the constructor.

One big benefit of this structure (with the addition of helpers for
convenience) is that calls for emplacing instances (like
`std::make_shared`, or `std::vector::emplace_back`) work without any
difficulty. This would not be true for static construction functions.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
de33124c39 Don't store authenticated_user in shared_ptr
All we require are value semantics.

`client_state` still stores `authenticated_user` in a `shared_ptr`, but
the behavior of that class is complex enough to warrant its own
discussion/design/refactor.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
e11de26d50 auth: Simplify authenticated_user interface
The most important change is replacing `auth::authenticated_user::name`
with a public `std::optional<sstring>` member. Anonymous users have no
name. This replaces the insecure and bug-prone special-string of
"anonymous" for anonymous users, which does unfortunate things with the
authorizer.

The new `auth::is_anonymous` function exists for convenience since
checking the absence of a `std::optional` value can be tedious.

When a caller really wants a name unconditionally, a new stream output
function is also available.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
741d215516 auth: Switch to roles from users
This is a large change, but it's a necessary evil.

This change brings us to a minimally-functional implementation of roles.
There are many additional changes that are necessary, including refined
grammar, bug fixes, code hygiene, and internal code structure changes.
In the interest of keeping this patch somewhat read-able, those changes
will come in subsequent patches. Until that time, roles are still marked
"unimplemented".

IMPORTANT: This code does not include any mechanism for transitioning a
cluster from user-based access-control to role-based access control. All
existing access-control metadata will be ignored (though not deleted).

Specific changes:

- All user-specific CQL statements now delegate to their roles
  equivalent. The statements are effectively the same, but CREATE USER
  will include LOGIN automatically. Also, LIST USERS only lists roles
  with LOGIN.

- A call to LIST PERMISSIONS will now also list permissions of roles
  that have been granted to the caller, in addition to permissions which
  have been granted directly.

- Much of the logic of creating, altering, and deleting roles has been
  moved to `auth::service`, since these operations require cooperation
  between the authenticator, authorizer, and role-manager.

- LIST USERS actually works as expected now (fixes #2968).
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
34280c18bb tests: Rename helper function for clarity 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
b3dc90d5d2 auth: Refactor authentication options
The set of allowed options is quite small, so we benefit from a static
representation (member variables) over a dynamic map.

We also logically move the "OPTIONS" option to the domain of the
authenticator (from user management), since this is where it is applied.

This refactor also aims to reduce compilation time by moving
`authentication_options` into its own header file.

While changes to `user_options` were necessary to accommodate the new
structure, that class will be deprecated shortly in the switch to roles.
Therefore, the changes are strictly temporary.
2018-02-14 14:15:57 -05:00
Tomasz Grabiec
1039850515 tests: flat_reader_assertions: Improve failure message 2018-02-14 16:42:49 +01:00
Tomasz Grabiec
74986f31e8 tests: Disable failure injection around background compactor
Failure could be injected into the compactor if the main code under
test defers before reaching allocation failure point, and compactor
gets hit. This is not what the test is supposed to stress, and it
causes abort when memtable_snapshot_source is destroyed, so disable
failure injection there.
2018-02-14 16:42:49 +01:00
Duarte Nunes
ac6abf8021 Merge 'CQL clustering column secondary indexing support' from Pekka
"This patch series adds support for clustering column secondary indexing.

Fixes #2961

Tests: unit-tests (release)"

* 'penberg/cql-2i-clustering-key-indexing/v2' of github.com:penberg/scylla:
  tests/cql_query_test: Add indexed clustering key query test
  cql3: Fix clustering column secondary indexing
  cql3/statements: Add values() helper to restrictions
  cql3/restrictions: Fix multi_column_restriction::values()
  cql3/restrictions: Fix single_column_primary_key_restrictions::values()
2018-02-12 18:49:34 +00:00
Avi Kivity
e77ecda1da tests: avoid signed/unsigned compares
Container indices are size_t, and in other places we gratuituously
declare a limit as unsigned and the loop index as signed.

Tests: unit (release)
Message-Id: <20180212121642.10525-1-avi@scylladb.com>
2018-02-12 12:25:21 +00:00
Avi Kivity
3f5a8229ac tests: fix for sstable::get_index_reader() removal
71495691aa removed sstable::get_index_reader(),
but forgot to update its callers in tests/.  Update the callers to construct
a temporary shared_index_list and create the index_reader directly.

This is none too clean, but shared_index_lists needs to be retired, and then
the changes in this patch can go away too.

Tests: unit (release)
Message-Id: <20180211164739.17862-1-avi@scylladb.com>
2018-02-11 17:53:08 +00:00
Avi Kivity
432268f582 Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael
"The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's not longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.

Tests:
- unit: release mode
- dtest: resharding_test.py"

* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
  Remove SSTable's atomic deletion manager
  Stop using SSTable's atomic deletion manager
  database: split column_family::rebuild_sstable_list
2018-02-08 19:10:16 +02:00
Avi Kivity
404172652e Merge "Use xxHash for digest instead of MD5" from Duarte
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

The xxHash is a non-cryptographic, 64bit (there's work in progress on
the 128 version) hash that can be used to replace MD5. It performs much
better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3 node cluster with 1 cpu and 8GB,
and with the following cassandra-stress loaders. Measurements are for
the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvoment for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvoment for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...
2018-02-08 18:24:58 +02:00
Tomasz Grabiec
cce1a2bce8 Merge "Use the CPU scheduler" from Glauber & Avi
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
in his behalf. I've done a ton of testing in the series and there are
some improvements / changes that I had previously sent as a separate series.

What you see here is the result of merging that work.

After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.

We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.

* git@github.com:glommer/scylla.git cpusched-v7:

Avi Kivity (4):
  database, sstables, compaction: convert use of thread_scheduling_group
    to seastar cpu scheduler
  memtable, database: make memtable::clear_gently() inherit
    scheduling_group
  config: mark background_writer_scheduling_quota as Unused
  database: place data_query execution stage into scheduling_group

Glauber Costa (9):
  database, main: set up scheduling_groups for our main tasks
  row_cache: actually use the scheduling group for update_cache
  allow update_cache and clear_gently to use the entire task quota.
  database: remove cpu_flush_quota metric
  controllers: retire auto_adjust_flush_quota
  controllers: allow memtable I/O controller to have shares statically
    set
  controllers: update control points for memtable I/O controller
  controllers: allow a static priority to override the controller output
  controllers: unify the I/O and CPU controllers
2018-02-08 15:58:40 +01:00
Raphael S. Carvalho
312bd9ce25 Remove SSTable's atomic deletion manager
Not used anymore, can be deleted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:38:45 -02:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initializtion defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
Glauber Costa
98549775fa sstable_tests: make sure min_threshold is set explicitly
The SSTable tests are a bit fragile now because they rely on min_threshold
having a particular value. That is the default value, but if I change that
default - which I am planning to do - the test breaks.

Right now the test is not broken, but if we are planning on relying on a
property having a particular value in tests, we should explicitly set it.

So I am proactively chaning min_threshold in the tests to have the value
of 4 explicitly, so we can change that in the future without breaking anything.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180207155513.12498-1-glauber@scylladb.com>
2018-02-07 18:45:52 +01:00
Calle Wilund
2b56bbfa7d schema_tables: Require context object in schema load path
Requires "workaround" fix for schema_registry and frozen_mutation, since
the former is a free-float thread local, and the latter is a pure data
carrier. frozen_schema can take a parameter for unfreeze, but schema
registry requires being told which the system extensions are.
2018-02-07 10:11:46 +00:00
Calle Wilund
74758c87cd sstables::compress/compress: Make compression a virtual object
Make a "compressor" an actual class, that can be implemented and
registered via class registry. 

For "common" compressors, the objects will be shared, but complex
implementors can be semi-stateful. 

sstable compression is split into two parts: The "static" config
which is shared across shards, and a "local" one, which holds 
a compressor pointer. The latter is encapsulated, along with 
actual compressed data writers, in sstables/compress.cc.

For compression (write), compression writer is instansiated 
with the settings active in table metadata. 

For decompression (read), compression reader is instansiated
with the settings stored in sstable metadata, which can 
differ from the currently active table metadata. 

v2:
* Structured patch sets differently (dependencies)
* Added more comments/api descs
* Added patch to move all sstable compression into compress.cc,
  effectively separating top-level virtual compressor object
  from sstable io knowledge
v3:
* Rebased
v4: 
* Moved all sstable compression logic/knowledge into  
  compress.cc (local compression). Merged the two patches 
  (separation just confuses reader).
2018-02-07 10:11:45 +00:00
Pekka Enberg
3e4c6cc4da tests/cql_query_test: Add indexed clustering key query test 2018-02-06 16:57:27 +02:00
Paweł Dziepak
6ccd317c38 Merge "Do not evict from memtable snapshots" from Tomasz
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.

Tests: unit (release)"

* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
  tests: mvcc: Add test for eviction with non-evictable snapshots
  mutation_partition: Define + operator on tombstones
  tests: mvcc: Check that partition is fully discontinuous after eviction
  tests: row_cache: Add test for memtable readers surviving flush and eviction
  memtable: Make printable
  mvcc: Take partition_entry by const ref in operator<<()
  mvcc: Do not evict from non-evictable snapshots
  mvcc: Drop unnecessary assignment to partition_snapshot::_version
  tests: Use partition_entry::make_evictable() where appropriate
  mvcc: Encapsulate construction of evictable entries
2018-02-06 14:46:24 +00:00