The shard-aware drivers can cause a huge number of connections to be
created when there are tens of thousands of clients. While the
shard-aware drivers are normally beneficial, in those cases they can
consume too much memory.
Provide an option to disable shard awareness on the server (it is likely
easier to do this on the server than to reprovision those thousands of
clients).
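For illustration, the server-side switch in scylla.yaml might look like
this (the option name below is an assumption for illustration, not
taken from this message):

    # Assumed option name: stop advertising shard awareness, so each
    # client opens one connection per node instead of one per shard.
    enable_shard_aware_drivers: false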
Tests: manual test with Wireshark.
Message-Id: <20190223173331.24424-1-avi@scylladb.com>
Right now Cassandra SSTables with counters cannot be imported into
Scylla. The reason is that Cassandra changed its counter
representation in version 2.1 and kept transparently supporting
both representations. We do not support the old representation, nor
is there a sane way to figure out, by looking at the data, which one
is in use.
For safety, we made the decision long ago not to import any
tables with counters: if a counter was generated in older Cassandra, we
would misrepresent it.
In this patch, I propose we offer a non-default way to import SSTables
with counters: we can gate it with a flag, and trust that the user knows
what they are doing when flipping it (at their own peril). Cassandra 2.1
is by now pretty old; many users can safely say they have never used
anything older.
While there are tools like sstableloader that can be used to import
those counters, there are often situations in which directly importing
SSTables is better, faster, or, worse, the only option left. I
argue that having a flag that allows us to import them when we are sure
it is safe is better than having no option at all.
With this patch I was able to successfully import Cassandra tables with
counters that were generated in Cassandra 2.1, reshard and compact their
SSTables, and read the data back to get the same values in Scylla as in
Cassandra.
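For illustration, flipping the proposed gate in scylla.yaml (the flag
name is my recollection of the final option and should be treated as an
assumption):

    # Assumed flag name; allows direct import of counters written by
    # pre-2.1 Cassandra, at the user's own peril.
    enable_dangerous_direct_import_of_cassandra_counters: true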
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190210154028.12472-1-glauber@scylladb.com>
Fixes #4010
Unless the user sets this explicitly, we should try to avoid
deprecated protocol versions. While gnutls should do this for
connections initiated this way, clients such as drivers etc. might
still use obsolete versions.
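As a sketch, obsolete versions can also be pinned out explicitly via a
GnuTLS priority string in scylla.yaml (the paths and the priority value
below are illustrative assumptions):

    client_encryption_options:
        enabled: true
        certificate: /etc/scylla/db.crt   # example path
        keyfile: /etc/scylla/db.key       # example path
        # Example priority string disabling deprecated TLS versions:
        priority_string: SECURE128:-VERS-TLS1.0:-VERS-TLS1.1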
Message-Id: <20190107131513.30197-1-calle@scylladb.com>
Despite the name, this option also controls whether a warning is issued
during memtable writes.
Warning during memtable writes is useful, but the option name also
exists in Cassandra, so probably the best we can do is update the
description.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190125020821.72815-1-espindola@scylladb.com>
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
While we keep ordinary hints in a directory parallel to the data directory,
we decided to keep the materialized view hints in a subdirectory of the data
directory, named "view_pending_updates". But during boot, we expect all
subdirectories of data/ to be keyspace names, and when we notice this one,
we print a warning:
WARN: database - Skipping undefined keyspace: view_pending_updates
This spurious warning annoyed users. Moreover, we could have bigger
problems if the user actually tries to create a keyspace with that name.
So in this patch, we move the view hints to a separate top-level directory,
which defaults to /var/lib/scylla/view_hints, but as usual can be configured.
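For illustration, overriding the new location in scylla.yaml (the
option name is assumed):

    # Assumed option name; defaults to /var/lib/scylla/view_hints.
    view_hints_directory: /var/lib/scylla/view_hints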
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190107142257.16342-1-nyh@scylladb.com>
The "enable_sstables_mc_format" config item help text wants to remove itself
before release. Since scylla-3.0 did not get enough mc format mileage, we
decided to leave it in, so the notice should be removed.
Fixes #4003.
Message-Id: <20181219082554.23923-1-avi@scylladb.com>
Refs #3929
When deleting a segment, IFF we have not yet filled up all reserves,
instead of actually deleting the file, put it on a "recycle" list.
The next segment allocation will then, instead of creating a new file,
simply rename the recycled segment and reuse the file and its allocated
space.
We rename the file twice: once when adding it to the recycle list, with
a special prefix so we don't mix up actual replayable segments with
recycled ones, and a second time when we actually reuse the file (also
to ensure consecutive names).
Note that we limit the number of recyclables, so a really stressed
application which somehow fills up the replenish queue might still
cause us to drop segments. We could skip this limit, but we would risk
getting too many files on disk.
Replay should be safe, since all entries are guarded by a CRC based
on the file ID (i.e. the file name). Thus replaying a recycled segment
will simply cause a CRC error in the main header and be ignored (see
previous patch).
Segments that are fully synced will have a terminating zero-header (see
previous patch), so we know when to stop processing a recycled file.
If a file is the result of a mid-write crash, we will generate a CRC
processing error as "normal" in this case, when hitting a partially
written block or coming to an old/new chunk boundary.
v2:
* Sync dir on rename
* auto -> const sstring&
* Allow recycling files as long as we're within disk space limits
v3:
* Use special names for files waiting for reuse
Limit the message size according to the configuration, to prevent a
huge message from allocating all of the server's memory.
We also need to limit memory used in aggregate by thrift, but that is left to
another patch.
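A scylla.yaml sketch, assuming the limit is exposed as
thrift_max_message_length_in_mb (name from memory, value illustrative):

    # Assumed option name; caps the size of a single Thrift message.
    thrift_max_message_length_in_mb: 16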
Fixes #3878.
Message-Id: <20181024081042.13067-1-avi@scylladb.com>
This flag will only be used for testing purposes until the Scylla 3.0
release and will be removed once SSTables 'mc' testing is completed.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
When murmur3_ignore_msb_bits was introduced in 1.7, we set its default
to zero (to avoid resharding on upgrade) and set it to 12 in the
scylla.yaml template (to make sure we get the right value for new
clusters).
Now, however, things have changed:
- clusters installed before 1.7 are a small minority
- they should have resharded long ago
- resharding is much better these days
- we have more migrations from Cassandra compared to old clusters
To allow clusters that migrated using their cassandra.yaml, and to clean up
the default scylla.yaml, make the default 12.
Users upgrading from pre-1.7 clusters will need to update their scylla.yaml,
or to reshard (which is a good idea anyway).
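In scylla.yaml terms (using the option name as written in this
message), a pre-1.7 cluster that prefers not to reshard would pin the
old value explicitly:

    # Keep the pre-1.7 behavior; the default is now 12.
    murmur3_ignore_msb_bits: 0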
Fixes#3670.
Message-Id: <20180808063003.26046-1-avi@scylladb.com>
Now that we have the controller, we would like to treat min_threshold
as a hint. If there is nothing else to compact, we can ignore it and
start compacting fewer than min_threshold SSTables so that the backlog
keeps shrinking.
But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.
This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.
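Illustrative scylla.yaml sketch; the option name below is an
assumption:

    # Assumed option name; when true, never compact fewer than
    # min_threshold SSTables together.
    compaction_enforce_min_threshold: true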
Signed-off-by: Glauber Costa <glauber@scylladb.com>
View building, enabled by default, can contain or expose issues that
prevent the node from starting. In those cases, it is necessary to
disable view building so that the node can be submitted to
maintenance operations.
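For illustration (option name assumed), such a node would boot with:

    # Assumed option name; skip view building so the node can start.
    enable_view_building: false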
Fixes #3329
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)
Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object at boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).
Table/view tables already contain an extension element, so
we utilize this to persist config.
schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes).
Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and, by extension, extensions. DB, upon instantiation, calls
a thread-local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.
Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"
* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
extensions: Small unit test
sstables: Process extensions on file open
sstables::types: Add optional extensions attribute to scylla metadata
sstables::disk_types: Add hash and comparator(sstring) to disk_string
schema_tables: Load/save extensions table
cql: Add schema extensions processing to properties
schema_tables: Require context object in schema load path
schema_tables: Add opaque context object
config_file_impl: Remove ostream operators
main/init: Formalize configurables + add extensions to init call
db::config: Add extensions as a config sub-object
db::extensions: Configuration object to store various extensions
cql3::statements::property_definitions: Use std::variant instead of any
sstables: Add extension type for wrapping file io
schema: Add opaque type to represent extensions
sstables::compress/compress: Make compression a virtual object
"This series adds the GoogleCloudSnitch.
Fixes#1619"
* 'google-cloud-snitch-v4' of https://github.com/vladzcloudius/scylla:
config: uncomment/add the supported snitches description
tests: added gce_snitch_test
locator::gce_snitch: implementation of the GoogleCloudSnitch
locator::snitch_base: properly log the failure during the snitch startup
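For illustration, selecting the new snitch in scylla.yaml:

    # endpoint_snitch is the standard selector; GoogleCloudSnitch is
    # the snitch added by this series.
    endpoint_snitch: GoogleCloudSnitch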
We have merged the I/O controller without this, but we want to integrate
the CPU and I/O controllers into one. Currently, the quota can be
statically set for the CPU controller. For now, until we gain more
experience with it, we should allow a static value to override the
controller's output as well.
That is particularly important since we don't yet control some
strategies like LCS and the time-based ones. Users in the field may be
using one of those strategies with a static value for background quota.
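A sketch of the static override in scylla.yaml (the option name is
assumed, as is the convention that 0 means "let the controller
decide"):

    # Assumed option name; a non-zero value bypasses the controller
    # and pins the compaction group's shares statically.
    compaction_static_shares: 100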
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It no longer makes sense now that we have the full scheduler and
controllers. In its place, we will provide an option to statically set
the controller's shares as a safeguard against us getting this wrong.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initialization is
deferred), we create the scheduling_groups in main.cc and propagate them
to users via a new class, database_config.
The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.
Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
The idea is that config should be a global, immutable singleton,
set up by startup/test code and then owned/referenced by db etc.
Extensions are read-only in this context, so init code should set them
up before handing them to the config, or keep a reference to the
extensions parameter.
Uncomment the descriptions of Ec2SnitchXXX, which have been supported
for a long time already.
Add the description of the new GoogleCloudSnitch.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
- hints_directory:
  Defines the directory where hint files are going to be stored if
  hinted handoff is enabled.
- hinted_handoff_enabled:
  May receive either a boolean value or a list of DCs. In the latter
  case it defines the DCs for which hints are going to be generated.
- max_hint_window_in_ms:
  The maximum amount of time, in milliseconds, during which hints are
  generated for a node that is DOWN. After this period, hints are no
  longer generated until the node is seen UP again.
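Putting the three options together in scylla.yaml (all values
illustrative):

    hints_directory: /var/lib/scylla/hints   # example path
    # Either a boolean, or a list of DCs to generate hints for:
    hinted_handoff_enabled: DC1,DC2
    max_hint_window_in_ms: 10800000          # example: 3 hours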
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The role manager still does not interact with the rest of the system,
but it is now sharded on all cores and its metadata is created.
The following metadata tables are created:
- `system_auth.roles`
- `system_auth.role_members`
The default superuser, "cassandra", is also created, but has no function.
This option regulates when exactly the single-key optimization is
considered ineffective and turned off.
The threshold is the proportion of the extra data source candidates that
can be read before the optimization is considered ineffective and
disabled. The proportion is calculated as follows:
(read_data_sources - 1) / (total_data_sources - 1)
We subtract 1 from read_data_sources and total_data_sources to
effectively measure the rate of *extra* data sources we read. This
makes sure that the proportion is meaningful even if, for example, we
have a total of only 2 data sources and we read only 1 (the best case).
Whenever this number goes above the threshold the optimization is
disabled. The threshold is a number between 0 and 1: 0 forces the
optimization off and 1 forces it on. Increase the threshold to favor
throughput over latency for single-row reads; decrease the threshold to
improve latency at the expense of throughput.
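As a worked example: a read that touches 3 of 5 candidate data sources
scores (3 - 1) / (5 - 1) = 0.5, which trips a threshold of 0.4 but not
one of 0.6.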
If the threshold is > 0 (i.e., the optimization is not forced off) and
the optimization is disabled due to a read crossing the threshold, we
will issue
"probing" reads (every 100th read) to determine if the optimization is
worth re-enabling. Probing reads are allowed to run through the
optimization path and if they go below the threshold the optimization is
re-enabled.
* seastar 0083ee8...85ca12d (1):
> Merge "Run-time logging configuration" from Jesse
Includes patch from Jesse:
"Switch to Seastar for logging option handling
In addition to updating the abstraction layer for Seastar logging in `log.hh`,
the configuration system (`db/config.{hh,cc}`) has been updated in two ways:
- The string-map type for Boost.program_options is now defined in Seastar.
- A configuration value can be marked as `UsedFromSeastar`. This is like `Used`,
except the option is expected to be defined in the Boost.Program_options
description for Seastar. If the option is not defined in Seastar, or it is
defined with a different type, then a run-time exception is thrown early in
Scylla's initialization. This is necessary because logging options which are
now defined in Seastar were previously defined in Scylla and support for these
options in the YAML file cannot be dropped. In order to be able to verify that
options marked `UsedFromSeastar` are actually defined in Seastar, the
interface for adding options to `db::config` has changed from taking a
`boost::program_options::options_description_easy_init` (which is a handle into
a `boost::program_options::options_description` that only allows adding
options) to taking a `boost::program_options::options_description`
directly (which also allows querying existing options).
Scylla also fully defers to Seastar's support for run-time logging
configuration."
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <ef26cffb91bef1ae95d508187a6dd861a6c4fc84.1503344007.git.jhaberku@scylladb.com>
The number of keyspace and column family metrics reported is
proportional to the number of shards times the number of
keyspaces/column families.
This can cause a performance issue both on the reporting system and on
the collecting system.
This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.
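Illustrative scylla.yaml sketch (the option name is assumed):

    # Assumed option name; off by default, since the series count
    # grows with shards x keyspaces x column families.
    enable_keyspace_column_family_metrics: false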
Fixes #2701
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
Some (most?) users don't read logs or release notes, so they won't notice
that the ByteOrdered and Random partitioners were deprecated in 2.0. Make
them notice by refusing to start with a deprecated partitioner, unless a
switch is explicitly enabled.
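For illustration (switch name assumed), a cluster that truly must keep
a deprecated partitioner would opt in explicitly:

    # Assumed option name; permits booting with the deprecated
    # ByteOrdered/Random partitioners.
    enable_deprecated_partitioners: true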
Message-Id: <20170820073424.8331-1-avi@scylladb.com>
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.
Message-Id: <20170819135212.25230-1-avi@scylladb.com>
* Configuration for cluster_name is commented out in the config file.
* The default value is set to an empty string; if the user does not
  override it, a warning is printed and the value is reset to
  "ScyllaDB Cluster".
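The shipped template therefore looks roughly like this (the exact
comment text is assumed):

    # Commented out by default; if left unset, Scylla warns and
    # falls back to "ScyllaDB Cluster".
    # cluster_name: 'ScyllaDB Cluster'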
Fixes #2648.
Message-Id: <20170808113322.9313-1-daniel@scylladb.com>
It specifies the maximum gossip shadow round time. It can be used to
reduce the gossip feature check time during node boot-up.
For instance, when the first node in the cluster, which lists both
itself and other nodes as seeds in the yaml config, boots up, it will
try to talk to the other seed nodes, which are not started yet. The
gossip shadow round is used to fetch the feature info of the cluster.
Since there is no other seed node up in the cluster, the shadow round
will fail. The user can reduce the shadow_round_ms option to reduce the
boot time.
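For illustration, in scylla.yaml (the option is named in this message;
the value is an example):

    # Lower the shadow round timeout to shorten first-node boot
    # when no other seed is up yet.
    shadow_round_ms: 5000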
Fixes #2615
Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>
This patch introduces a simple controller that will adjust memtable CPU
shares, trying to keep virtual dirty around the soft limit: if we start
going below it, we are flushing too fast (unless we are idle) and shares
are adjusted downwards. If we start going above it, we are flushing too
slowly and shares are adjusted upwards.
I have tested this extensively in a single-CPU setup with various
CPU-bound workloads while tracking virtual dirty and the results are
good, with virtual dirty fluctuating only slightly, somewhere within the
desired range.
Exceptions to this are:
1) when the load is very light - the idle system goes faster, and that's
ok
2) when the load is very high - as foreground requests dominate, we
can't flush fast enough and hit the hard limit. However, in such
scenarios the memtable shares do hit their maximum, so the results are
no worse than they are right now; this will only be fixed by
CPU-limiting the actual requests.
This feature can be disabled with a config option - that is scheduled to
go away as we acquire more confidence in this. When the feature is
disabled, all background writers (streaming, compaction, memtables) will
share the same scheduling group, with static quotas.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In Scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background processes, which are less so.
The most important background processes we have during normal write
workloads are memtable writes and sstable compactions. Those processes
are quite CPU-intensive and, left unchecked, will easily dominate the
CPU. Lower values of task-quota usually help, as they force those
processes to preempt more, but they aren't enough to guarantee good
isolation. We have seen boxes with good NVMe storage have their
throughput dive to less than half of the original baseline for the
duration of a compaction.
In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances at this very
moment. Thankfully, those processes live in a seastar::thread, which
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.
scheduling group.
The goal of this patch is to wrap background processes together in a
scheduling group, and assign that group 50% of our CPU power, the
remainder being left to foreground processes.
While we pride ourselves on dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands, and, let's face it, leaving background processes to run wild is
not adaptive either. Every workload would benefit most from a different
value for those shares, but 50% is as fair as it gets if we really need
static partitioning in the meantime.
As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.
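A sketch of the hidden tunable in scylla.yaml (the option name is
assumed; 0.5 mirrors the 50% split described above):

    # Assumed option name; fraction of CPU granted to the background
    # writers' scheduling group. Not meant to be tuned.
    background_writer_scheduling_quota: 0.5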
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It is useful for larger clusters with larger gossip message latency. By
default fd_max_interval_ms is 2 seconds, which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in larger clusters, the gossip message update
interval can be larger than 2 seconds.
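For illustration (the option is named in this message; the value is an
example):

    # Raise above the 2-second default for large clusters:
    fd_max_interval_ms: 4000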
Fixes #2603.
Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
Make the descriptions of permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries more readable and more related to what they really
do.
Mention the non-zero value requirement for permissions_update_interval_in_ms
and permissions_cache_max_entries when the permissions cache is enabled.
Adjust the parameters' descriptions in scylla.yaml too.
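For reference, the three options side by side in scylla.yaml (values
illustrative):

    permissions_validity_in_ms: 10000        # how long cached permissions remain valid
    permissions_update_interval_in_ms: 2000  # must be non-zero when the cache is enabled
    permissions_cache_max_entries: 1000      # must be non-zero when the cache is enabled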
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499957053-31792-1-git-send-email-vladz@scylladb.com>
It's been linked with various performance issues, either by causing
them or by making them worse. One example is #1634; recently I also
investigated continuous performance degradation that was linked to
on-idle defrag activity.
Until we can figure out how to reduce its impact, we should disable it.
Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
"This series enables cache to keep partial partitions.
Reads no longer have to read whole partition from sstables
in order to cache the result.
The 10MB threshold for partition size in cache is lifted.
Known issues:
- There is no partial eviction yet, whole partitions are still evicted,
and partition snapshots held by active reads are not evictable at all
- Information about range continuity is not recorded if that
would require inserting a dummy entry, or if the previous entry
doesn't belong to the latest snapshot
- Cache update after memtable flush happening concurrently with reads
may inhibit those reads' ability to populate the cache (new issue)
- Cache update from flushed memtables has partition granularity,
so it may cause latency problems with large partitions
- Schema is still tracked per-partition, so after schema changes
reads may induce high latency due to whole partition needing
to be converted atomically
- Range tombstones are repeated in the stream for every range between
cache entries they cover (new issue)
- Populating scans for both small and large partitions (perf_fast_forward)
experienced a 40% reduction of throughput, CPU bound
How was this tested:
- test.py --mode release
- row_cache_stress_test -c1 -m1G
- perf_fast_forward, passes except for the test case checking range continuity population
which would require inserting a dummy entry (mentioned above)
- perf_simple_query (-c1 -m1G --duration 32):
before: 90k [ops/s] stdev: 4k [ops/s]
after: 94k [ops/s] stdev: 2k [ops/s]"
* tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits)
tests: row_cache: Add test_tombstone_merging_in_partial_partition test case
tests: Introduce row_cache_stress_test
utils: Add helpers for dealing with nonwrapping_range<int>
tests: simple_schema: Allow passing the tombstone to make_range_tombstone()
tests: simple_schema: Accept value by reference
tests: simple_schema: Make add_row() accept optional timestamp
tests: simple_schema: Make new_timestamp() public
tests: simple_schema: Introduce make_ckeys()
tests: simple_schema: Introduce get_value(const clustered_row&) helper
tests: simple_schema: Fix comment
tests: simple_schema: Add missing include
row_cache: Introduce evict()
tests: Add cache_streamed_mutation_test
tests: mutation_assertions: Allow expecting fragments
mutation_fragment: Implement equality check
tests: row_cache: Add test for population of random partitions
tests: row_cache: Add test for partition tombstone population
tests: row_cache: Test reading randomly populated partition
tests: row_cache: Add test_single_partition_update()
tests: row_cache: Add test_scan_with_partial_partitions
...