"
After data segregation feature, anything that cause out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
future. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.
Fixes#6928.
"
* 'fix_twcs_agressiveness_after_data_segregation_v2' of github.com:raphaelsc/scylla:
compaction/twcs: improve further debug messages
compaction/twcs: Improve debug log which shows all windows
test: Check that TWCS properly performs size-tiered compaction on past windows
compaction/twcs: Make task estimation take into account the size-tiered behavior
compaction/stcs: Export static function that estimates pending tasks
compaction/stcs: Make get_buckets() static
compact/twcs: Perform size-tiered compaction on past time windows
compaction/twcs: Make strategy easier to extend by removing duplicated knowledge
compaction/twcs: Make newest_bucket() non-static
compaction/twcs: Move TWCS implementation into source file
Contains patch from Rafael to fix up includes.
* seastar c872c3408c...7f7cf0f232 (9):
> future: Consider result_unavailable invalid in future_state_base::ignore()
> future: Consider result_unavailable invalid in future_state_base::valid()
> Merge "future-util: split header" from Benny
> docs: corrected some text and code-examples in streaming-rpc docs
> future: Reduce nesting in future::then
> demos: coroutines: include std-compat.hh
> sstring: mark str() and methods using it as noexcept
> tls: Add an assert
> future: fix coroutine compilation
The current log prints one log entry for each window, it doesn't print
the # of SSTs in the bucket, and the now information is copied across
all the window entries.
previously, it looked like this:
[shard 0] compaction - Key 1597331160000000, now 1597331160000000
[shard 0] compaction - Key 1597331100000000, now 1597331160000000
[shard 0] compaction - Key 1597331040000000, now 1597331160000000
[shard 0] compaction - Key 1597330980000000, now 1597331160000000
this made it harder to group all windows which reflect the state of
the strategy in a given time.
now, it looks like as follow:
[shard 0] compaction - time_window_compaction_strategy::newest_bucket:
now 1597331160000000
buckets = {
key=1597331160000000, size=1
key=1597331100000000, size=2
key=1597331040000000, size=1
key=1597330980000000, size=1
}
Also the level of this log is changed from debug to trace, given that
now it's compressed and only printed once.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The task estimation was not taking into account that TWCS does size-tiered
on the the windows, and it only added 1 to the estimation when there
could be more tasks than that depending on the amount of SSTables in
all the existing size tiers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be useful for allowing other compaction strategies that use
STCS to properly estimate the pending tasks.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
STCS will export a static function to estimate pending tasks, and
it relies on get_buckets() being static too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
One of the most common code I write when investigating coredumps is
finding all objects of a certain type and creating a human readable
report of certain properties of these objects. This usually involves
retrieving all objects with a vptr with `find_vptrs()` and matching
their type to some pattern. I found myself writing this boilerplate over
and over again, so in this patch I introduce a convenience method to
avoid repeating it in the future.
Message-Id: <20200818145247.2358116-1-bdenes@scylladb.com>
scylla_fiber._walk() has an early return condition on the passed-in
pointer actually being a task pointer. The problem is that the actual
return statement was missing and only an error was logged. This resulted
in execution continuing and further weird errors being printed due to
the code not knowing how to handle the bad pointer.
Message-Id: <20200818144902.2357289-1-bdenes@scylladb.com>
thread_wake_task was moved into an anonymous namespace, add this new
fully qualified name to the task name white-list. Leave the old name for
backward compatibility.
While at it, also add `seastar::thread_context` which is also a task
object, for better seastar thread support.
Message-Id: <20200818142206.2354921-1-bdenes@scylladb.com>
Accessing the element of a tuple from the gdb command line is a
nightmare, add support to collection_element() retrieving one of its
elements to make this easier.
Message-Id: <20200818141123.2351892-1-bdenes@scylladb.com>
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keep the reference.
Fixes: #7060
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
"
operator_type is awkward because it's not copyable or assignable. Replace it with a new enum class.
Tests: unit(dev)
"
* dekimir-operator-type:
cql3: Drop operator_type entirely
cql3: Drop operator_type from the parser
cql3/expr: Replace operator_type with an enum
Replace operator_type with the nicer-behaved oper_t in CQL parser and,
consequently, in the relation hierarchy and column_condition.
After this, no references to operator_type remain in live code.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
operator_type is awkward because it's not copyable or assignable.
Replace it in expression representation with a new enum class, oper_t.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
As suggested by Avi, let's move the tarballs from
"build/dist/<mode>/tar" to "build/<mode>/dist/tar" to retain the
symmetry of different build modes, and make the tarballs easier to
discover. While at it, let's document the new tarball locations.
Message-Id: <20200818100427.1876968-1-penberg@scylladb.com>
fea83f6 introduced a race between processing (and hence removing)
sstables from `_sstables_with_tables` and registering new ones. This
manifested in sstables that were added concurrently with processing a
batch for the same sstables being dropped and the semaphore units
associated with them not returned. This resulted in repairs being
blocked indefinitely as the units of the semaphore were effectively
leaked.
This patch fixes this by moving the contents of `_sstables_with_tables`
to a local variable before starting the processing. A unit test
reproducing the problem is also added.
Fixes: #6892
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200817160913.2296444-1-bdenes@scylladb.com>
Except scylla-python3, each scylla package has its own git repository, same package script filename, same build directory structure.
To put python3 thing on scylla repo, we created 'python3' directory on multiple locations, made '-python3' suffixed files, dig deeper build directory not to conflict scylla-server package build.
We should move all scylla-python3 related files to new repository, scylla-python3.
To keep compatibility with current Jenkins script, provide packages on
build/ directory for now.
Fixes#6751
After data segregation feature, anything that cause out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
future. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.
Fixes#6928.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
TWCS is hard to extend because its knowledge on what to do with a window
bucket is duplicated in two functions. Let's remove this duplication by
placing the knowledge into a single function.
This is important for the coming change that will perform size-tiered
instead of major on windows that are no longer active.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Previously system.paxos TTL was set as max(3h, gc_grace_seconds).
Introduce new per-table option named `paxos_grace_seconds` to set
the amount of seconds which are used to TTL data in paxos tables
when using LWT queries against the base table.
Default value is equal to `DEFAULT_GC_GRACE_SECONDS`,
which is 10 days.
This change allows to easily test various issues related to paxos TTL.
Fixes#6284
Tests: unit (dev, debug)
Co-authored-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200816223935.919081-1-pa.solodovnikov@scylladb.com>
While Alternator doesn't yet support creating a table with a different
"server-side encryption" (a.k.a. encryption-at-rest) parameters, the
SSESpecification option with Enabled=false should still be allowed, as
it is just the default, and means exactly the same as would a missing
SSESpecification.
This patch also adds a test for this case, which failed on Alternator
before this patch.
Fixes#7031.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812205853.173846-1-nyh@scylladb.com>
This patch adds a test that attributes which serve as a key for a
secondary index still appear in the OldImage in an Alternator Stream.
This is a special case, because although usually Alternator attributes
are saved as map elements, not stand-alone Scylla columns, in the special
case of secondary-index keys they *are* saved as actual Scylla columns
in the base table. And it turns out we produce wrong results in this case:
CDC's "preimage" does not currently include these columns if they didn't
change, while DynamoDB requires that all columns, not just the changed ones,
appear in OldImage. So the test added in this patch xfails on Alternator
(and as usual, passes on DynamoDB).
Refs #7030.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812144656.148315-1-nyh@scylladb.com>
The "user.home" system property in JVM does not use the "HOME"
environment variable. This breaks Ant and Maven builds with Podman,
which attempts to look up the local Maven repository in "/root/.m2" when
building tools, for example:
build.xml:757: /root/.m2/repository does not exist.
To fix the issue, let's bind-mount an /etc/passwd file, which contains
host username for UID 0, which ensures that Podman container $USER and $HOME
are the same as on the host.
Message-Id: <20200817085720.1756807-1-penberg@scylladb.com>
"
gossip: Get rid of seed concept
The concept of seed and the different behaviour between seed nodes and
non seed nodes generate a lot of confusion, complication and error for
users. For example, how to add a seed node into into a cluster, how to
promote a non seed node to a seed node, how to choose seeds node in
multiple DC setup, edit config files for seeds, why seed node does not
bootstrap.
If we remove the concept of seed, it will get much easier for users.
After this series, seed config option is only used once when a new node
joins a cluster.
Major changes:
Seed nodes are only used as the initial contact point nodes.
Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.
The unsafe auto_bootstrap option is now ignored.
Gossip shadow round now talks to all nodes instead of just seed nodes.
Refs: #6845
Tests: update_cluster_layout_tests.py + manual test
"
* 'gossip_no_seed_v2' of github.com:asias/scylla:
gossip: Get rid of seed concept
gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
gossip: Add do_apply_state_locally helper
gossip: Do not talk to seed node explicitly
gossip: Talk to live endpoints in a shuffled fashion
The concept of seed and the different behaviour between seed nodes and
non seed nodes generate a lot of confusion, complication and error for
users. For example, how to add a seed node into into a cluster, how to
promote a non seed node to a seed node, how to choose seeds node in
multiple DC setup, edit config files for seeds, why seed node does not
bootstrap.
If we remove the concept of seed, it will get much easier for users.
After this series, seed config option is only used once when a new node
joins a cluster.
Major changes:
- Seed nodes are only used as the initial contact point nodes.
- Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.
- The unsafe auto_bootstrap option is now ignored.
- Gossip shadow round now attempts to talk to all nodes instead of just seed nodes.
Manual test:
- bootstrap n1, n2, n3 (n1 and n2 are listed as seed, check only n1
will skip bootstrap, n2 and n3 will bootstrap)
- shtudown n1, n2, n3
- start n2 (check non seed node can boot)
- start n1 (check n1 talks to both n2 and n3)
- start n3 (check n3 talks to both n1 and n3)
Upgrade/Downgrade test:
- Initialize cluster
Start 3 node with n1, n2, n3 using old version
n1 and n2 are listed as seed
- Test upgrade starting from seed nodes
Rolling restart n1 using new version
Rolling restart n2 using new version
Rolling restart n3 using new version
- Test downgrade to old version
Rolling restart n1 using old version
Rolling restart n2 using old version
Rolling restart n3 using old version
- Test upgrade starting from non seed nodes
Rolling restart n3 using new version
Rolling restart n2 using new version
Rolling restart n1 using new version
Notes on upgrade procedure:
There is no special procedure needed to upgrade to Scylla without seed
concept. Rolling upgrade node one by one is good enough.
Fixes: #6845
Tests: ./test.py + update_cluster_layout_tests.py + manual test
Since older binutils on some distribution does not able to handle
compressed debuginfo generated on Fedora, we need to disable it.
However, debian packager force debuginfo compression since debian/compat = 9,
we have to uncompress them after compressed automatically.
Fixes#6982
"
This patch series changes the build system to build all tarballs to
build/dist/<mode>/tar directory. For example, running:
./tools/toolchain/dbuild ./configure.py --mode=dev && ./tools/toolchain/dbuild ninja-build dist-tar
produces the following tarballs in build/dist/dev/tar:
$ ls -1 build/dist/dev/tar/
scylla-jmx-package.tar.gz
scylla-package.tar.gz
scylla-python3-package.tar.gz
scylla-tools-package.tar.gz
This makes it easy to locate release tarballs for humans and scripts. To
preserve backward compatibility, the tarballs are also retained in their
original locations. Once release engineering infrastructure has been
adjusted to use the new locations, we can drop the duplicate copies.
"
* 'penberg/build-dist-tar/v1' of github.com:penberg/scylla:
configure.py: Copy tarballs to build/dist/<mode>/tar directory
configure.py: Add "dist-<component>-tar" targets
reloc/python3: Add "--builddir" to build_deb.sh
configure.py: Use copy-on-write copies when possible
"
Contains several fixes which improve debuggability in situations where
too large column ids are passed to column definition loop methods.
"
* 'schema-range-check-fix' of github.com:tgrabiec/scylla:
schema: Add table name and schema version to error messages
schema: Use on_internal_error() for range check errors
schema: Fix off-by-one in column range check
schema: Make range checks for regular and static columns the same as for clustering columns
select() is too generic for the method that retrieve sstable runs,
and it has a completely different meaning that the former select
method used to select sstables based on token range.
let's give it a more descriptive name.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811193401.22749-1-raphaelsc@scylladb.com>
regression caused by 55cf219c97.
remove_by_toc_name() must work both for a sealed sstable with toc,
and also a partial sstable with tmp toc.
so dirname() should be called conditionally on the condition of
the sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200813160612.101117-1-raphaelsc@scylladb.com>
LCS can have its overlapping invariant broken after operations that can
proceed in parallel to regular compaction like cleanup. That's because
there could be two compactions in parallel placing data in overlapping
token ranges of a given level > 0.
After reshape, the whole table will be rewritten, on restart, if a
given level has more than (fan_out*2)=20 overlaps.
That may sound like enough, but that's not taking into account the
exponential growth in # of SSTables per level, so 20 overlaps may
sound like a lot for level 2 which can afford 100 sstables, but it's
only 2% of level 3, and 0.2% of level 4. So let's change the
overlapping tolerance from the constant of fan_out*2 to 10% of level
limit on # of SSTables, or fan_out, whichever is higher.
Refs #6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200810154510.32794-1-raphaelsc@scylladb.com>
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but manager still need to filter out the
compacting SSTables from the returned set. So it's being renamed.
When the same SSTable is compacted in parallel, the strategy invariant
can be broken like overlapping being introduced in LCS, and also
some deletion failures as more than one compaction process would try
to delete the same files.
Let's fix scrub, cleanup and ugprade by calling the manager function
which gets the correct candidates for compaction.
Fixes#6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
Before this patch, modifying cdc/cdc_options.hh required recompiling 264
source files. This is because this header file was included by a couple
other header files - most notably schema.hh, where a forward declaration
would have been enough. Only the handful of source files which really
need to access the CDC options should include "cdc/cdc_options.hh" directly.
After this patch, modifying cdc/cdc_options.hh requires only 6 source files
to be recompiled.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200813070631.180192-1-nyh@scylladb.com>