Fixes#4640
Iterating extensions in commitlog.cc should mimic that in sstables.cc,
i.e. a simple future-chain. Should also use same order for read and
write open, as we should preserve transformation stack order.
Message-Id: <20190702150028.18042-1-calle@scylladb.com>
On recent changes install.sh mistakenly copies dist/common/scripts to
/opt/scylladb/scripts/scripts, it should be /opt/scylladb/scripts.
Same on /opt/scylladb/scyllatop as well.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190702120030.13729-1-syuu@scylladb.com>
If someone opens a connection to port 9042 and sends some random bytes,
there is a 1 in 64 probability we'll recognize it as a valid frame
(since we only check the version byte, allowing versions 1-4) and we'll
try to read frame.length bytes for the body. If this value is very large,
we'll run out of memory very quickly.
Fix this by checking for reasonable body size (100kB). The initial message
must be a STARTUP, whose body is a [string map] of options, of which just
three are recognized. 100kB is plenty for future expansion.
Note that this does not replace true security on listening ports and
only serves to protect against mistakes, not attacks. An attacker can
easily exhaust server memory by opening many connections and trickle-feeding
them small amounts of data so they appear alive.
We can't use the config item native_transport_max_frame_size_in_mb,
because that can be legitimately large (and the default is atrocious,
256MB).
Fixes#4366.
This patchset allows changing the configuration at runtime, The user
triggers this by editing the configuration file normally, then
signalling the database with SIGHUP (as is traditional).
The implementation is somewhat complicated due the need to store
non-atomic mutable state per-shard and to synchronize the values in
all shards. This is somewhat similar to Seastar's sharded<>, but that
cannot be used since the configuration is read before Seastar is
initialized (due to the need to read command-line options).
Tests: unit (dev, debug), manual test with extra prints (dev)
Ref #2689Fixes#2517.
* seastar b629d5ef7a...a5b9f77d52 (6):
> perftune.py: add comment explaining why we don't log errors when binding NVMe IRQs for all but i3.nonmetal machines
> sharded: do a two phase shutdown for sharded services
> chunked_fifo: add iterator
> perftune.py: fix the i3 metal detection pattern
> core/memory: remove translation api
> reactor: file_type: offer option to not follow symbolic links
Change the type from bool to updateable_value<bool> throughout the dependency
chain and mark it as live updateable.
In theory we should also observe the value and trigger compaction if it changes,
but I don't think it is worthwhile.
Once we start using updateable_value<>, we must make it refer
to the updateable_value_source<> on the same shard, and to do
that we need to call broadcast_to_all_shards() first (this
creates the per-shard copy).
With dynamically updateable configuration, tracking the source of a value
is more important, since we'll accept or reject updates depending on the source.
Fix the source of client_encryption_options, which we RMW, by preserving the original
source.
Replace the per-shard value we store with an updateable_value_source, which
allows updating it dynamically and allows users to track changes.
The broadcast_to_all_shards() function is augmented to apply modifications
when called on a live system.
Since some of our values are not atomic (strings) and the administrative
information needed to track references to values is also not atomic, we will
need to store them per-shard. To do that we add a vector of per-shard data
to config_file, where each element is itself a vector of configuration items.
Since we need to operate generically on items (copying them from shard to shard)
we store them in a type-erased form.
Only mutable state is stored per-shard.
The updateable_value and updateable_value_source classes allow broadcasting
configuration changes across the application. The updateable_value_source class
represents a value that can be updated, and updateable_value tracks its source
and reflects changes. A typical use replaces "uint64_t config_item" with
"updateable_value<uint64_t> config_item", and from now on changes to the source
will be reflected in config_item. For more complicated uses, which must run some
callback when configuration changes, you can also call
config_item.observe(callback) to be actively notified of changes.
Dynamically updateable configuration requires checking whether configuration items
changed or not, so we can skip firing notifiers for the common case where nothing
changed.
This patch adds a comparison operator for seed_provider_type, which was missing it.
Currently, we allow adjusting configuration via
cfg.whatever() = 5;
by returning a mutable reference from cfg.whatever(). Soon, however, this operation
will have side effects (updating all references to the config item, and triggering
notifiers). While this can be done with a proxy, it is too tricky.
Switch to an ordinary setter interface:
cfg.whatever.set(5);
Because boost::program_options no longer gets a reference to the value to be written
to, we have to move the update to a notifier, and the value_ex() function has to
be adjusted to infer whether it was called with a vector type after it is
called, not before.
config_file and db::config are soon not going to be copyable. The reason is that
in order to support live updating, we'll need per-shard copies of each value,
and per-shard tracking of references to values. While these can be copied, it
will be an asycnronous operation and thus cannot be done from a copy constructor.
So to prepare for these changes, replace all copies of db::config by references
and delete config_file's copy constructor.
Some existing references had to be made const in order to adapt the const-ness
of db::config now being propagated (rather than being terminated by a non-const
copy).
Copying the config object breaks the link between the original and the copied
object, so updates to config items will not be visible. To allow updates, don't
copy any more, and instead keep a pointer.
The pointer won't work will once config is updateable, since the same object is
shared across multiple shard, but that can be addressed later.
Currently, database::_cfg is a copy of the global configuration. But this means
that we have multiple master copies of the configuration, which makes updating
the configuration harder. In order to eliminate the copy we have to eliminate the
database default constructor, which creates a config object, so that all
remaining constructors can receive config by reference and retain that reference.
We want to eliminate the default database constructor (to be explained in
the next patch), so eliminate its only use in gossip_test, using the regular
constructor instead.
With this unused descriptors and objects should always be poisoned.
* https://github.com/espindola/scylla/ align-descriptors-so-that-they-are-poisoned-v4:
Convert macros to inline functions
More precise poisoning in logalloc
On current scylla.spec, shell glob pattern "scylla_*setup" does not correctly
expanded, it mistakenly created a symlink named "/usr/sbin/scylla_*setup".
We need to expand them, need to create symlinks for each setup scripts.
Fixes#4605
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190627053530.10406-2-syuu@scylladb.com>
Since we don't install .pyc files on our package, python3 will generate .pyc
file when we launch setup script first time.
Then we will have unmanaged files under script directory, it will remain when
Scylla package upgraded / removed.
We need to compile *.py when we generate relocatable package, add compiled .pyc
files on .rpm/.deb packages.
Fixes#4612
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190627053530.10406-1-syuu@scylladb.com>
This change aligns descriptors and values to 8 bytes so that poisoning
a descriptor or value doesn't interfere with other descriptors and
values.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
When writing streamed data into sstables, while using time window
compaction strategy, we have to emit a new sstable for each time window.
Otherwise we can end up with sstables, mixing data from wildly different
windows, ruining the compaction strategy's ability to drop entire
sstables when all data within is expired. This gets worse as these mixed
sstables get compacted together with sstables that used to contain a
single time window.
This series provides a solution to this by segregating the data by its
atom's the time-windows. This is done on the new RPC streaming and the
new row-level, repair, memtable-flush and compaction, ensuring that the
segregation requirement is respected at all times.
Fixes: #2687
"
* 'segregate-data-into-sstables-by-time-window-streaming/v2.1' of ssh://github.com/denesb/scylla:
streaming,repair: restore indentation
repair: pass the data stream through the compaction strategy's interposer consumer
streaming: pass the data stream through the compaction strategy's interposer consumer
TWCS: implement add_interposer_consumer()
compaction_strategy: add add_interposer_consumer()
Add mutation_source_metadata
tests: add unit test for timestamp_based_splitting_writer
Add timestamp_based_splitting_writer
Introduce mutation_writer namespace
Most of our tests use overly simplistic schemas (`simple_schema`) or
very specialized ones that focus on exercising a specific area of the
tested code. This is fine in most places as not all code is schema
dependent, however practice has showed that there can be nasty bugs
hiding in dark corners that only appear with a schema that has a
specific combination of types.
This series introduces `tests::random_schema` a utility class for
generating random schemas and random data for them. An important goal is
to make using random schemas in tests as simple and convenient as
possible, therefore fostering the appearance of tests using random
schemas.
Random schema was developed to help testing code I'm currently working
on, which segregates data by time-windows. As I wasn't confident in my
ability to think of every possible combination of types that can break
my code I came up with random-schema to help me finding these corner
cases. So far I consider it a success, it already found bugs in my code
that I'm not sure I would have found if I had relied on specific
schemas. It also found bugs in unrelated areas of the code which proves
my point in the first paragraph.
* https://github.com/denesb/scylla.git random_schema/v5:
tests/data_model: approximate to the modeled data structures
data_value: add ascii constructor
tests/random-utils.hh: add stepped_int_distribution
tests/random-utils.hh: get_int() add overloads that accept external
rand engine
tests/random-utils.hh: add get_real()
tests: introduce random_schema
Exploit the interposer customization point to inject a consumer that will
segregate the mutation stream based on the contained atoms' timestamps,
allowing the requirements of TWCS to be mantained every time sstables
are written to disk.
For the implementation, `timestamp_based_splitting_writer` is used,
with a classifier that maps timestamps to windows.
This is a band-aid patch that is supposed to fix the immediate problem
of large collections causing large allocations. The proper fix is to
use IMR but that will take time. In the meanwhile alleviate the
pressure on the memory allocator by using a chunked storage collection
(utils::chunked_vector) instead of std::vector. In the linked issue
seastar::chunked_fifo was also proposed as the container to use,
however chunked fifo is not traversable in reverse which disqualifies
it from this role.
Refs: #3602
This will be the customization point for compaction strategies, used to
inject a specific interposer consumer that can manipulate the fragment
stream so that it satisfies the requirements of the compaction strategy.
For now the only candidate for injecting such an interposer is
time-window compaction strategy, which needs to write sstables that
only contains atoms belonging to the same time-window. By default no
interposer is injected.
Also add an accompanying customization point
`adjust_partition_estimate()` which returns the estimated per-sstable
partition-estimate that the interposer will produce.
This struct contains metadata regarding to a mutation_source. Currently
it contains the min and max timestamp. This will be used later by
compaction strategies to determine whether a given mutation stream has
to be split or not.
This writer implements the core logic of time-window based data
segregation. It splits the fragment stream provided by a reader, such
that each atom (cell) in the stream will be written into a consumer
based on the time-window its timestamp belongs to. The end result is
that each consumer will only see fragments, whoose atoms all have
timestamps belonging to the same time-window.
When a mutation fragment has atoms belonging to different time-windows,
it is split into as many fragments as needed so each has only atoms
that belong to the same time-window.
Currently there is a single mutation_writer: `multishard_writer`,
however in the next path we are going to add another one. This is the
right moment to move these into a common namespace (and folder), we
have way too much stuff scattered already in the top-level namespace
(and folder).
Also rename `tests/multishard_writer_test.cc` to
`tests/mutation_writer_test.cc`, this test-suite will be the home of all
the different mutation writer's unit test cases.
"
Currently, parser and the consumer save its state and return the
control to the caller, which then figures out that it needs to enter a
new partition, and that it doesn't need to skip. We do it twice, after
row end, and after row start. All this work could be avoided if the
consumer installed by the reader adjusted its state and pushed the
fragments on the spot. This patch achieves just that.
This results in less CPU overhead.
The ka/la reader is left still stopping after row end.
Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe):
perf_fast_forward -c1 -m1G --run-tests=small-partition-skips:
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6%
Tests: unit (dev)
"
* 'sstable-optimize-partition-scans' of https://github.com/tgrabiec/scylla:
sstable: mc: reader: Do not stop parsing across partitions
sstables: reader: Move some parser state from sstable_mutation_reader to mp_row_consumer_reader
sstables: reader: Simplify _single_partition_read checking
sstables: reader: Update stats from on_next_partition()
sstables: mutation_fragment_filter: Drop unnecessary calls to _walker.out_of_range()
sstables: ka/la: reader make push_ready_fragments() safe to call many times
sstables: mc: reader: Move out-of-range check out of push_ready_fragments()
sstables: reader: Return void from push_ready_fragments()
sstables: reader: Rename on_end_of_stream() to on_out_of_clustering_range()
sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end
O_DSYNC causes commitlog to pre-allocate each commitlog segment by writing
zeroes into it. In normal operation, this is amortized over the many
times the segment will be reused. In tests, this is wasteful, but under
the default workstation configuration with /tmp using tmpfs, no actual
writes occur.
However on a non-default configuration with /tmp mounted on a real disk,
this causes huge disk I/O and eventually a crash (observed in
schema_change_test). The crash is likely only caused indirectly, as the
extra I/O (exacerbated by many tests running in parallel) xcauses timeouts.
I reproduced this problem by running 15 copies of schema_change_test in
parallel with /tmp mounted on a real filesystem. Without this change, I
usually observe one or two of the copies crashing, with the change they
complete (and much more quickly, too).
The tracker object was a static object in repair.cc. At the time we initialize
it, we do not know the smp::count, so we have to initialize the _repairs
object when it is used on the fly.
void init_repair_info() {
if (_repairs.size() != smp::count) {
_repairs.resize(smp::count);
}
}
This introduces a race if init_repair_info is called on different
thread(shard).
To fix, put the tracker object inside the newly introduced
repair_service object which is created in main.cc.
Fixes#4593
Message-Id: <b1adef1c0528354d2f92f8aaddc3c4bee5dc8a0a.1561537841.git.asias@scylladb.com>
This is quick fix to the immediate problem of large collections causing
large allocations, triggering stalls or OOM. The proper fix is to
use IMR for storing the cells, but that is a complex change that will
require time, so let's not stall/OOM in the meanwhile.
This series makes sure new schema is propagated to repair master and
follower nodes before repair.
Fixes#4575
* dev.git asias/repair_pull_schema_v2:
migration_manager: Add sync_schema
repair: Sync schema from follower nodes before repair
"
If the database supports infinite bound range deletions,
CQL layer will no longer throw an error indicating that both ranges
need to be specified.
Fixes#432
Update test_range_deletion_scenarios unit test accordingly.
"
* 'cql3-lift-infinite-bound-check' of https://github.com/bhalevy/scylla:
cql3: lift infinite bound check if it's supported
service: enable infinite bound range deletions with mc
database: add flag for infinite bound range deletions
Piotr Sarna says:
Fixes#4540
This series adds proper handling of aggregation for paged indexed queries.
Before this series returned results were presented to the user in per-page
partial manner, while they should have been returned as a single aggregated
value.
Tests: unit(dev)
Piotr Sarna (8):
cql3: split execute_base_query implementation
cql3: enable explicit copying of query_options
cql3: add a query options constructor with explicit page size
cql3: add proper aggregation to paged indexing
cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
tests: add query_options to cquery_nofail
tests: add indexing + paging + aggregation test case
tests: add indexing+paging test case for clustering keys
* seastar ded50bd8a4...b629d5ef7a (9):
> sharded: no_sharded_instance_exception: fix grammar
> core,net: output_stream: remove redundant std::move()
> perftune: make sure that ethtool -K has a chance of succeeding
> net/dpdk: upgrade to dpdk-19.05
> perftune.py: Fix a few more places where we use deprecated pyudev.Device ones
> reactor: provide an uptime function
> rpc: add sink::flush() to streaming api
> Use a table to document the various build modes
> foreign_ptr: Fix compilation error due to unused variable