Store the failure_detector object inside the gossiper object.
- The global sharded<failure_detector> object is gone.
- No need to initialize sharded<failure_detector> manually, which
  simplifies the code in tests/cql_test_env.cc and init.cc.
"
This series adds support for local indexing, i.e. when the index table
resides on the same partition as the base data.
It addresses the performance issue of an indexed query
that also specifies a partition key - the index will be queried
locally.
"
* 'add_local_indexing_11' of https://github.com/psarna/scylla: (30 commits)
tests: add cases for local index prefix optimization
tests: add create/drop local index test case
tests: add non-standard names cases to local index tests
tests: add multi pk case for local index tests
tests: add test for malformed local index definitions
tests: add local index paging test
tests: add local indexing test
cql3: add CREATE INDEX syntax for local indexes
cql3: use serialization function to create index target string
index: add serialization function for index targets
index: use proper local index target when adding index
index: add parsing target column name from local index targets
db: add checking for local index in schema tables
index: add checking if serialized target implies local index
index: enable parsing multi-key targets
index: move target parser code to .cc file
json: add non-throwing overload for to_json_value
cql3: add checking for local indexes in has_supporting_index()
cql3: move finding index restrictions to prepare stage
cql3: add picking an index by score
...
db::schema_tables::ALL and db::schema_tables::all_tables() are both supposed
to list the same schema tables - the former is the list of their names, and
the latter is the list of their schemas. This code duplication makes it easy
to forget to update one of them, and indeed recently the new
"view_virtual_columns" was added to all_tables() but not to ALL.
What this patch does is to make ALL a function instead of a constant vector.
The newly named all_table_names() function uses all_tables(), so the list
of schema tables only appears once.
So that nobody worries about the performance impact, all_table_names()
caches the list in a per-thread vector that is only prepared once per thread.
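A minimal sketch of the caching idea; all_tables() and the cf_name()
accessor are assumed from the surrounding code, and the exact signature
is illustrative:

    #include <seastar/core/sstring.hh>
    #include <vector>

    using seastar::sstring;

    // Build the name list once per thread from all_tables(), so the
    // list of schema tables exists in only one place.
    const std::vector<sstring>& all_table_names() {
        static thread_local std::vector<sstring> names = [] {
            std::vector<sstring> v;
            for (auto& s : all_tables()) {
                v.push_back(s->cf_name());
            }
            return v;
        }();
        return names;
    }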
Because after this patch all_table_names() includes the "view_virtual_columns"
table that was previously missing, this patch also fixes #4339, which was about
virtual columns in materialized views not being propagated to other nodes.
Unfortunately, to test the fix for #4339 we need a test with multiple
nodes, so we cannot test it here in a unit test, and will instead use
the dtest framework, in a separate patch.
Fixes #4339
Branches: 3.0
Tests: all unit tests (release and debug mode), new dtest for #4339. The unit test mutation_reader_test failed in debug mode but not in release mode, but this probably has nothing to do with this patch (?).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190320063437.32731-1-nyh@scylladb.com>
This is analogous to the system.large_rows table, but holds individual
cells, so it also needs the column name.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With these changes, the futures returned by large_data_handler will not
normally wait for entries to be written to system.large_rows or
system.large_partitions.
We use a semaphore to bound how behind system.large_* table updates
can get.
This should avoid delaying sstable writes in the common case, which
is more relevant once we warn of large cells, since the default
threshold will be just 1MB.
Note that there is no ordering between the various maybe_record_* and
maybe_delete_large_data_entries requests. This means that we can end
up with a stale entry that is only removed once the TTL expires.
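A minimal sketch of the bounding idea using Seastar's semaphore; the
class name, the limit, and the exact shape of the API are illustrative,
not the actual large_data_handler code:

    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>

    class background_recorder {
        // Bounds how many system.large_* updates may be in flight at once.
        seastar::semaphore _sem{16}; // limit chosen only for illustration
    public:
        template <typename Func>
        seastar::future<> record(Func update) {
            // The caller only waits for a semaphore unit, not for the
            // update itself; the unit is released when the background
            // write finishes, bounding how far behind the tables can get.
            return seastar::get_units(_sem, 1).then(
                [update = std::move(update)](auto units) mutable {
                    (void)update().finally([units = std::move(units)] {});
                });
        }
    };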
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These will use a member semaphore variable in a follow-up patch, so they
cannot be const.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This should have been changed in the patch
"db: stop the commit log after the tables during shutdown",
but unfortunately I missed it then.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We had almost identical error handling for large_partitions and
large_rows. Refactor in preparation for large_cells.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This renames it to record_large_partitions, which matches
record_large_rows. It also changes the signature to be closer to
record_large_rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The code for deleting entries from system.large_partitions was almost
a duplicate of the code for deleting entries from system.large_rows.
This patch unifies the two, which also improves the error message when
we fail to delete entries from system.large_partitions.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This allows for system.large_partitions to be updated if a large
partition is found while writing the last sstables.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
This fixes #3988.
We already have a system.large_partitions table, but only a warning for
large rows. These patches close the gap by also recording large rows
into a new system.large_rows table.
"
* 'espindola/large-row-add-table-v6' of https://github.com/espindola/scylla:
Add a testcase for large rows
Populate system.large_rows.
Create a system.large_rows table
Extract a key_to_str helper
Don't call record_large_rows if stopped
Add a delete_large_rows_entries method to large_data_handler
db::large_data_handler::(maybe_)?record_large_rows: Return future<> instead of void
Rename maybe_delete_large_partitions_entry
Rename log_large_row to record_large_rows
Rename maybe_log_large_row to maybe_record_large_rows
"
This series contains minor improvements to commitlog log messages that
helped in investigating #4231, but are not specific to that bug.
"
* tag 'improve-commitlog-logs/v1' of https://github.com/pdziepak/scylla:
commitlog: use consistent chunk offsets in logs
commitlog: provide more information in logs
commitlog: remove unnecessary comment
Logs in the commitlog writer use the file offset of the chunk header to
identify chunks. However, the replayer was using the offset just after
the header for the same purpose, which caused unnecessary confusion by
suggesting that the replayer was reading at the wrong position.
This patch changes the replayer so that it reports chunk header offsets.
This commit adds some more information to the logs, motivated by the
experience of investigating #4231:
* size of each write
* position of each write
* log message for final write
Commitlog files contain multiple chunks. Each chunk starts as a single
(possibly fragmented) buffer. The size of that buffer in memory may be
larger than the size in the file.
cycle() was incorrectly using the in-memory size to write the whole
buffer to the file. That sometimes caused data corruption, since a
smaller on-file size was used to compute the offset of the next chunk
and there could be multiple chunk writes happening at the same time.
This patch solves the issue by ensuring that only the actual on-file
size of the chunk is written.
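A sketch of the fix; the names are illustrative, and real code must
also respect DMA alignment requirements:

    #include <seastar/core/file.hh>
    #include <seastar/core/future.hh>
    #include <cassert>
    #include <cstdint>

    // Write exactly the on-file size of the chunk, since that is what
    // the offset of the next chunk is computed from. The bug was passing
    // the (possibly larger) in-memory buffer size here instead.
    seastar::future<> write_chunk(seastar::file& f, uint64_t off,
                                  const char* buf, size_t on_file_size) {
        return f.dma_write(off, buf, on_file_size)
            .then([on_file_size](size_t written) {
                // A short write would be its own error to handle.
                assert(written == on_file_size);
            });
    }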
When looking for optimization paths, columns selected in a view
are checked against multiple conditions - unfortunately, virtual
columns were erroneously skipped from that check, which resulted
in ignoring their TTLs. That can lead to over-optimizing
and not including vital liveness info in view rows,
which can then result in rows disappearing too early.
It now records large rows when they are first written to an sstable
and removes them when the sstable is deleted.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This is analogous to the system.large_partitions table, but holds
individual rows, so it also needs the clustering key of the large
rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The implementations of large_data_handler should only be called if
the large_data_handler hasn't been stopped yet.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These functions will record into tables in a follow-up patch, so they
will need to return a future.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The shard-aware drivers can cause a huge number of connections to be created
when there are tens of thousands of clients. While normally the shard-aware
drivers are beneficial, in those cases they can consume too much memory.
Provide an option to disable shard awareness from the server (it is likely to
be easier to do this on the server than to reprovision those thousands of
clients).
Tests: manual test with wireshark.
Message-Id: <20190223173331.24424-1-avi@scylladb.com>
1. We would like to be able to call maybe_delete_large_partitions_entry
from the sstable destructor path in the future, so the sstable might go away
while the large data entries are being deleted.
2. We would like the caller to handle any exception on this path,
especially in the preparation part, before calling delete_large_partitions_entry().
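A hypothetical sketch of the resulting call pattern, based only on the
names in this message (the helper and signatures are assumptions):

    #include <seastar/core/future.hh>

    // Split the work so that (2) exceptions surface in the synchronous
    // preparation part, and (1) the background deletion no longer
    // references the sstable.
    seastar::future<> maybe_delete_large_partitions_entry(const sstable& sst) {
        auto keys = extract_keys(sst); // assumed helper; may throw, and
                                       // the caller sees that exception
        // Nothing below refers to 'sst' any more, so the sstable can be
        // destroyed while the deletion is still in flight.
        return delete_large_partitions_entry(std::move(keys));
    }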
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Can be useful in diagnosing problems with application of schema
mutations.
do_merge_schema() is called on every schema change on the local
node.
create_table_from_mutations() is called on schema merge when a table
was altered or created using mutations read from local schema tables
after applying the change, or when loading schema on boot.
Message-Id: <20190221093929.8929-2-tgrabiec@scylladb.com>
"
cryptopp's config.h has the following pragma:
#pragma GCC diagnostic ignored "-Wunused-function"
It is not wrapped in a push/pop. Because of that, including cryptopp
headers disables that warning on scylla code too.
This patch series introduces a single .cc file that has to include
cryptopp headers.
"
* 'avoid-cryptopp-v3' of https://github.com/espindola/scylla:
Avoid including cryptopp headers
Delete dead code
cryptopp's config.h has the following pragma:
#pragma GCC diagnostic ignored "-Wunused-function"
It is not wrapped in a push/pop. Because of that, including cryptopp
headers disables that warning on scylla code too.
The issue has been reported as
https://github.com/weidai11/cryptopp/issues/793
To work around it, this patch uses a pimpl to have a single .cc file
that has to include cryptopp headers.
While at it, it also reduces the differences and code duplication
between the md5 and sha1 hashers.
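A minimal pimpl sketch of the idea; the class and member names are
illustrative, not the actual Scylla hasher interface:

    // md5_hasher.hh - no cryptopp includes anywhere in this header.
    #include <memory>
    #include <cstddef>

    class md5_hasher {
        struct impl;                 // holds CryptoPP::Weak::MD5; defined
        std::unique_ptr<impl> _impl; // only inside md5_hasher.cc
    public:
        md5_hasher();
        ~md5_hasher();               // defined in the .cc, where impl is complete
        void update(const char* data, size_t len);
    };

    // md5_hasher.cc - the single .cc file that includes cryptopp headers,
    // so the unbalanced "-Wunused-function" pragma affects only this file.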
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
In some cases, generating view updates for columns that were not
selected in the CREATE VIEW statement is redundant - it is the case
when the update will not influence row liveness in any way.
Currently, these cases are optimized out (restated as a predicate
in the sketch below):
- the row marker is live and only unselected columns were updated;
- the row marker is not live and only unselected columns were updated,
  and in the process nothing was created or deleted and there was
  no TTL involved.
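A hedged restatement of the two cases as a predicate; all names are
illustrative rather than taken from the actual code:

    // Returns true when generating a view update may be skipped because
    // it cannot influence row liveness.
    bool can_skip_view_update(bool only_unselected_columns_updated,
                              bool row_marker_live,
                              bool created_or_deleted,
                              bool ttl_involved) {
        if (!only_unselected_columns_updated) {
            return false;               // selected columns changed
        }
        if (row_marker_live) {
            return true;                // first optimized case
        }
        // second case: marker not live, nothing created or deleted,
        // and no TTL involved
        return !created_or_deleted && !ttl_involved;
    }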
It's detrimental to keep querying the index manager every time to check
whether a view is backing a secondary index, so this value is cached
at construction time.
At the same time, this value is not simply passed to view_info
when it is created in the secondary index manager, in order to
decouple materialized view logic from secondary indexes as much as
possible (the sole existence of is_index() is bad enough).
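A minimal sketch of the caching, assuming only that the answer can be
computed once at construction; names are illustrative:

    #include <functional>

    class view_index_flag {
        const bool _is_index;  // asked of the index manager exactly once
    public:
        explicit view_index_flag(const std::function<bool()>& query)
            : _is_index(query()) {}
        bool is_index() const { return _is_index; }
    };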
Give the constant 1024*1024 introduced in an earlier commit a name,
"batch_memory_max", and move it from view.cc to view_builder.hh.
It now resides next to the pre-existing constant that controlled how
many rows were read in each build step, "batch_size".
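In view_builder.hh this amounts to something like the following sketch
(the exact declarations are assumptions; the values come from this
series):

    // How many base rows to read per build step...
    static constexpr size_t batch_size = 128;
    // ...and how much memory a batch may hold before we stop batching.
    static constexpr size_t batch_memory_max = 1024 * 1024; // 1MB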
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217100222.15673-1-nyh@scylladb.com>
The included testcase used to crash because during database::stop() we
would try to update system.large_partition.
There doesn't seem to be an order in which we can stop the existing
services in cql_test_env that avoids this.
This patch therefore adds another step when shutting down a database:
first stop updating system.large_partition.
This means that during shutdown any memtable flush, compaction or
sstable deletion will not be reflected in system.large_partition. This
is hopefully not too bad since the data in the table is TTLed.
This seems to impact only tests, since main.cc calls _exit directly.
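A sketch of the added step; the member and helper names are
assumptions, not the actual shutdown code:

    #include <seastar/core/future.hh>

    // The first shutdown step quiesces large-data recording, so later
    // memtable flushes, compactions and sstable deletions no longer try
    // to write to system.large_partition.
    seastar::future<> database::stop() {
        return _large_data_handler->stop().then([this] {
            return stop_remaining_services(); // placeholder for existing steps
        });
    }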
Tests: unit (release,debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190213194851.117692-1-espindola@scylladb.com>
The bulk materialized-view building process (when adding a materialized
view to a table with existing data) currently reads the base table in
batches of 128 (view_builder::batch_size) rows. This is clearly better
than reading entire partitions (which may be huge), but still, 128 rows
may grow pretty large when we have rows with large strings or blobs,
and there is no real reason to buffer 128 rows when they are large.
Instead, when the rows we read so far exceed some size threshold (in this
patch, 1MB), we can operate on them immediately instead of waiting for
128.
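A sketch of the resulting batching rule with the two thresholds from
this patch; the row type and accessors are illustrative:

    // Flush the current batch when either threshold is hit: 128 rows
    // (view_builder::batch_size) or 1MB of accumulated row data.
    constexpr size_t batch_size = 128;
    constexpr size_t batch_memory_max = 1024 * 1024;

    size_t rows_in_batch = 0;
    size_t bytes_in_batch = 0;
    for (auto& row : base_rows) {             // illustrative container
        consume(row);                         // add the row to the batch
        bytes_in_batch += row.memory_usage(); // assumed accessor
        if (++rows_in_batch >= batch_size || bytes_in_batch >= batch_memory_max) {
            flush_batch();                    // build view updates now
            rows_in_batch = bytes_in_batch = 0;
        }
    }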
As a side-effect, this patch also solves another bug: in the worst case, all
the base rows of one batch may be written into one output view partition,
in one mutation. But there is a hard limit on the size of one mutation
(commitlog_segment_size_in_mb, by default 32MB), so we cannot allow the
batch size to exceed this limit. By not batching further after 1MB,
we avoid reaching this limit when individual rows do not reach it but
128 of them would.
Fixes#4213.
This patch also includes a unit test reproducing #4213, and demonstrating
that it is now solved.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190214093424.7172-1-nyh@scylladb.com>
Fixes #4222
If an extension creation callback returns null (rather than throwing an
exception), we treat this as "I'm not needed" and simply ignore it.
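A minimal sketch of that convention; the types and the registration
call are illustrative:

    #include <memory>

    struct extension;                   // opaque here
    using extension_ptr = std::unique_ptr<extension>;

    void maybe_register(extension_ptr ext) {
        if (!ext) {
            return;                     // null from the creation callback:
        }                               // "I'm not needed" - just skip it
        // register_extension(std::move(ext)); // the normal path (assumed)
    }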
Message-Id: <20190213124311.23238-1-calle@scylladb.com>
Fixes #4083
Instead of a sharded collection in system.local, use a
dedicated system table (system.truncated) to store
truncation positions. This makes querying and updating easier,
and is easier on memory.
The code also migrates any existing truncation
positions on startup and clears the old data.
Refs #4085
Changes the commitlog descriptor to both accept "Recycled-Commitlog..."
file names and preserve said name in the descriptor.
This ensures we pick up the not-yet-used recycled segments left
over from a crash for replay. The replay in turn will simply ignore
the recycled files, and after the actual replay they will be deleted
as needed.
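A hedged sketch of the acceptance side; the exact prefix spelling and
the descriptor layout are assumptions:

    #include <string>

    // Accept recycled segment names alongside live ones, and keep
    // whichever name was seen so the file can later be deleted under it.
    bool is_recycled_segment_name(const std::string& filename) {
        return filename.rfind("Recycled-Commitlog", 0) == 0; // prefix test
    }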
Message-Id: <20190129123311.16050-1-calle@scylladb.com>
Right now Cassandra SSTables with counters cannot be imported into
Scylla. The reason for that is that Cassandra changed their counter
representation in their 2.1 version and kept transparently supporting
both representations. We do not support their old representation, nor
is there a sane way to figure out by looking at the data which one is
in use.
For safety, we had made the decision long ago to not import any
tables with counters: if a counter was generated in older Cassandra, we
would misrepresent it.
In this patch, I propose we offer a non-default way to import SSTables
with counters: we can gate it with a flag, and trust that the user knows
what they are doing when flipping it (at their own peril). Cassandra 2.1
is by now pretty old; many users can safely say they've never used
anything older.
While there are tools like sstableloader that can be used to import
those counters, there are often situations in which directly importing
SSTables is better, faster, or simply the only option left. I
argue that having a flag that allows us to import them when we are sure
it is safe is better than having no option at all.
With this patch I was able to successfully import Cassandra tables with
counters that were generated in Cassandra 2.1, reshard and compact their
SSTables, and read the data back to get the same values in Scylla as in
Cassandra.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190210154028.12472-1-glauber@scylladb.com>