If a node is notified of a schema change where the schema's dropped
columns have changes, that node will miss the changes to the dropped
columns. This can happen when a column c is dropped, re-added with a
different type, and then dropped again, and a node n that saw the first
drop is then notified of the subsequent add and drop.
Fixes #2616
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-1-duarte@scylladb.com>
(cherry picked from commit 33e18a1779)
We need to consider the size of _live_endpoints. nr_live_nodes should
not be larger than that size, otherwise the loop that collects the live
nodes can run forever.
It is a regression introduced in commit 437899909d
(gossip: Talk to more live nodes in each gossip round).
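The clamping described above can be sketched as follows. This is a minimal illustration, not the gossiper code: pick_live_nodes, its parameters and the use of strings for endpoints are all hypothetical, and the endpoint list is assumed to hold distinct addresses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: picks up to `wanted` distinct entries at random.
// `live_endpoints` is assumed to contain no duplicates.
std::set<std::string> pick_live_nodes(const std::vector<std::string>& live_endpoints,
                                      std::size_t wanted,
                                      std::mt19937& rng) {
    // Clamp the target. Asking for more distinct nodes than exist would
    // make the collection loop below spin forever.
    const std::size_t nr_live_nodes = std::min(wanted, live_endpoints.size());
    std::set<std::string> picked;
    if (nr_live_nodes == 0) {
        return picked;
    }
    std::uniform_int_distribution<std::size_t> dist(0, live_endpoints.size() - 1);
    while (picked.size() < nr_live_nodes) {
        picked.insert(live_endpoints[dist(rng)]);
    }
    assert(picked.size() == nr_live_nodes);
    return picked;
}
```

Without the std::min() clamp, a request for five nodes out of two live endpoints would never satisfy the loop condition.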
Fixes #2637
Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>
(cherry picked from commit 515a744303)
* git@github.com:raphaelsc/scylla.git row_cache_fixes:
db: atomically synchronize cache with changes to the snapshot
db: refresh row cache's underlying data source after compaction
(cherry picked from commit 18be42f71a)
Empty clustering key range is perfectly valid and signifies that the
reader is not interested in anything but the static row. Let's not
make it mean anything else.
Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>
(cherry picked from commit 1ea507d6ae)
cache_streamed_mutation assumed that at least one clustering range was
specified. That was wrong since the readers are allowed to query just
for a static row (e.g. counter update that modifies only static
columns).
Fixes #2604.
Message-Id: <20170725131220.17467-1-pdziepak@scylladb.com>
(cherry picked from commit 6572f38450)
Boost 1.55 accidentally removed support for "range for" on
recursive_directory_iterator (earlier and later versions do
support it). Use old-style iteration instead.
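As an illustration of the old-style loop, here is a minimal sketch. It uses std::filesystem as a stand-in so that it builds without Boost; boost::filesystem::recursive_directory_iterator likewise uses a default-constructed iterator as the end marker. The function name count_entries is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>

namespace fs = std::filesystem;

// Counts every entry (files and directories) below `root` using an
// explicit begin/end iterator pair instead of a range-based for loop.
std::size_t count_entries(const fs::path& root) {
    assert(fs::is_directory(root));
    std::size_t n = 0;
    // A default-constructed iterator acts as the end marker.
    for (fs::recursive_directory_iterator it(root), end; it != end; ++it) {
        ++n;
    }
    return n;
}
```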
Message-Id: <20170724080128.8824-1-avi@scylladb.com>
(cherry picked from commit c21bb5ae05)
This is in order to avoid frequent misses which have a relatively high
cost. A miss means we need to fetch schema definition from another
node and in case of writes do a schema merge.
If the schema is kept alive only by the incoming request, then it
will be forgotten immediately when the request is done, and the next
request using the same schema version will miss again.
Refs #2608.
Message-Id: <1500632447-10104-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 29a82f5554)
"Fixes issues uncovered in longevity test (#2608).
Main problem is that due to time drift scylla_tables.version column
may not get deleted on all nodes doing the schema merge, which will
make some nodes come up with different table schema version than others.
The inconsistency will not heal because scylla_tables doesn't
take part in the schema sync. This is fixed by the last patch.
This will cause nodes to constantly try to sync the schema, which under
some conditions triggers #2617."
* tag 'tgrabiec/fix-table-schema-version-inconsistency-v1' of github.com:scylladb/seastar-dev:
schema_tables: Add scylla_tables to ALL
schema: Make schema_mutations equality consistent with digest
schema_tables: Extract compact_for_schema_digest()
schema_tables: Always drop scylla_tables::version
(cherry picked from commit 937fe80a1a)
global_schema_ptr ensures that schema object is replicated to other
cores on access. It was replicating the "synced" state as well, but
only when the shard didn't know about the schema. It could happen that
the other shard has the entry, but it's not yet synced, in which case
we would fail to replicate the "synced" state. This will result in
exception from mutate(), which rejects attempts to mutate using an
unsynced schema.
The fix is to always replicate the "synced" state. If the entry is
syncing, we will preemptively mark it as synced earlier. The syncing
code is already prepared for this.
Refs #2617.
Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 65c64614aa)
Instead of retrying, just drop mutations that raced with a truncate.
* git@github.com:duarten/scylla.git truncate-reorder/v1:
database: Rename replay_position_reordered_exception
database: Drop mutations that raced with truncate
(cherry picked from commit 63caa58b70)
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes - to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.
The recommended maximum is 50%. The limit is disabled by default, but
can be adjusted through a configuration option until we are able to
auto-tune it.
The second patch in this series provides a preview of what such
auto-tuning would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.
Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."
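The idea of steering the flush quota until the flush rate matches the incoming rate can be sketched with a simple proportional controller. This is an assumption-laden illustration, not the controller from the series: the class name, the gain parameter and the quota bounds are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical proportional controller for the memtable writer's CPU quota.
class flush_quota_controller {
    double _quota;      // fraction of CPU currently granted to the writer
    double _min_quota;
    double _max_quota;
    double _gain;       // how strongly to react to a rate imbalance
public:
    flush_quota_controller(double initial, double min_quota, double max_quota, double gain)
        : _quota(initial), _min_quota(min_quota), _max_quota(max_quota), _gain(gain) {
        assert(min_quota <= initial && initial <= max_quota);
    }

    // Feeds one sample of both rates (bytes/sec) and returns the new quota.
    double adjust(double incoming_rate, double flush_rate) {
        if (incoming_rate > 0) {
            // Positive error: flushing falls behind, so grant more CPU.
            // Negative error: flushing is ahead, so give CPU back.
            const double error = (incoming_rate - flush_rate) / incoming_rate;
            _quota = std::min(_max_quota, std::max(_min_quota, _quota + _gain * error));
        }
        return _quota;
    }

    double quota() const { return _quota; }
};
```

With a cap of 0.5 the sketch never grants the writer more than half of the CPU, mirroring the 50% limit mentioned above.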
* 'memtable-controller-v5' of https://github.com/glommer/scylla:
simple controller for memtable/streaming writer shares.
restrict background writers to 50 % of CPU.
(cherry picked from commit c5ee62a6a4)
"Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555)
by reverting back to using the old layout in memory and thus also
in across-node requests. We still use the new v3 layout in schema
tables (needed by drivers and external tools). Translations happen
when converting to/from schema mutations."
* tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev:
schema: Revert back to the 1.7 layout of static compact tables in memory
schema: Use v3 column layout when converting to/from schema mutations
schema: Encapsulate column layout translations in the v3_columns class
(cherry picked from commit 1daf1bc4bb)
Some places remained where code looked directly at
system_keyspace::NAME to determine whether a keyspace is
considered special/system/protected, including the
schema digest calculation.
Export "is_system_keyspace" and use it accordingly.
Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 247c36e048)
Instead of copying and moving the bound, pass it by reference so the
transformer can decide whether it wants to copy or not. The only
caller so far doesn't want a copy and takes the value by reference,
which would be capturing a temporary value. Caught by the
view_schema_test with gcc7.
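A minimal sketch of the by-reference approach, using hypothetical stand-in types (bound, transform_bound and the two callers are not the names used in the patch). Because the transformer receives a reference to the caller's object, a transformer that keeps a reference or pointer never ends up pointing at a temporary copy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a range bound.
struct bound {
    std::string value;
    bool inclusive;
};

// The bound is passed by const reference. The transformer decides whether
// to copy the value or to refer to the caller's object.
template <typename Transformer>
auto transform_bound(const bound& b, Transformer&& transformer) {
    return transformer(b.value);
}

// A transformer that only inspects the value: no copy is made.
inline std::size_t length_of(const bound& b) {
    return transform_bound(b, [](const std::string& v) { return v.size(); });
}

// A transformer that keeps a pointer to the value. This is safe only
// because `v` refers to the caller's bound and not to a temporary.
inline const std::string* address_of_value(const bound& b) {
    return transform_bound(b, [](const std::string& v) { return &v; });
}
```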
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705210255.29669-1-duarte@scylladb.com>
(cherry picked from commit 3dd0397700)
It is useful for larger clusters with higher gossip message latency. By
default fd_max_interval_ms is 2 seconds, which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in larger clusters, the gossip message update
interval can be larger than 2 seconds.
Fixes #2603.
Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
(cherry picked from commit adc5f0bd21)
branch 'tgrabiec/schema-migration-fixes' of github.com:scylladb/seastar-dev:
schema: Use proper name comparator
legacy_schema_migrator: Properly migrate non-UTF8 named columns
schema_tables: Store column_name in text form
legacy_schema_migrator: Migrate columns like Cassandra
schema_builder: Add factory method for default_names
legacy_schema_migrator: Simplify logic
thrift: Don't set regular_column_name_type
schema: Use proper column name type for static columns
schema: Fix column_name_type() for static compact tables
schema: Introduce clustering_column_at()
thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type
partition_slice_builder: Use proper column's type instead of regular_column_name_type()
(cherry picked from commit 13caccf1cf)
Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk
with write-back cache, and forever on a spinning disk.
Message-Id: <20170716081653.10018-1-avi@scylladb.com>
(cherry picked from commit d9c64ef737)
We will be creating links to those sstable's files, and those don't work
if the data directory and the test sstable are on different devices.
Copying the files to the same directory fixes the problem.
Message-Id: <20170716090405.14307-1-avi@scylladb.com>
(cherry picked from commit 9116dd91cb)
We don't ensure mutations are applied in memory following the order of their
replay positions. A memtable can thus be flushed with replay position rp,
with the new one being at replay position rp', where rp' < rp. This breaks
an intrinsic assumption in the code, which this series addresses.
Fixes #2074
branch memtable-flush/v3 of git@github.com:duarten/scylla.git:
commitlog: Always flush latest memtable
column_family: More precise count of switched memtables
column_family: Fix typo in pending_tasks metric name
column_family: More precise count of pending flushes
dirty_memory_manager: Remove unnecessary check from flush_one()
column_family: Don't rely on flush_queue to guarantee flushes finished
column_family: Don't bother closing the flush_queue on stop()
column_family: Stop using flush_queue
column_family: Remove outdated comment about the flush_queue
memtable: Stop tracking the highest flushed rp
(cherry picked from commit caa62f7f05)
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.
There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.
This patch fixes this and enforces the expected order.
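The intended layout can be sketched as follows: one row per partition key and one column per replica, with an empty slot where a replica sent nothing. The types here are hypothetical stand-ins (strings instead of real partition keys and mutations) and this is not the storage_proxy code.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real types.
using partition_key = std::string;
using mutation = std::string;

// One replica's reply: the mutations it sent, keyed by partition key.
using replica_reply = std::map<partition_key, mutation>;

// Builds a table with one row per partition key and one column per replica.
// Column i always holds replica i's mutation for that key, or std::nullopt
// if replica i sent nothing for it.
std::map<partition_key, std::vector<std::optional<mutation>>>
arrange_by_replica(const std::vector<replica_reply>& replies) {
    std::map<partition_key, std::vector<std::optional<mutation>>> rows;
    for (std::size_t replica = 0; replica < replies.size(); ++replica) {
        for (const auto& [key, m] : replies[replica]) {
            auto& row = rows[key];
            row.resize(replies.size());  // pad the row with empty slots
            row[replica] = m;
        }
    }
    return rows;
}
```

Indexing by replica position, rather than appending in arrival order, is what keeps each replica's mutations in a single column.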
Fixes #2531
Fixes #2593
Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
(cherry picked from commit b8235f2e88)
Current algorithm was marking tables with regular columns not named
"value" as not dense, which doesn't have to be the case. It can be
either way.
It should be enough to look at clustering components. If there is a
clustering key, then table is dense if and only if all comparator
components belong to the clustering key.
If there is no clustering key, then if there are any regular columns
we're sure it's not dense.
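The rule above can be written down as a small decision function. This is a sketch under stated assumptions: the struct and its field names are hypothetical, clustering key components are assumed to be a subset of the comparator components (so comparing counts is enough), and the case the text does not cover is reported as undecided.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical summary of a legacy table definition.
struct legacy_table_shape {
    std::size_t comparator_components;      // components of the comparator
    std::size_t clustering_key_components;  // how many of those form the clustering key
    std::size_t regular_columns;
};

// Returns true or false when the rules decide, and std::nullopt for the
// case the rules leave open (no clustering key and no regular columns).
std::optional<bool> is_dense(const legacy_table_shape& t) {
    if (t.clustering_key_components > 0) {
        // Dense if and only if every comparator component belongs to the
        // clustering key.
        return t.comparator_components == t.clustering_key_components;
    }
    if (t.regular_columns > 0) {
        return false;
    }
    return std::nullopt;
}
```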
Fixes #2587.
Message-Id: <1499877777-7083-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 30ec4af949)
The default of 2ms is somewhat arbitrary. Now that we have a lot more
mileage deploying Scylla applications in production, it looks not
only arbitrary, but also high.
In particular, it is really hard to achieve 1ms latencies in the face of
CPU-heavy workloads with it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1499354495-27173-1-git-send-email-glauber@scylladb.com>
(cherry picked from commit 780a6e4d2e)
"Currently new nodes calculate digests based on v3 schema mutations,
which are very different from v2 mutations. As a result they will
use schemas with a different table_schema_version than the old nodes.
The old nodes will not recognize the version and will try to request
its definition. That will fail, because old nodes don't understand
v3 schema mutations.
To fix this problem, let's preserve the digests during migration,
so that they're the same on new and old nodes. This will allow
requests to proceed as usual.
This does not solve the problem of schema being changed during
the rolling upgrade. This is not allowed, as it would bring the
same problem back.
Fixes #2549."
* tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev:
tests: Add test for concurrent column addition
legacy_schema_migrator: Set digest to one compatible with the old nodes
schema_tables: Persist table_schema_version
schema_tables: Introduce system_schema.scylla_tables
schema_tables: Simplify read_table_mutations()
schema_tables: Resurrect v2 read_table_mutations()
system_keyspace: Forward-declare legacy schemas
legacy_schema_migrator: Take storage_proxy as dependency
(cherry picked from commit a397889c81)
DowngradingConsistencyRetryPolicy uses the live replica count from the
Unavailable exception to adjust the CL for a retry, but when there are
pending nodes the CL is increased internally by the coordinator, and that
may prevent the retried query from succeeding. Adjust the live replica
count when pending nodes are present so that the retried query can proceed.
Fixes #2535
Message-Id: <20170710085238.GY2324@scylladb.com>
(cherry picked from commit 739dd878e3)
Use the name of the existing preceding column with a restriction
(last_column) instead of assuming that the column right after the
current column already has restrictions.
This will yield an error message that is different from that of
Cassandra, albeit still a correct one.
Fixes #2421
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>
(cherry picked from commit 33bc62a9cf)
CQL reply may contain metadata that describes columns present in the
response including the information about their type.
However, Scylla incorrectly reports counter types as bigint. The
serialised format of counters and bigint is exactly the same, which
could explain why the problem hasn't been noticed earlier but it is a
bug nevertheless.
Fixes #2569.
Message-Id: <20170711130520.27603-1-pdziepak@scylladb.com>
(cherry picked from commit 5aa523aaf9)
Otherwise we may deadlock, as explained in commit 5e8f0efc8:
Table drop starts with creating a snapshot on all shards. All shards
must use the same snapshot timestamp which, among other things, is
part of the snapshot name. The timestamp is generated using supplied
timestamp generating function (joinpoint object). The joinpoint object
will wait for all shards to arrive and then generate and return the
timestamp.
However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
Message-Id: <1499762663-21967-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 310d2a54d2)
`r` is moved from in one capture and also captured in a different lambda. The
compiler may perform the move first and the other capture later, resulting in a
use-after-free.
Fix by copying `r` instead of moving it.
Discovered by sstable_test in debug mode.
Message-Id: <20170702082546.20570-1-avi@scylladb.com>
(cherry picked from commit 07b8adce0e)
Configuring the cpufreq service on VMs/IaaS causes an error because they don't support cpufreq.
To avoid the error, skip the whole configuration when the driver is not loaded.
Fixes #2051
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498809504-27029-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 1c35549932)