A general build system knows about 3 machines:
* build: where the building is running
* host: where the built software will run
* target: the machine the software will produce code for
The target machine is only relevant for compilers, so we can ignore
it.
Until now we could ignore the build and host distinction too. This
patch adds the first difference: don't use host ld_flags when linking
build tools (gen_crc_combine_table).
The reason for this change is to make it possible to build with
-Wl,--dynamic-linker pointing to a path that will exist on the host
machine, but may not exist on the build machine.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191207030408.987508-1-espindola@scylladb.com>
The vm.swappiness sysctl controls the kernel's prefernce for swapping
anonymous memory vs page cache. Since Scylla uses very large amounts
of anonymous memory, and tiny amounts of page cache, the correct setting
is to prefer swapping page cache. If the kernel swaps anonymous memory
the reactor will stall until the page fault is satisfied. On the other
hand, page cache pages usually belong to other applications, usually
backup processes that read Scylla files.
This setting has been used in production in Scylla Cloud for a while
with good results.
Users can opt out by not installing the scylla-kernel-conf package
(same as with the other kernel tunables).
* seastar 166061da3...e440e831c (8):
> Fail tests on ubsan errors
> future: make a couple of asserts more strict
> future: Move make_ready out of line
> config: Do not allow zero rates
Fixes#5360
> future: add new state to avoid temporaries in get_available_state().
> future: avoid temporary future_state on get_available_state().
> future: inline future::abandoned
> noncopyable_function: Avoid uninitialized warning on empty types
Previously, scylla used min/max(blob)->blob overload for collections,
tuples and UDTs; effectively making the results being printed as blobs.
This PR adds "dynamically"-typed min()/max() functions for compound types.
These types can be complicated, like map<int,set<tuple<..., and created
in runtime, so functions for them are created on-demand,
similarly to tojson(). The comparison remains unchanged - underneath
this is still byte-by-byte weak lex ordering.
Fixes#5139
* jul-stas/5139-minmax-bad-printing-collections:
cql_query_tests: Added tests for min/max/count on collections
cql3: min()/max() for collections/tuples/UDTs do not cast to blobs
This tests new min/max function for collections and tuples. CFs
in test suite were named according to types being tested, e.g.
`cf_map<int,text>' what is not a valid CF name. Therefore, these
names required "escaping" of invalid characters, here: simply
replacing with '_'.
Before:
cqlsh> insert into ks.list_types (id, val) values (1, [3,4,5]);
cqlsh> select max(val) from ks.list_types;
system.max(val)
------------------------------------------------------------
0x00000003000000040000000300000004000000040000000400000005
After:
cqlsh> select max(val) from ks.list_types;
system.max(val)
--------------------
[3, 4, 5]
This is accomplished similarly to `tojson()`/`fromjson()`: functions
are generated on demand from within `cql3::functions::get()`.
Because collections can have a variety of types, including UDTs
and tuples, it would be impossible to statically define max(T t)->T
for every T. Until now, max(blob)->blob overload was used.
Because `impl_max/min_function_for` is templated with the
input/output type, which can be defined in runtime, we need type-erased
("dynamic") versions of these functors. They work identically, i.e.
they compare byte representations of lhs and rhs with
`bytes::operator<`.
Resolves#5139
If you merge a pull request that contains multiple patches via
the github interface, it will document itself as the committer.
Work around this brain damage by using the command line.
Introduce a new verb dedicated for receiving and sending hints: HINT_MUTATION. It is handled on the streaming connection, which is separate from the one used for handling mutations sent by coordinator during a write.
The intent of using a separate connection is to increase fairness while handling hints and user requests - this way, a situation can be avoided in which one type of requests saturate the connection, negatively impacting the other one.
Information about new RPC support is propagated through new gossip feature HINTED_HANDOFF_SEPARATE_CONNECTION.
Fixes#4974.
Tests: unit(release)
To make gms::inet_address::to_string() similar in output to origin.
The sole purpose being quick and easy fix of API/JMX ipv6
formatting of endpoints etc, where strings are used as lexical
comparisons instead of textual representation.
A better, but more work, solution is to fix the scylla-jmx
bridge to do explicit parse + re-format of addresses, but there
are many such callpoints.
An even better solution would be to fix nodetool to not make this
mistake of doing lexical comparisons, but then we risk breaking
merge compatibility. But could be an option for a separate
nodeprobe impl.
Message-Id: <20191204135319.1142-1-calle@scylladb.com>
Currently query_options objects is passed to a trace stopping function
which makes it mandatory to make them alive until the end of the
query. The reason for that is to add prepared statement parameters to
the trace. All other query options that we want to put in the trace are
copied into trace_state::params_values, so lets copy prepared statement
parameters there too. Trace enabled case will become a little bit more
expensive but on the other hand we can drop a continuation that holds
query_options object alive from a fast path. It is safe to drop the call
to stop_foreground_prepared() here since The tracing will be stopped
in process_request_one().
Message-Id: <20191205102026.GJ9084@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5381 by
Peng Jian, fixing multiple small issues with Redis:
* Rename the options related to Redis API, and describe them clearly.
* Rename redis_transport_port to redis_port
* Rename redis_transport_port_ssl to redis_ssl_port
* Rename redis_default_database_count to redis_database_count
* Remove unnecessary option enable_redis_protocol
* Modify the default value of opition redis_read_consistency_level and redis_write_consistency_level to LOCAL_QUORUM
* Fix the DEL command: support to delete mutilple keys in one command.
* Fix the GET command: return the empty string when the required key is not exists.
* Fix the redis-test/test_del_non_existent_key: mark xfail.
On aarch64, asan detected a use-after-move. It doesn't happen on x86_64,
likely due to different argument evaluation order.
Fix by evaluating full_slice before moving the schema.
Note: I used "auto&&" and "std::move()" even though full_slice()
returns a reference. I think this is safer in case full_slice()
changes, and works just as well with a reference.
Fixes#5419.
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index.
CREATE INDEX index1 ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates.
cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors
skip-read-validation -node 127.0.0.1;
Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.
Refs #5409Fixes#4615Fixes#5418
This test execution time dominates by a serious margin
test execution time in dev/release mode: reducing its
execution time improves the test.py turnaround by over 70%.
Message-Id: <20191204135315.86374-2-kostja@scylladb.com>
Currently, 'scylla thread' uses arch_prctl() to extract the value of
fsbase, used to reference thread local variables. gdb 8 added support
for directly accessing the value as $fs_base, so use that instead. This
works from core dumps as well as live processes, as you don't need to
execute inferior functions.
The patch is required for debugging threads in core dumps, but not
sufficient, as we still need to set $rip and $rsp, and gdb still[1]
doesn't allow this.
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=9370
The feature introduced by this commit declares that hints can be sent
using the new dedicated RPC verb. Before using the new verb, nodes need
to know if other nodes in the cluster will be able to handle the new
RPC verb.
Introduce a new verb dedicated for receiving and sending hints:
HINT_MUTATION. It is handled on the streaming connection, which is
separate from the one used for handling mutations sent by coordinator
during a write.
The intent of using a separate connection is to increase fariness while
handling hints and user requests - this way, a situation can be avoided
in which one type of requests saturate the connection, negatively
impacting the other one.
"
In several cases in distributed testing (dtest) we trigger compaction using nodetool compact assuming that when it is done, it is indeed really done.
However, the way compaction is currently implemented in scylla, it may leave behind some background tasks to delete the old sstables that were compacted.
This commit changes major compaction (triggered via the ss::force_keyspace_compaction api) so it would wait on the background deletes and will return only when they finish.
Fixes#4909
Tests: unit(dev), nodetool_refresh_with_data_perms_test, test_nodetool_snapshot_during_major_compaction
"
We may able to use chrony setup script on future version of RHEL/CentOS,
it better to run chrony setup when RHEL version >= 8, not only 8.
Note that on Fedora it still provides ntp/ntpdate package, so we run
ntp setup on it for now. (same on debian variants)
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191203192812.5861-1-syuu@scylladb.com>
`segment_manager' now uses a decorated version of `timed_out_error'
with hardcoded name. On the other hand `region_group' uses named
`on_request_expiry' within its `expiring_fifo'.
Fixes#5211
In 79935df959 replay apply-call was
changed from one with no continuation to one with. But the frozen
mutation arg was still just lambda local.
Change to use do_with for this case as well.
Message-Id: <20191203162606.1664-1-calle@scylladb.com>
Exception messages contain semaphore's name (provided in ctor).
This affects the queue overflow exception as well as timeout
exception. Also, custom throwing function in ctor was changed
to `prethrow_action', i.e. metrics can still be updated there but
now callers have no control over the type of the exception being
thrown. This affected `restricted_reader_max_queue_length' test.
`reader_concurrency_semaphore'-s docs are updated accordingly.
In a build configured with --debuginfo 0 the scylla binary still ends
up with some debug info from the libraries that are statically linked
in.
We should avoid compiling subprojects (including seastar) with debug
info when none is needed, but this at least avoids it showing up in
the binary.
The main motivation for this is that it is confusing to get a binary
with *some* debug info in it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191127215843.44992-1-espindola@scylladb.com>
This series refactors the collection de/serialization code to use
fragmented buffers, avoiding the large allocations and the associated
pains when working with large collections. Currently all operations that
involve collections require deserializing them, executing the operation,
then serializing them again to their internal storage format. The
de/serialization operations happen in linearized buffers, which means
that we have to allocate a buffer large enough to hold the *entire*
collection. This can cause immense pressure on the memory allocator,
which, in the face of memory fragmentation, might be unable to serve the
allocation at all. We've seen this causing all sorts of nasty problems,
including but not limited to: failing compactions, failing memtable
flush, OOM crash and etc.
Users are strongly discouraged from using large collections, yet they
are still a fact of life and have been haunting us since forever.
The proper solution for these problems would be to come up with an
in-memory format for collections, however that is a major effort, with a
lot of unknowns. This is something we plan on doing at some point but
until it happens we should make life less painful for those with large
collections.
The goal of this series is to avoid the need of allocating these large
buffers. Serialization now happens into a `bytes_ostream` which
automatically fragments the values internally. Deserialization happens
with `utils::linearizing_input_stream` (introduced by this series), which
linearizes only the individual collection cells, but not the entire
collection.
An important goal of this series was to introduce the least amount of
risk, and hence the least amount of code. This series does not try to
make a revolution and completely revamp and optimize the
de/serialization codepaths. These codepaths have their days numbered so
investing a lot of effort into them is in vain. We can apply incremental
optimizations where we deem it necessary.
Fixes: #5341
Support to delete multiple keys in one DEL command.
The feature of returning number of the really deleted keys is still not supported.
Return empty string to client for GET command when the required key is not exists.
Fixes: #5334
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Rename option redis_transport_port to redis_port, which the redis transport listens on for clients.
Rename option redis_transport_port_ssl to redis_ssl_port, which the redis TLS transport listens on for clients.
Rename option redis_database_count. Set the redis dabase count.
Rename option redis_keyspace_opitons to redis_keyspace_replication_strategy_options. Set the replication strategy for redis keyspace.
Remove option enable_redis_protocol, which is unnecessary.
Fixes: #5335
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Commit 96009881d8 added diffutils to the dependencies via
Seastar's install-dependencies.sh, after it was inadvertantly
dropped in 1164ff5329 (update to Fedora 31; diffutils is no
longer brought in as a side effect of something else).
Regenerate the image to include diffutils.
Ref #5401.
This patch added subtests for EOF process, it reads and writes the socket
directly by using protocol cmds.
We can add more tests in future, tests with Redis module will hide some
protocol error.
Signed-off-by: Amos Kong <amos@scylladb.com>
podman needs to relabel directories in exactly the same cases docker
does. The difference is that podman cannot relabel /tmp.
The reason it was working before is that in practice anyone using
dbuild has already relabeled any directories that need relabeling,
with the exception of /tmp, since it is recreated on every boot.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191201235614.10511-2-espindola@scylladb.com>
Use `utils::linearizing_input_stream` for the deserizalization of the
collection. Allows for avoiding the linearization of the entire cell
value, instead only linearizing individual values as they are
deserialized from the buffer.
`linearizing_input_stream` allows transparently reading linearized
values from a fragmented buffer. This is done by linearizing on-the-fly
only those read values that happen to be split across multiple
fragments. This reduces the size of the largest allocation from the size
of the entire buffer (when the entire buffer is linearized) to the size
of the largest read value. This is a huge gain when the buffer contains
loads of small objects, and modest gains when the buffer contains few
large objects. But the even in the worst case the size of the largest
allocation will be less or equal compared to the case where the entire
buffer is linearized.
This stream is planned to be used as glue code between the fragmented
cell value and the collection deserialization code which expects to be
reading linearized values.
Currently the loop which writes the data from the fragmented origin to
the destination, moves to the next chunk eagerly after writing the value
of the current chunk, if the current chunk is exhausted.
This presents a problem when we are writing the last piece of data from
the last chunk, as the chunk will be exhausted and we eagerly attempt to
move to the next chunk, which doesn't exist and dereferencing it will
fail. The solution is to not be eager about moving to the next chunk and
only attempt it if we actually have more data to write and hence expect
more chunks.
The presence of `const_iterator` seems to be a requirement as well
although it is not part of the concept. But perhaps it is just an
assumption made by code using it.