This patch uses cf_properties instead to add the missing attributes to
the create_view_statement class.
Fixes #1766
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the VIEWS element to the cause enum so we can
mark failures due to incomplete support of materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the definition of the default compressor into the
compression_parameters class, so that the table and view creation
statements don't have to explicitly deal with it.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the cf_properties class, which contains common
attributes of tables and materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since we exit the Scylla process in engine().at_exit() using
::_exit(0), scylla always exits with 0, even when
verify_seastar_io_scheduler() throws an exception.
Because of this, systemd mistakenly concludes that scylla-server.service
was shut down successfully, so we need to pass the correct exit code to
::_exit() here.
Fixes #1674
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>
* seastar 207bf3d...ccd8649 (3):
> Merge "Augment semaphore with non-blocking operations" from Glauber
> Merge "More dynamic fstream patches" from Paweł
> Merge "fstream: add dynamic adjustments based on stream history" from Paweł
A 1MB response will require 2000 allocations with the current 512-byte
chunk size. Increase it exponentially to reduce allocation count for
larger responses (still respecting the upper limit).
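The effect on allocation count can be sketched as follows. This is an illustrative model only: the doubling factor and the 128 KiB cap are assumptions, since the message says only that the size grows exponentially up to an upper limit.

```python
def chunk_sizes(total_bytes, initial=512, max_chunk=128 * 1024, factor=2):
    """Return the sizes of the chunks allocated to hold total_bytes.

    factor and max_chunk are assumed values, not taken from the patch.
    """
    sizes = []
    remaining = total_bytes
    size = initial
    while remaining > 0:
        sizes.append(size)
        remaining -= size
        # Grow the next chunk, but never beyond the upper limit.
        size = min(size * factor, max_chunk)
    return sizes


fixed = chunk_sizes(1 << 20, factor=1)   # constant 512-byte chunks
growing = chunk_sizes(1 << 20)           # exponentially growing chunks
print(len(fixed), len(growing))
```

With a constant 512-byte chunk a 1 MiB response needs 2048 allocations; with growth the count drops to a small two-digit number under these assumed parameters.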
Message-Id: <1476369152-1245-1-git-send-email-avi@scylladb.com>
Memory accounting code was attaching partition_snapshot to
partition_entry in order to calculate the size of partition_version
object. However, this is only allowed if partition_entry doesn't have
any snapshot attached already. In this case it always has one, created
by the flushing reader.
Change the accounting code to reuse the existing partition_snapshot reference.
Fixes #1746
Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.
This patch changes the LSA code so that, when a zone cannot be created,
it returns a null pointer instead of throwing, which should be more
performant when the LSA memory cannot grow any more.
Fixes #1394.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>
The expected behaviour of the scylla_setup script is that each question
is followed by the actions its answer calls for.
For example, after asking whether scylla should be run as a service, the
relevant actions are taken before the next question is asked.
This patch addresses two such mis-orderings:
1. scylla-housekeeping depends on scylla-server, so the setup should
first set up the scylla-server service and only then ask about
(and install if needed) scylla-housekeeping.
2. The node_exporter should be placed after the io_setup is done.
Fixes #1739
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1476370098-25617-1-git-send-email-amnon@scylladb.com>
Change abstract_replication_strategy::create_replication_strategy() to
throw exceptions::configuration_error if replication strategy class
lookup fails, to make sure the error is converted to the correct CQL
response.
Fixes #1755
Message-Id: <1476361262-28723-1-git-send-email-penberg@scylladb.com>
* seastar f937fb0...207bf3d (11):
> Merge "iotune: gracefully exit on predictable exceptions" (Fixes #1623)
> core/semaphore: Add semaphore_units::release()
> Merge "Prometheus API with grafana uses labels" from Amnon
> core/thread: Fix stack alloc-dealloc mismatch
> core/thread: Make jmp_buf_link::yield_at use the same time point as thread_scheduling_group
> file: support for XFS on older kernels
> reactor: fix bug when handling EBADF in flush_pending_aio()
> prometheus CPU should start in 0
> Collectd: bytes ordering depends on the type
> tests: Check that backtrace() doesn't corrupt signal mask
> core/thread: Add stack guards to seastar thread stacks
If we have a range query involving a wrapping range (i.e., from thrift),
and mutations from both halves of the result are involved, then
we will return the results in the wrong order (and potentially the wrong
partitions) since we order by token, so the results from the second half
of the wrapping range end up before the first.
Fix by splitting the two queries, and merging the second half with lower
priority compared to the first half.
Note: this will be fixed in a better way once we have the sharding iterator,
as then we can query sequentially.
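The ordering fix can be illustrated with a small model. This is a sketch, not the patch itself: tokens are plain integers, the range is taken as (start, end], and the helper names are hypothetical.

```python
def query(tokens, lo, hi):
    """Return tokens t with lo < t <= hi in token order; None means unbounded."""
    return sorted(t for t in tokens
                  if (lo is None or t > lo) and (hi is None or t <= hi))


def query_wrapping(tokens, start, end):
    """Query (start, end], which wraps around the ring when start >= end."""
    if start < end:
        return query(tokens, start, end)
    first_half = query(tokens, start, None)   # from start to the end of the ring
    second_half = query(tokens, None, end)    # from the ring's beginning to end
    # The second half is given lower priority: it always follows the first.
    return first_half + second_half


tokens = [-50, -10, 0, 20, 70, 90]
print(query_wrapping(tokens, 60, -5))
```

Sorting the combined result purely by token would instead place -50 and -10 ahead of 70 and 90, which is the misordering described above.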
Fixes #1761.
Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>
"This series addresses two issues that interfere with running the node_exporter as a service in Ubuntu 16.
1. The service file should be packed in the deb file
2. When setting the node_exporter as a service it doesn't need to run as the scylla user"
* 'amnon/node_exporter_ubuntu_v2' of github.com:cloudius-systems/seastar-dev:
node-exporter service: No need to run as scylla user
debian package: Include the node_exporter service file
The node-exporter does not need to run as the scylla user. It can run
without scylla or without the scylla user being configured.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"This patch-set re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.
Ref #1139
Ref #693"
* 'describe-splits/v2' of github.com:duarten/scylla:
thrift: Implement describe_splits_ex based on Cassandra
storage_service: Implement get_splits() function
sstables: Add function to get key samples
sstables/key: Add to_partition_key function
size_estimates_recorder: Increase estimate accuracy
sstables: Get estimates for a particular range
sstables/key: Make key::kind public
The script mistakenly split the value at "," when the cpuset list is
comma-separated. Instead of matching possible patterns of the argument,
let's pass all characters up to the next space delimiter or end of line.
Fixes #1716
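The approach can be sketched with a regular expression that captures every non-space character of the argument. This is an illustration only: the option name --cpuset and the function name are assumptions, and the actual script is not shown in this message.

```python
import re


def extract_cpuset(args):
    """Return the cpuset value from an argument string, or None if absent.

    \\S+ takes every character up to the next whitespace or end of line,
    so commas and dashes inside the list are preserved.
    """
    match = re.search(r"--cpuset\s+(\S+)", args)
    return match.group(1) if match else None


print(extract_cpuset("--cpuset 0-3,8,10-11 --smp 4"))
```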
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476171037-32373-1-git-send-email-syuu@scylladb.com>
This patch re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.
Ref #1139
Ref #693
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the get_splits() function in storage_service,
used to split a particular token range into slices of approximately the
specified size, using the sample keys and estimates of the CF's
sstables.
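A rough model of the idea, assuming integer tokens and a precomputed key estimate. The function signature and the way boundaries are picked from the samples are assumptions for illustration, not the storage_service implementation.

```python
def get_splits(start, end, sample_tokens, estimated_keys, keys_per_split):
    """Split (start, end) into slices of roughly keys_per_split keys each.

    Boundaries are chosen from the sampled tokens at a regular stride.
    """
    samples = sorted({t for t in sample_tokens if start < t < end})
    # Never ask for more slices than the samples can delimit.
    split_count = max(1, min(len(samples) + 1, estimated_keys // keys_per_split))
    if split_count == 1:
        return [(start, end)]
    step = len(samples) / split_count
    boundaries = [start]
    for i in range(1, split_count):
        boundaries.append(samples[int(i * step)])
    boundaries.append(end)
    return list(zip(boundaries, boundaries[1:]))


samples = [10, 20, 30, 40, 50, 60, 70]
print(get_splits(0, 80, samples, estimated_keys=400, keys_per_split=100))
```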
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the get_key_samples() function, on which a
future patch will base an implementation of the describe_splits()
thrift verb closer to Cassandra's.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the estimated_keys_for_range() function, which
estimates the number of keys present within the specified range.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"The version is taken from the installation rather than from the API, a
mode command-line option indicates that this is part of the setup, and a
uuid is used for the interaction with the checkversion server."
* 'amnon/check_version_on_startup_v3' of github.com:cloudius-systems/seastar-dev:
scylla_setup: Check and report the scylla version
scylla-housekeeping: check version during setup
There is already queue_length-requests_blocked_memory, but it's a
gauge, so it does not reflect what happened between the sampling points.
total_operations-requests_blocked_memory will make it possible to see
whether any requests (and how many) were blocked by dirty memory.
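The difference between the two metric kinds can be shown with a toy model. The class and attribute names here are hypothetical; only the gauge-versus-counter behaviour is the point.

```python
class BlockedRequestStats:
    """Tracks blocked requests as both a gauge and a monotonic counter."""

    def __init__(self):
        self.queue_length = 0      # gauge: requests blocked right now
        self.total_blocked = 0     # counter: requests ever blocked

    def block(self):
        self.queue_length += 1
        self.total_blocked += 1

    def unblock(self):
        self.queue_length -= 1


stats = BlockedRequestStats()
before = (stats.queue_length, stats.total_blocked)   # first sampling point
for _ in range(3):
    stats.block()
for _ in range(3):
    stats.unblock()
after = (stats.queue_length, stats.total_blocked)    # second sampling point
print(before, after)
```

The gauge reads zero at both sampling points, while the counter shows that three requests were blocked in between.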
Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
Presents the current heap profile recording.
Works in text mode or dumps to the collapsed stacks format, from which
a flame graph can be generated.
To generate a flamegraph:
(gdb) scylla heapprof --flame
Wrote heapprof.stacks
$ flamegraph.pl --colors mem < heapprof.stacks > heapprof.svg
flamegraph.pl comes from:
https://github.com/brendangregg/FlameGraph.git
Text mode example:
(gdb) scylla heapprof --min 100000000
All (274699676, #10213)
\-- void* memory::cpu_pages::allocate_large_and_trim<memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}>(unsigned int, memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}) + 169 (268435456, #1)
memory::allocate_large_aligned(unsigned long, unsigned long) + 87
memory::allocate_aligned(unsigned long, unsigned long) + 48
aligned_alloc + 9
logalloc::segment_zone::segment_zone() + 304
logalloc::segment_pool::allocate_segment() + 477
logalloc::segment_pool::segment_pool() + 304
__tls_init.part.801 + 72
logalloc::region_group::release_requests() + 1333
logalloc::region_group::add(logalloc::region_group*) + 514
The branches are formatted like this:
-- <symbol> (<size>, #<count>)
Where <size> is total size of live objects and <count> is total
number of live objects, for all objects allocated from paths going
through this node.
Nodes which share the same <size> and <count> are stacked like this:
-- <symbol_1> (<size>, #<count>)
<symbol_2>
<symbol_3>
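For post-processing the text output, a branch line in the format above could be parsed roughly like this. It is a sketch based only on the format described here; the function name is hypothetical and it is not part of the gdb script.

```python
import re

# Optional "\-- " or "-- " prefix, then the symbol, then "(<size>, #<count>)".
BRANCH_RE = re.compile(r"^(?:\\?-- )?(.*) \((\d+), #(\d+)\)$")


def parse_branch(line):
    """Return (symbol, size, count), or None for a stacked continuation line."""
    match = BRANCH_RE.match(line.strip())
    if match is None:
        return None
    symbol, size, count = match.groups()
    return symbol, int(size), int(count)


print(parse_branch("All (274699676, #10213)"))
print(parse_branch("aligned_alloc + 9"))
```

The greedy symbol match keeps parentheses that belong to C++ signatures, since only the final "(size, #count)" group is anchored at the end of the line.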
Message-Id: <1475583334-19524-1-git-send-email-tgrabiec@scylladb.com>
Limiting the concurrency of memtable flushes to 4 was a temporary
workaround for the fact that we lacked good write behind support. Now
that write behind is properly merged we can reduce the concurrency to
what it should be, one.
This means that memtable flushes will now be serialized, and only when
one of them ends will the next one begin. Disk parallelism is obtained
through the write-behind mechanism.
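Serialization through a concurrency limit of one can be modelled as below. This uses Python's asyncio as a stand-in for the real scheduling; the function names are hypothetical and the sleep merely represents writing an sstable.

```python
import asyncio


async def flush(name, limit, log):
    async with limit:                 # only `concurrency` flushes run at once
        log.append(("start", name))
        await asyncio.sleep(0)        # stand-in for writing the sstable
        log.append(("end", name))


async def flush_all(names, concurrency=1):
    limit = asyncio.Semaphore(concurrency)
    log = []
    await asyncio.gather(*(flush(n, limit, log) for n in names))
    return log


log = asyncio.run(flush_all(["m1", "m2", "m3"]))
print(log)
```

With a limit of one, every "start" entry is immediately followed by the matching "end" entry, so no two flushes overlap.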
Fixes #1373
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>
There is a limit to concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers) readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.
This timeout should not apply for readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.
The change sets no timeout for streaming readers at the replica level,
as we already do for system table readers.
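The policy can be sketched with a semaphore whose wait is bounded for CQL reads and unbounded for streaming. This is an asyncio model under assumed names, not the replica code; the limit of one slot and the 10 ms timeout are chosen only to keep the demo short.

```python
import asyncio


class ReaderConcurrencyLimit:
    def __init__(self, max_readers):
        self._slots = asyncio.Semaphore(max_readers)

    async def admit(self, timeout=None):
        """Wait for a reader slot; timeout=None waits indefinitely."""
        if timeout is None:
            await self._slots.acquire()
        else:
            await asyncio.wait_for(self._slots.acquire(), timeout)

    def release(self):
        self._slots.release()


async def demo():
    limit = ReaderConcurrencyLimit(max_readers=1)
    await limit.admit()                                  # the only slot is taken
    streaming = asyncio.ensure_future(limit.admit())     # streaming: no timeout
    try:
        await limit.admit(timeout=0.01)                  # CQL read: bounded wait
        cql = "admitted"
    except asyncio.TimeoutError:
        cql = "timed out"
    limit.release()                                      # a slot frees up later
    await streaming
    return cql, streaming.done()


result = asyncio.run(demo())
print(result)
```

The queued CQL reader gives up once its timeout expires, while the streaming reader keeps waiting and is admitted when a slot is released.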
Fixes #1741.
Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range. If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.
"before" and "after" are not well defined for wrap-around ranges, but
we are phasing them out and soon there will be no
wrapping_range::split_after() users.
This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
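The generalized behaviour can be modelled on a simple non-wrapping integer range. The Range type and the handling of a split point equal to the end bound are assumptions for illustration; the real function operates on Scylla's range types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    start: int
    end: int
    start_inclusive: bool = True
    end_inclusive: bool = True


def split_after(r: Range, split_point: int) -> Optional[Range]:
    """Return the part of r that lies strictly after split_point."""
    if split_point < r.start:
        return r          # split point is before the range: keep all of it
    if split_point >= r.end:
        return None       # split point is at or after the end: nothing left
    # Split point is inside the range: exclude it from the result.
    return Range(split_point, r.end, False, r.end_inclusive)


print(split_after(Range(10, 20), 5))
print(split_after(Range(10, 20), 15))
print(split_after(Range(10, 20), 25))
```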
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
Commit log replay is a synchronous operation in bootstrap, so services
will only be started after it has completed. If compaction starts before
replay, the two compete for bandwidth and boot is consequently slowed
down. The fix is simply to move the start of compaction, which is an
asynchronous operation, to after commitlog replay is over.
Fixes #1620.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d2a173a4ee4d474317b970c6b39530e61067fea9.1475527955.git.raphaelsc@scylladb.com>
This patch adds the parsing for the "CREATE MATERIALIZED VIEW" statement,
following Cassandra 3 syntax. For example:
CREATE MATERIALIZED VIEW building_by_city
AS SELECT * FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);
It also adds the "IS NOT NULL" operator needed for this purpose.
As in Cassandra, "IS NOT NULL" can only be used for materialized
view creation, and not in a normal SELECT. It can only be used with
the NULL operand (i.e., "IS NOT 3" will be a syntax error).
The current implementation of this statement just does some sanity
checking (such as verifying that "city" is a valid column name and that
the "buildings" base table exists), and then complains that materialized
views are not yet supported:
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="Failed parsing statement: [CREATE MATERIALIZED VIEW building_by_city AS
SELECT * FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);] reason: unsupported operation: Materialized views not yet supported">
As mentioned above, the "IS NOT NULL" restriction is not allowed in
ordinary selects that do not create a materialized view:
SELECT * FROM buildings WHERE city IS NOT NULL;
InvalidRequest: code=2200 [Invalid query] message="restriction 'city IS NOT null' is only supported in materialized view creation"
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1475742927-30695-1-git-send-email-nyh@scylladb.com>
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests flush synthetic memtables that do not have
a region_group attached to them.
Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.
If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.
One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.
The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.
Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
"Description:
============
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Results
=======
With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.
In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.
For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ
Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed. I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have run (large, 65k mutations). Memtables under this circumstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.
Known Issues:
-------------
A lot of time can elapse between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.
After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."
* 'virtual-dirty-v6' of github.com:glommer/scylla:
database: allow virtual dirty memory management
streamed_mutation: make _buffer private
add accounting of memory read to partition_snapshot_reader
move partition_snapshot_reader code to header file
LSA: allow a group to query its own region group
memtables: split scanning reader in two
sstables: use special reader for writing a memtable
LSA: export information about object memory footprint
LSA: export information about size of the throttle queue
database: export virtual dirty bytes region group
* seastar 18f7bb8...f937fb0 (5):
> Merge "Fix signal mask corruption" from Tomasz
> core/memory: Avoid violating strict aliasing when accessing allocation sites
> core/memory: Avoid indirection when storing allocation sites
> core/memory: Add a way to disable abort on allocation failure in some scope
> core/sharded: Allow mapper to take the service by non-const reference
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
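The accounting idea can be sketched with a toy throttle. The class, its method names, and the byte counts are hypothetical; the model only illustrates admitting up to 50 % of the limit and releasing capacity as a flush makes progress rather than when it completes.

```python
class VirtualDirtyThrottle:
    def __init__(self, dirty_limit):
        self.soft_limit = dirty_limit // 2   # stop admitting at 50 % of the limit
        self.real_dirty = 0      # bytes actually held by memtables
        self.virtual_dirty = 0   # real dirty minus bytes already written out

    def can_admit(self):
        return self.virtual_dirty < self.soft_limit

    def write(self, nbytes):
        """Admit a write if there is virtual room; return whether it ran."""
        if not self.can_admit():
            return False
        self.real_dirty += nbytes
        self.virtual_dirty += nbytes
        return True

    def flush_progress(self, nbytes):
        """The flush wrote nbytes to disk; the memory itself is not freed yet."""
        self.virtual_dirty -= nbytes


throttle = VirtualDirtyThrottle(dirty_limit=100)
first = throttle.write(50)        # fills the 50 % soft limit
blocked = throttle.write(10)      # refused: no virtual room left
throttle.flush_progress(20)       # part of the memtable reached the disk
resumed = throttle.write(10)      # admitted at the rate of the write-back
print(first, blocked, resumed, throttle.real_dirty, throttle.virtual_dirty)
```

Real dirty memory keeps growing until the sstable is sealed, which is why the soft limit leaves the other half of the dirty region as headroom.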
Signed-off-by: Glauber Costa <glauber@scylladb.com>