11716 Commits

Author SHA1 Message Date
Pekka Enberg
1d5f7be447 systemd: Use PermissionsStartOnly instead of running sudo
Use the PermissionsStartOnly systemd option to apply the permission
related configurations only to the start command. This allows us to stop
using "sudo" for ExecStartPre and ExecStopPost hooks and drop the
"requiretty" /etc/sudoers hack from Scylla's RPM.

Tested-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1466407587-31734-1-git-send-email-penberg@scylladb.com>
2016-06-20 11:53:24 +03:00
Vlad Zolotarov
baf3614e8f sstables: don't backup sstables that are a result of a compaction
According to the incremental backup description
(http://docs.datastax.com/en/cassandra_win/2.2/cassandra/operations/opsBackupIncremental.html),
sstables that are the result of a compaction process should not
be backed up, since the original sstables have already been backed up.

Fixes #1308

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1466338622-7323-1-git-send-email-vladz@cloudius-systems.com>
2016-06-20 09:52:30 +03:00
Pekka Enberg
f4153c75a0 cql3: Bump CQL language version to 3.2.1
We already added 3.2.1 support in commit 569d288 ("cql3: Add TRUNCATE
TABLE alias for TRUNCATE") but never got around to fixing the CQL version
reported to drivers.

Fixes #1358.

Message-Id: <1466403967-28654-1-git-send-email-penberg@scylladb.com>
2016-06-20 09:42:12 +03:00
Avi Kivity
07045ffd7c dist: fix scylla-kernel-conf postinstall scriptlet failure
Because we build on CentOS 7, which does not have the %sysctl_apply macro,
the macro is not expanded, and therefore executed incorrectly even on 7.2,
which does.

Fix by expanding the macro manually.

Fixes #1360.
Message-Id: <1466250006-19476-1-git-send-email-avi@scylladb.com>
2016-06-20 09:36:39 +03:00
Lucas Meneghel Rodrigues
ae622b0c08 dist/common/scripts/scylla_kernel_check: Update messages
Small grammar tweaks to the script's output messages.

Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Message-Id: <1466205496-3885-3-git-send-email-lmr@scylladb.com>
2016-06-19 19:28:58 +03:00
Lucas Meneghel Rodrigues
aacf7eb2ae dist/common/scripts/scylla_kernel_check: Fix conditional statement
Since most of the time people are running scylla_setup on
a fully upgraded ubuntu 14.04 box, we rarely reach that
code path, but once we do we end up with an error. Let's
fix that.

Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Message-Id: <1466205496-3885-2-git-send-email-lmr@scylladb.com>
2016-06-19 19:28:56 +03:00
Nadav Har'El
faa45812b2 Rewrite shared sstables only after entire CF is read
Starting in commit 721f7d1d4f, we start "rewriting" a shared sstable (i.e.,
splitting it into individual shards) as soon as it is loaded in each shard.

However, as discovered in issue #1366, this is too soon: our compaction
process relies in several places on compaction being done only after all
the sstables of the same CF have been loaded. One example is that we
need to know the content of the other sstables to decide which tombstones
we can expire (this is issue #1366). Another example is that we use the
last generation number we are aware of to decide the number of the next
compaction output, and this is wrong before we have seen all sstables.

So with this patch, while loading sstables we only make a list of shared
sstables which need to be rewritten - and the actual rewrite is only started
when we finish reading all the sstables for this CF. We need to do this in
two cases: reboot (when we load all the existing sstables we find on disk),
and nodetool refresh (when we import a set of new sstables).

Fixes #1366.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466344078-31290-1-git-send-email-nyh@scylladb.com>
2016-06-19 16:50:51 +03:00
Paweł Dziepak
dde87e0b0e row_cache: drop schema upgrade for new entries in update()
Commit daad2eb "row_cache: fix memory leak in case of schema upgrade
failure" has fixed a memory leak caused by failed upgrade_entry().
However, in case of upgrade failure, the memtable_entry used to create the
new cache entry was left in an invalid state. If the operation was
retried, the cache would again attempt to apply that memtable_entry, which
would by then be in an invalid state.

The solution is either to ignore upgrade_entry() exceptions or not to
call it at all and let the cache entry be upgraded on demand. This patch
implements the latter.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1466163435-27367-1-git-send-email-pdziepak@scylladb.com>
2016-06-17 13:43:01 +02:00
Paweł Dziepak
daad2ebf81 row_cache: fix memory leak in case of schema upgrade failure
When update() causes a new entry to be inserted into the cache, the
procedure is as follows:
1. allocate and construct new entry
2. upgrade entry schema
3. add entry to lru list and cache tree

Step 2 may fail and at this point the pointer to the entry is neither
protected by RAII nor added in any of the cache containers. The solution
is to swap steps 2 and 3 so that even if the upgrade fails the entry is
already owned by the cache and won't leak.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1466161709-25288-1-git-send-email-pdziepak@scylladb.com>
2016-06-17 13:12:01 +02:00
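The reordering described in daad2ebf81 can be illustrated with a minimal sketch. Everything here is a toy stand-in: `entry`, `toy_cache`, and this `upgrade_entry()` signature are hypothetical, and the toy uses `std::unique_ptr` in a vector where the real cache uses its own containers. The only point is the order of operations: the entry is handed to its owner before the step that may throw.

```cpp
#include <memory>
#include <stdexcept>
#include <vector>

// Toy stand-ins; none of these definitions come from the Scylla source.
struct entry {
    int schema_version = 1;
};

// Simulated schema upgrade that may throw (step 2 in the message above).
void upgrade_entry(entry& e, int target_version, bool fail) {
    if (fail) {
        throw std::runtime_error("schema upgrade failed");
    }
    e.schema_version = target_version;
}

struct toy_cache {
    std::vector<std::unique_ptr<entry>> entries;

    // Insert first, upgrade second: if the upgrade throws, the entry is
    // already owned by `entries` and is released together with the cache.
    entry& insert_then_upgrade(int target_version, bool fail) {
        entries.push_back(std::make_unique<entry>());
        entry& e = *entries.back();
        upgrade_entry(e, target_version, fail);
        return e;
    }
};
```

If the upgrade throws, the new entry stays in `entries` at its old schema version instead of being lost, which is the property the commit relies on.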
Asias He
4f3ce42163 storage_service: Prevent old version node to join a new version cluster
We want to prevent an older version of scylla, which has fewer features, from
joining a cluster running a newer version of scylla, which has more features,
because when scylla sees that a feature is enabled on all other nodes, it
will start to use the feature and assume that existing nodes and future nodes
will always have this feature.

In order to support downgrade during a rolling upgrade, we need to support
the case of mixed old and new nodes.

1) All old nodes
O O O O O <- N   OK
O O O O O <- O   OK

2) All new nodes
N N N N N <- N   OK
N N N N N <- O   FAIL

3) Mixed old and new nodes
O N O N O <- N   OK
O N O N O <- O   OK

(O == old node, N == new node, <- == joining the cluster)

With this patch, I tested:

1.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}

1.2) Add old node to old node cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}

2.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}

2.2) Add old node to new node cluster
seastar - Exiting on unhandled exception: std::runtime_error (Feature
check failed. This node can not join the cluster because it does not
understand the feature. Local node 127.0.0.4 features = {}, Remote
common_features = {RANGE_TOMBSTONES})

3.1) Add new node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {}

3.2) Add old node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}

Fixes #1253
2016-06-17 10:49:45 +08:00
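The pass/fail table in 4f3ce42163 is consistent with a simple subset test: a node may join when it understands every feature the existing nodes have in common. The sketch below assumes exactly that reading; `feature_check_passes` is a hypothetical name and `std::set<std::string>` stands in for the real feature set type.

```cpp
#include <algorithm>
#include <set>
#include <string>

// Hypothetical helper: the joining node passes when its own features are a
// superset of the features common to the nodes already in the cluster.
bool feature_check_passes(const std::set<std::string>& local_features,
                          const std::set<std::string>& remote_common_features) {
    // std::includes needs sorted ranges; std::set iterates in sorted order.
    return std::includes(local_features.begin(), local_features.end(),
                         remote_common_features.begin(),
                         remote_common_features.end());
}
```

With this rule, an old node (no features) is rejected only by an all-new cluster, because a mixed cluster has an empty common feature set.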
Asias He
32ed468e42 gossip: Remove empty string feature in get_supported_features
If the feature string is empty, boost::split will return
std::set<sstring> = {""} instead of std::set<sstring> = {},
which will make a node with a feature, e.g. std::set<sstring> =
{"RANGE_TOMBSTONES"}, think it does not understand the feature of
a node with no features at all.
2016-06-17 10:49:45 +08:00
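The pitfall in 32ed468e42 can be reproduced without Boost. The sketch below uses a hand-written splitter, `split_keep_empty` (a hypothetical name), that keeps empty tokens the way the message describes for boost::split, and a hypothetical `parse_features` that drops the empty string afterwards. `std::string` stands in for `sstring`.

```cpp
#include <set>
#include <string>

// Hand-written splitter that keeps empty tokens, so "" yields {""}.
// This only mimics the behavior described above; it is not boost::split.
std::set<std::string> split_keep_empty(const std::string& s, char sep) {
    std::set<std::string> out;
    std::string::size_type start = 0;
    while (true) {
        const auto pos = s.find(sep, start);
        if (pos == std::string::npos) {
            out.insert(s.substr(start));
            break;
        }
        out.insert(s.substr(start, pos - start));
        start = pos + 1;
    }
    return out;
}

// Hypothetical parser: split on ',' and remove the empty-string "feature".
std::set<std::string> parse_features(const std::string& s) {
    auto features = split_keep_empty(s, ',');
    features.erase("");
    return features;
}
```

Without the `erase("")` step, a node advertising no features would appear to advertise one feature named `""`, which no peer understands.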
Gleb Natapov
4659800ab9 storage_proxy: implement custom speculative retry strategy
The user may specify the time after which a speculative retry should happen
instead of relying on cf statistics. Use the provided value in the speculative
executor.

Message-Id: <20160616104422.GH5961@scylladb.com>
2016-06-16 13:45:56 +03:00
Pekka Enberg
d72c608868 service/storage_service: Make do_isolate_on_error() more robust
Currently, we only stop the CQL transport server. Extract a
stop_transport() function from drain_on_shutdown() and call it from
do_isolate_on_error() to also shut down the inter-node RPC transport,
Thrift, and other communications services.

Fixes #1353
2016-06-16 13:34:09 +03:00
Avi Kivity
85bb5ea064 Merge "Reduce LSA reclaim latency" from Tomasz
"Reclaiming many segments was observed to cause up to multi-ms
latency. With the new setting, the latency of reclamation cycle with
full segments (worst case mode) is below 1ms.

I saw no difference in throughput in a CQL write micro benchmark
in either of these workloads:
 - full segments, reclaim by random eviction
 - sparse segments (3% occupancy), reclaim by compaction and no eviction

Fixes #1274."
2016-06-16 10:47:57 +03:00
Pekka Enberg
a8f95e8081 dist/docker: Use Scylla superpackage for installation
Make the Dockerfile more future-proof by using the Scylla superpackage
for installation.

Message-Id: <1466015996-19792-1-git-send-email-penberg@scylladb.com>
2016-06-16 10:32:18 +03:00
Benoît Canet
c133748a24 scylla_setup: Fix RAID device enumeration
Commit f42673ed1e ("scylla_setup: Hide busy block devices from RAID0
configuration") wasn't enumerating anything. Additionally, it listed from
/dev/ and not /dev/dm, which broke the test conditions.

This one uses blkid instead of /proc/partitions.

A follow up patch will be required to mask encrypted devices.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466059657-12377-1-git-send-email-benoit@scylladb.com>
2016-06-16 09:52:25 +03:00
Glauber Costa
01a658f51d LSA: helper function for region_group
The current hierarchy walk is converted, but more users will come.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-06-15 22:26:50 -04:00
Glauber Costa
741aa16748 LSA: allow a region_group to have a threshold for throttling specified
Allocations will still be allowed if made directly, but callers will have the
choice (in an upcoming patch) to proceed only if memory is below this threshold.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-06-15 22:26:50 -04:00
Glauber Costa
7cd0c0731e region_group: delete move constructor
Tomek correctly points out that since we are now using "this" in lambda
captures, we should make the region_group not movable. We currently define a
move constructor, but there are no users. So we should just remove it.

The copy constructor is already deleted, and so are the copy and move assignment
operators. So by removing the move constructor, we should be fine.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-06-15 22:26:50 -04:00
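The reasoning in 7cd0c0731e is that a lambda capturing `this` stores the object's address, so moving the object would leave the stored pointer dangling. A minimal sketch of the resulting class shape follows; `toy_group` and its members are hypothetical and only illustrate the pattern, not region_group itself.

```cpp
#include <functional>
#include <type_traits>

// Toy class whose callback captures `this`; moving it would leave the
// callback pointing at the old object, so all copy/move operations are deleted.
class toy_group {
    int _total = 0;
    std::function<void(int)> _on_alloc;
public:
    toy_group() : _on_alloc([this](int n) { _total += n; }) {}

    toy_group(const toy_group&) = delete;
    toy_group(toy_group&&) = delete;
    toy_group& operator=(const toy_group&) = delete;
    toy_group& operator=(toy_group&&) = delete;

    void allocate(int n) { _on_alloc(n); }
    int total() const { return _total; }
};

// The compiler now rejects accidental moves and copies.
static_assert(!std::is_move_constructible<toy_group>::value,
              "toy_group must not be movable");
static_assert(!std::is_copy_constructible<toy_group>::value,
              "toy_group must not be copyable");
```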
Benoît Canet
0cf8144485 scylla_setup: Propose defaults values when judicious
Also takes care of explaining the options.

Fixes #1031

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466011848-11054-1-git-send-email-benoit@scylladb.com>
2016-06-15 20:33:55 +03:00
Benoît Canet
263a55c0da scylla_setup: Inform the user that he can skip any step
Fixes: #1188

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466001423-9547-3-git-send-email-benoit@scylladb.com>
2016-06-15 19:38:23 +03:00
Benoît Canet
f42673ed1e scylla_setup: Hide busy block devices from RAID0 configuration
This patch looks in /proc/mount for the device name so that
the device or its subdevices will be excluded from the available
RAID0 targets. It does the same with physical volumes from the device
mapper.

Fixes #1189
Message-Id: <1466001423-9547-4-git-send-email-benoit@scylladb.com>
2016-06-15 19:36:11 +03:00
Paweł Dziepak
c8e75d2e84 schema: cache is_atomic() in column_definition
is_atomic() is called for each cell in mutation applies, compaction
and query. Since the value doesn't change it can be easily cached which
would save one indirection and virtual call.

Results of perf_simple_query -c1 (median, duration 60):
         before      after
read   54611.49   55396.01   +1.44%
write  65378.92   68554.25   +4.86%

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1465991045-11140-1-git-send-email-pdziepak@scylladb.com>
2016-06-15 19:18:13 +03:00
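The optimization in c8e75d2e84 amounts to evaluating a virtual call once at construction and storing the result. The sketch below shows that pattern with hypothetical types (`abstract_type`, `int_type`, `list_type`, `column_def`); it assumes, as the message states, that the value never changes after construction.

```cpp
#include <memory>
#include <utility>

// Hypothetical type hierarchy with a virtual is_atomic().
struct abstract_type {
    virtual ~abstract_type() = default;
    virtual bool is_atomic() const = 0;
};

struct int_type : abstract_type {
    bool is_atomic() const override { return true; }
};

struct list_type : abstract_type {
    bool is_atomic() const override { return false; }
};

// Toy column definition that caches the answer once, so the hot path reads
// a plain bool instead of following a pointer and making a virtual call.
class column_def {
    std::shared_ptr<const abstract_type> _type;
    bool _is_atomic; // initialized after _type (declaration order)
public:
    explicit column_def(std::shared_ptr<const abstract_type> type)
        : _type(std::move(type)), _is_atomic(_type->is_atomic()) {}

    bool is_atomic() const { return _is_atomic; }
};
```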
Benoît Canet
4def1f4524 dist: sysctl.d: Disable automatic numa balancing
On NUMA hardware, autonuma may reduce performance by
unmapping memory.

Since we do manual NUMA placement, autonuma will not
help anything.

We ought to disable it by setting the kernel.numa_balancing
sysctl to 0.

Fixes: #1120

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466006345-9972-1-git-send-email-benoit@scylladb.com>
2016-06-15 19:11:00 +03:00
Gleb Natapov
7f54333c45 storage_proxy: fix compilation on older boost
Boost before 1.56.0 had a broken boost::size() implementation. Do not use
it.

Message-Id: <20160615123134.GD5961@scylladb.com>
2016-06-15 15:34:57 +03:00
Asias He
de0fd98349 repair: Switch log level to warn instead of error
dtest treats error-level log messages as serious errors. It is not a serious
error for streaming to fail to send a verb and fail a streaming session, which
triggers a repair failure, for example when the peer node is gone or
stopped. Switch to log level warn instead of level error.

Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test

Fixes: #1335
Message-Id: <406fb0c4a45b81bd9c0aea2a898d7ca0787b23e9.1465979288.git.asias@scylladb.com>
2016-06-15 13:01:35 +03:00
Asias He
94c9211b0e streaming: Switch log level to warn instead of error
dtest treats error-level log messages as serious errors. It is not a serious
error for streaming to fail to send a verb and fail a streaming session, for
example when the peer node is gone or stopped. Switch to log level warn
instead of level error.

Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test

Fixes: #1335
Message-Id: <0149d30044e6e4d80732f1a20cd20593de489fc8.1465979288.git.asias@scylladb.com>
2016-06-15 13:01:22 +03:00
Vlad Zolotarov
c616e74ae4 locator::gossiping_property_file_snitch: use a lowres_clock time source for a timer
gossiping_property_file_snitch checks a configuration file every 60s.
lowres_clock clock source should be good enough for that.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465314448-11611-1-git-send-email-vladz@cloudius-systems.com>
2016-06-15 13:01:05 +03:00
Tomasz Grabiec
207c8d94f1 idl: Rename variable to a more meaningful name
Message-Id: <1465909911-10534-2-git-send-email-tgrabiec@scylladb.com>
2016-06-14 17:02:59 +03:00
Raphael S. Carvalho
80d8c5ef6f compaction: use proper type in constructor
Correctness is not affected because a long type is used, but an unsigned
long type should definitely be used instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d3ab15a3206306de195aeb3d78f9b5bc4ca9208e.1465908970.git.raphaelsc@scylladb.com>
2016-06-14 17:02:32 +03:00
Tomasz Grabiec
8e8f63de85 mutation_partition_view: Avoid unnecessary copy into temporary
Message-Id: <1465909038-8174-1-git-send-email-tgrabiec@scylladb.com>
2016-06-14 17:02:17 +03:00
Tomasz Grabiec
75f899cc93 lsa: Make reclamation step configurable via config 2016-06-14 15:13:15 +02:00
Tomasz Grabiec
cd9955d2ce lsa: Reclaim 1 segment by default
Reclaiming many segments was observed to cause up to multi-ms
latency. With the new setting, the latency of reclamation cycle with
full segments (worst case mode) is below 1ms.

I saw no decrease in throughput compared to the step of 16 segments in
either of these modes:
  - full segments, reclaim by random eviction
  - sparse segments (3% occupancy), reclaim by compaction and no eviction

Fixes #1274.
2016-06-14 15:13:15 +02:00
Tomasz Grabiec
86b76171a8 lsa: Use the same step in both internal and external reclamations 2016-06-14 15:13:15 +02:00
Tomasz Grabiec
d74d902a01 lsa: Make reclamation step configurable 2016-06-14 15:13:14 +02:00
Tomasz Grabiec
93bb95bd0d lsa: Log reclamation rate 2016-06-14 15:13:14 +02:00
Tomasz Grabiec
cb18418022 lsa: Print more details before aborting 2016-06-14 15:13:14 +02:00
Tomasz Grabiec
7cb98c916f tests: lsa_async_eviction_test: Push to refs with reclaim lock
push_back() is not reentrant with pop_front(), used by the evictor. If
the reclaimer runs while std::deque allocates a new node, the deque will get
corrupted. Fix by running push_back() under the reclaim lock.
2016-06-14 15:13:14 +02:00
Tomasz Grabiec
de8772525a tests: lsa_async_eviction_test: Make sure refs scope encloses reclaimer scope 2016-06-14 15:13:14 +02:00
Tomasz Grabiec
c4a556ac13 tests: lsa_async_eviction_test: Fix use after free due to at_exit() callback
The callback will run after the thread is destroyed. We don't really need
the stop feature, so for now just remove it.
2016-06-14 15:13:14 +02:00
Pekka Enberg
155ad2eeb5 storage_service: Fix start_rpc_server() to use logger
Message-Id: <1465882880-7392-1-git-send-email-penberg@scylladb.com>
2016-06-14 09:52:04 +02:00
Raphael S. Carvalho
0b2cd41daf database: remember sstable level when cleaning it up
Cleanup operation wasn't preserving level of sstables. That will have
a bad impact on performance because compaction work is lost.

Fixes #1317.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ce8fbbb4590725bb0414e6a5450fcbe6cb7212.1465843387.git.raphaelsc@scylladb.com>
2016-06-14 08:06:00 +03:00
Vlad Zolotarov
d3960f0bbb tracing: rearrange shut down
The tracing::tracing local instance is dereferenced from
cql_server::connection::process_request(), therefore the tracing::tracing
service may be stop()ed only after the CQL server service is down.
On the other hand, it may not be stopped before the RPC service is down,
because a remote side may request tracing for a specific command too.

This patch splits the tracing::tracing stop() into two phases:
   1) Flush all pending tracing records and stop the backend.
   2) Stop the service.

The first phase is called after CQL server is down and before RPC is down.
The second phase is called after RPC is down.

Fixes #1339

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465840496-19990-1-git-send-email-vladz@cloudius-systems.com>
2016-06-14 07:58:04 +03:00
Avi Kivity
49449fc30c Merge seastar upstream
* seastar 864d6dc...401c333 (8):
  > scollectd: Support filtering specific collectd metrics
  > core: Integrate error reporting with the logging framework
  > rpc: wait for all replies to be completed before closing rpc server
  > rpc: clean up resource accounting
  > queue: fix race between pop_eventually() and abort()
  > rpc_test: fix cancel test to not depend on timing.
  > tutorial: explain application-specific command line options
  > add ostream output operator for std::unordered_map
2016-06-13 19:35:00 +03:00
Gleb Natapov
e089166cfa storage_proxy: wait only for expected CL when writing back data during read repair
When read repair writes diffs back to replicas it is enough to wait
for the requested CL to guarantee read monotonicity. This patch makes the read
repair write reuse the regular mutate functionality, which already tracks
CL status. This is done by changing the write response handler to not hold
a mutation directly, but instead hold a container that, depending on
whether this is a read repair write or a regular one, can provide a different
mutation per destination.

Message-Id: <20160613124727.GL1096@scylladb.com>
2016-06-13 19:01:51 +03:00
Duarte Nunes
c896309383 database: Actually decrease query_state limit
query_state expects the current row limit to be updated so it
can be enforced across partition ranges. A regression introduced
in e4e8acc946 prevented that from happening by passing a copy of
the limit to querying_reader.

This patch fixes the issue by having column_family::query update
the limit as it processes partitions from the querying_reader.

Fixes #1338

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1465804012-30535-1-git-send-email-duarte@scylladb.com>
2016-06-13 10:03:27 +02:00
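The regression described in c896309383 is the classic effect of handing a callee a copy of a counter the caller is supposed to see shrink. The sketch below is a toy model under that assumption: partitions are just row counts, and `read_partition`, `query_without_update`, and `query_with_update` are hypothetical names, not Scylla functions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Reads up to `limit` rows from one partition and reports how many it took.
// The limit arrives by value, so the caller's copy is untouched.
uint32_t read_partition(uint32_t rows_in_partition, uint32_t limit) {
    return std::min(rows_in_partition, limit);
}

// Shape of the bug: the remaining limit is never decreased, so every
// partition is read as if the full limit were still available.
uint32_t query_without_update(const std::vector<uint32_t>& partitions,
                              uint32_t limit) {
    uint32_t total = 0;
    for (uint32_t rows : partitions) {
        total += read_partition(rows, limit);
    }
    return total;
}

// Shape of the fix: the caller decreases the limit as partitions are
// processed, so the limit is enforced across all of them.
uint32_t query_with_update(const std::vector<uint32_t>& partitions,
                           uint32_t limit) {
    uint32_t total = 0;
    for (uint32_t rows : partitions) {
        const uint32_t taken = read_partition(rows, limit);
        limit -= taken;
        total += taken;
    }
    return total;
}
```

For three partitions of three rows each and a limit of 5, the first variant returns 9 rows and the second returns 5.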
Avi Kivity
465c0a4ead Merge "Make stronger guarantees in row_cache's clear/invalidate" from Tomasz
"Correctness of current uses of clear() and invalidate() relies on the fact
that the cache is not populated using readers created before
invalidation. Sstables are first modified and then the cache is
invalidated. This is not guaranteed by the current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.

To provide such a guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly to how row_cache::update()
does.

Fixes #1291."
2016-06-13 09:55:29 +03:00
Shlomi Livne
ac6f2b5c13 dist/common: Update scylla_io_setup to use settings done in cpuset.conf
scylla_io_setup searches for the --smp and --cpuset settings in
SCYLLA_ARGS. We have moved the setting of these args into
/etc/scylla.d/cpuset.conf, and they are set by scylla_cpuset_setup into
CPUSET.

Fixes: #1327

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <2735e3abdd63d245ec96cfa1e65f766b1c12132e.1465508701.git.shlomi@scylladb.com>
2016-06-10 09:37:44 +03:00
Vlad Zolotarov
89375d4c2a service::storage_proxy: tracing: instrument read_digest and read_mutation_data
Instrument read_digest and read_mutation_data handlers similarly
to a read_data handler instrumentation.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465304055-4263-1-git-send-email-vladz@cloudius-systems.com>
2016-06-09 14:32:42 +02:00
Pekka Enberg
8df5aa7b0c utils/exceptions: Whitelist EEXIST and ENOENT in should_stop_on_system_error()
There are various call-sites that explicitly check for EEXIST and
ENOENT:

  $ git grep "std::error_code(E"
  database.cc:                            if (e.code() != std::error_code(EEXIST, std::system_category())) {
  database.cc:            if (e.code() != std::error_code(ENOENT, std::system_category())) {
  database.cc:        if (e.code() != std::error_code(ENOENT, std::system_category())) {
  database.cc:                            if (e.code() != std::error_code(ENOENT, std::system_category())) {
  sstables/sstables.cc:            if (e.code() == std::error_code(ENOENT, std::system_category())) {
  sstables/sstables.cc:            if (e.code() == std::error_code(ENOENT, std::system_category())) {

Commit 961e80a ("Be more conservative when deciding when to shut down
due to disk errors") turned these errors into a storage_io_exception
that is not expected by the callers, which causes 'nodetool snapshot'
functionality to break, for example.

Whitelist the two error codes to revert back to the old behavior of
io_check().
Message-Id: <1465454446-17954-1-git-send-email-penberg@scylladb.com>
2016-06-09 10:03:04 +02:00
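The whitelisting in 8df5aa7b0c can be pictured as a predicate over `std::system_error`. The sketch below is an assumption about the shape of such a check, not the actual body of should_stop_on_system_error(); the function name `should_stop_sketch` is hypothetical.

```cpp
#include <cerrno>
#include <system_error>

// Hypothetical predicate: returns false for the two whitelisted error codes
// so that callers still see the original std::system_error, and true for
// every other system error.
bool should_stop_sketch(const std::system_error& e) {
    if (e.code().category() == std::system_category()) {
        switch (e.code().value()) {
        case EEXIST:
        case ENOENT:
            return false; // expected by callers, not a disk failure
        default:
            break;
        }
    }
    return true;
}
```

Callers such as the ones listed by the `git grep` output above can then keep comparing `e.code()` against `std::error_code(ENOENT, std::system_category())` as before.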