Commit Graph

22435 Commits

Author SHA1 Message Date
Glauber Costa
45f3bc679e distributed_loader: assume populate_column_families is run in shard 0
This is already the case, since main.cc calls it from shard 0 and
relies on it to spread the information to the other shards. We will
turn this branch - which is always taken - into an assert for the
sake of future-proofing and soon add even more code that relies on this
being executed in shard 0.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:27 -04:00
Glauber Costa
bb07678346 api: do not allow user to meddle with auto compaction too early
We are about to use the auto compaction property during the
populate/reshard process. If the user toggles it, the database can be
left in a bad state.

There should be no reason why a user would want to set that up this
early. So we'll disallow it.

To do that properly, it is better if the check of whether or not
the storage service is ready to accommodate this request is local
to the storage service itself. We then move the logic of set_tables_autocompaction
from the api to the storage service. The API layer now merely translates
the table names and passes them along.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:25 -04:00
Glauber Costa
1c70a7c54e upload: use custom error handler for upload directory
SSTables created for the upload directory should be using its custom error
handler.

There is one user of the custom error handler in tree, which is the current
upload directory function. As we will use a free function instead of a lambda
in our implementation, we also take the opportunity to fix it for consistency.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-16 19:42:19 -04:00
Glauber Costa
c188aef088 sstable_directory: fix debug message
I just noticed while working on the reshape patches that there
is an extra format bracket in two of the debug messages. As they
are debug-level, I've seen them less often than the others, so this slipped through.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-16 19:42:19 -04:00
Yaron Kaikov
ac7237f991 dbuild: Add an option to run with 'docker' or 'podman'
This adds support for configuring whether to run dbuild with 'docker' or
'podman' via a new environment variable, DBUILD_TOOL. While at it, check
if 'podman' exists, and prefer that by default as the tool for dbuild.
2020-06-16 15:18:46 +03:00
Gleb Natapov
7ca937778d cql transport: do not log broken pipe error when a client closes its side of a connection abruptly
Fixes #5661

Message-Id: <20200615075958.GL335449@scylladb.com>
2020-06-16 13:59:12 +02:00
Nadav Har'El
41a049d906 README: better explanation of dependencies and build
In this patch I rewrote the explanations in both README.md and HACKING.md
about Scylla's dependencies, and about dbuild.

README.md used to mention only dbuild. It now explains better (I think)
why dbuild is needed in the first place, and that the alternative is
explained in HACKING.md.

HACKING.md used to explain *only* install-dependencies.sh - it now explains
why it is needed, what install-dependencies.sh does, and that it ONLY works on
very recent distributions (e.g., Fedora releases older than 32 are not supported),
and now also mentions the alternative - dbuild.

Mentions of incorrect requirements (like "gcc > 8.1") were fixed or dropped.

Mention of the archaic 'scripts/scylla_current_repo' script, which we used
to need to install additional packages on non-Fedora systems, was dropped.
The script itself is also removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616100253.830139-1-nyh@scylladb.com>
2020-06-16 13:26:04 +02:00
Avi Kivity
bd794629f9 range: rename range template family to interval
nonwrapping_range<T> and related templates represent mathematical
intervals, and are different from C++ ranges. This causes confusion,
especially when C++ ranges and the range templates are used together.

As the first step to disentangle this, introduce a new interval.hh
header with the contents of the old range.hh header, renaming as
follows:

  range_bound  -> interval_bound
  nonwrapping_range -> nonwrapping_interval
  wrapping_range -> wrapping_interval
  Range -> Interval (concepts)

The range alias, which previously aliased wrapping_range, did
not get renamed - instead the interval alias now aliases
nonwrapping_interval, which is the natural interval type. I plan
to follow up making interval the template, and nonwrapping_interval
the alias (or perhaps even remove it).

To avoid churn, a new range.hh header is provided with the old names
as aliases (range, nonwrapping_range, wrapping_range, range_bound,
and Range) with the same meaning as their former selves.
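The compatibility-alias approach can be sketched like this (a minimal
stand-in: the real templates live in interval.hh and the stub bodies here
are placeholders, not the actual definitions):

```cpp
#include <type_traits>

// Stub stand-ins for the templates that interval.hh actually defines.
template <typename T> struct wrapping_interval {};
template <typename T> struct nonwrapping_interval {};
template <typename T> struct interval_bound {};

// range.hh-style compatibility aliases: the old names keep compiling,
// with the same meaning as their former selves.
template <typename T> using wrapping_range    = wrapping_interval<T>;
template <typename T> using nonwrapping_range = nonwrapping_interval<T>;
template <typename T> using range             = wrapping_interval<T>;    // old default alias
template <typename T> using interval          = nonwrapping_interval<T>; // new default alias
template <typename T> using range_bound       = interval_bound<T>;

static_assert(std::is_same_v<range<int>, wrapping_interval<int>>);
static_assert(std::is_same_v<interval<int>, nonwrapping_interval<int>>);
```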

Tests: unit (dev)
2020-06-16 13:36:20 +03:00
Piotr Sarna
3bcc2e8f09 Merge 'hinted handoff: improve segment replay logic' from PiotrD
This series contains two improvements to hint file replay logic
in hints manager:

- During replay of a hint file, keeping track of the first hint that fails
  to be sent is now done via a simple std::optional variable instead of an
  unordered_set. This slightly reduces complexity of next replay position
  calculation.
- A corner case is handled: if reading the commitlog fails but no error occurs
  while sending hints, the starting position wouldn't be updated. This
  could cause us to replay more hints than necessary.
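The optional-based tracking from the first item can be sketched as follows
(replay_tracker and the uint64_t replay_position are hypothetical stand-ins
for the hint manager's actual types, not its real interface):

```cpp
#include <cstdint>
#include <optional>

using replay_position = uint64_t;  // stand-in for the real replay position type

struct replay_tracker {
    // Only the earliest failure matters for the next replay start,
    // so one std::optional replaces a whole unordered_set of failures.
    std::optional<replay_position> first_failed;

    void mark_failed(replay_position rp) {
        if (!first_failed || rp < *first_failed) {
            first_failed = rp;
        }
    }

    // Next replay starts from the first failure, or past the end on success.
    replay_position next_position(replay_position end) const {
        return first_failed.value_or(end);
    }
};
```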

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test, dev)

* piodul-hints-manager-handle-commitlog-failure-in-replay-position-calculation:
  hinted handoff: use bool instead of send_state_set
  hinted handoff: update replay position on commitlog failure
  hinted handoff: remove rps_set, use first_failed_rp instead
2020-06-16 12:24:55 +02:00
Avi Kivity
6ba7b8f3f5 Update seastar submodule
* seastar 81242ccc3f...8f0858cfd7 (18):
  > Merge 'future, future-utils: stop returning a variadic future from when_all_succeed'
  > file: introduce layered_file_impl, a helper for layered files
  > net: packet: mark move assignment operator as noexcept
  > core: weak_ptr, weakly_referencable: implement empty default constructor
  > circular_buffer: Fix build with gcc 11 (avoid template parameters in d'tor declaration)
  > test: weak_ptr_test: fix static asserts about nothrow constructibility
  > coroutines: Fix clang build
  > cmake: Delete SEASTAR_COROUTINES_TS
  > Merge "future-util: Mark a few more functions as noexcept" from Rafael
  > tests: add a perf test to measure the fair_queue performance
  > Merge "iostream: make iostream stack nothrow move constructible" from Benny
  > future: Move most of rethrow_with_nested out of line.
  > future_test: Add test for nested exceptions in finally
  > core: Add noexcept to unaligned members functions
  > Merge "core: make weak_ptr and checked_ptr default and move nothrow constructible" from Benny
  > core: file: Fix typo in a comment
  > byteorder: Mark functions as noexcept
  > future: replace CanInvoke concepts with std::invocable
2020-06-16 13:19:36 +03:00
Piotr Sarna
e59d41dad6 alternator: use plain function pointer instead of std::function
Since all function handlers are plain functions without any state,
there's no need for wrapping them with a 32-byte std::function
when a plain function pointer would suffice.
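A small illustration of the size argument (handle_ping is a hypothetical
handler, not one of alternator's):

```cpp
#include <functional>
#include <string>

// A stateless handler captures nothing, so a plain function pointer
// suffices; std::function reserves extra storage for possible captures.
inline std::string handle_ping() { return "pong"; }

using handler_fn = std::string (*)();   // plain function pointer

// std::function is considerably larger than a bare pointer
// (32 bytes on libstdc++, per the commit message).
static_assert(sizeof(std::function<std::string()>) > sizeof(handler_fn),
              "std::function carries extra storage");
```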

Reported-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <913c1de7d02c252b40dc0c545989ec83fe74e5a9.1592291413.git.sarna@scylladb.com>
2020-06-16 12:08:21 +03:00
Raphael S. Carvalho
238ba899c0 compaction_manager: use double for backlog everywhere
Avi says:
"The backlog is a large number that changes slowly, so float
might not have enough resolution to track small changes.

For example, if the backlog is 800GB and changes less than 100kB, then
we won't see a change (float resolution is 2^23 ~ 1:8,000,000).

This is outside the normal range of values (usually the backlog changes
a lot more than 100kB per 15-second period), so it will work, but better
to be more careful."
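The resolution argument above can be checked directly: near 800e9, one
float ULP is 65536, so an update smaller than half that is silently lost
(a minimal sketch, not the compaction manager's code):

```cpp
// float's 24-bit mantissa gives ~1:8,000,000 relative resolution, so an
// 800 GB backlog stored as float cannot register a 30 kB change; the
// same change is easily tracked by a double.
inline bool float_sees_update(float backlog, float delta) {
    return backlog + delta != backlog;
}

inline bool double_sees_update(double backlog, double delta) {
    return backlog + delta != backlog;
}
```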

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200615150621.17543-1-raphaelsc@scylladb.com>
2020-06-16 12:05:05 +03:00
Piotr Sarna
45bf039357 alternator: use has_function instead of try-catch
With the new interface available, the try-catch idiom
can be removed, thus resolving a TODO.

Tests: unit(dev)

Message-Id: <788a29f8f9d7bcf952b28a6148670dbadb97a619.1592233511.git.sarna@scylladb.com>
2020-06-15 23:55:20 +03:00
Piotr Sarna
911dee5417 schema: add has_column utility function
With this simple helper function, a code snippet in alternator
can be transformed from try-catch to a simple condition.
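The idea can be mocked up like this (a hypothetical schema struct, not
scylla's actual schema class): a boolean has_column() lets callers write a
plain condition instead of catching the exception a throwing lookup raises.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative stand-in for a schema with named columns.
struct schema {
    std::vector<std::string> columns;

    // Boolean query: callers can branch on this instead of try-catch
    // around a lookup that throws when the column is missing.
    bool has_column(const std::string& name) const {
        return std::find(columns.begin(), columns.end(), name) != columns.end();
    }
};
```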

Message-Id: <553debf4e91c0511566e53e2c8a5e8e6ee6552e2.1592233511.git.sarna@scylladb.com>
2020-06-15 23:55:06 +03:00
Piotr Sarna
b1684cf2e1 alternator: move function handlers to a lookup map
Instead of a long chain of `if` statements, handlers are now
created in a static map.
Fixes a TODO in the code.
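The shape of the change can be sketched as follows (operation names and
handler signatures here are illustrative, not alternator's actual ones):

```cpp
#include <string>
#include <unordered_map>

// Stand-in stateless handlers.
inline std::string get_item() { return "GetItem handled"; }
inline std::string put_item() { return "PutItem handled"; }

using handler = std::string (*)();  // plain function pointer, no state

// Static lookup map replacing a long chain of `if (op == ...)` statements.
static const std::unordered_map<std::string, handler> handlers = {
    {"GetItem", get_item},
    {"PutItem", put_item},
};

inline std::string dispatch(const std::string& op) {
    auto it = handlers.find(op);
    return it == handlers.end() ? "UnknownOperation" : it->second();
}
```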

Tests: unit(dev)

Message-Id: <0ea577a44dd56859da170fe82c16c8f810f9d695.1592232448.git.sarna@scylladb.com>
2020-06-15 23:44:45 +03:00
Piotr Sarna
e76fba6f86 alternator: remove outdated TODO for adding timeouts
The TODO is already fixed, not to mention that it had
an incorrect ordinal number (:
Message-Id: <006dc3061e0f30641c2e63ff471686f4c2e82829.1592230155.git.sarna@scylladb.com>
2020-06-15 23:04:42 +03:00
Tomasz Grabiec
1c5db178dd Merge "logalloc: Get rid of segments migration" from Pavel
But not compaction.

When reclaiming segments to seastar, non-empty segments are copied
as-is to some other place. Instead of doing this, the reclaimer can copy
only the allocated objects and leave the freed holes behind, i.e. do
the regular compaction. This is the same or better from the
timing perspective, and helps avoid yet another compaction
pass over the same set of objects in the future.

The current migration code checks that the free segment reserve is
above the minimum before proceeding with migration, and so does the code
after this patch; thus segment compaction is called with a non-empty
free segment set, which guarantees that allocating a new segment (if
one is required at all) will not fail.

Plus some bikeshedding patches for the run-up.

tests: unit(dev)

* https://github.com/xemul/scylla/tree/br-logalloc-compact-on-reclaim-2:
  logalloc: Compact segments on reclaim instead of migration
  logallog: Introduce RAII allocation lock
  logalloc: Shuffle code around region::impl::compact
  logalloc: Do not lock reclaimer twice
  logalloc: Do not calculate object size twice
  logalloc: Do not convert obj_desc to migrator back and forth
2020-06-15 16:28:16 +02:00
Glauber Costa
093328741d compaction: test that sstable set is not null in update_pending_ranges
The sstable_set is now an optional, and if we don't want to expire data
it will be empty. We need to check that it is not empty before dereferencing
it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200610170647.142817-1-glauber@scylladb.com>
2020-06-15 15:43:08 +02:00
Tomasz Grabiec
e81fc1f095 row_cache: Fix undefined behavior on key linearization
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).

There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:

 1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.

 2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.

Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:

  auto prev = std::prev(it);
  if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
     it->set_continuous(true);

but handle_end_of_stream() is not invoked under allocating section.

Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:

  return with_linearized_managed_bytes([&] {
  ...
    return _worker_state->alloc_section(region, [&] {

Fixes #6637.
Refs #6108.

Tests:

  - unit (all)

Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
2020-06-15 16:03:33 +03:00
Nadav Har'El
86a4dfcd29 merge: api: Command to check and repair cdc streams
Merged pull request https://github.com/scylladb/scylla/pull/6551
from Juliusz Stasiewicz:

The command regenerates streams when:

    generations corresponding to a gossiped timestamp cannot be
    fetched from the system_distributed table,
    or when generation token ranges do not align with token metadata.

In such a case the streams are regenerated and a new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether the streams needed regeneration or not.

Fixes #6498
Accompanied by: scylladb/scylla-jmx#109, scylladb/scylla-tools-java#172
2020-06-15 14:17:35 +03:00
Takuya ASADA
ecc83e83e5 scylla_cpuscaling_setup: move the unit file to /etc/systemd
Since scylla-cpupower.service isn't installed by the .rpm package, but created
by the setup script, it's better not to use the /usr/lib directory; use /etc instead.

We already do the same for the scylla-server.service.d/*.conf, *.mount, and
*.swap files created by setup scripts.
2020-06-15 11:36:20 +03:00
Asias He
61e4387811 repair: Relax node selection in decommission for non network topology strategy
In the decommission operation, the current code requires a node in the
local dc to sync data with. This requirement is too strong for a
non-network-topology strategy. For example, consider:

   n1 dc1
   n2 dc1
   n3 dc2

n2 runs the decommission operation. For a keyspace with the simple strategy and
RF = 2, it is possible that n3 is the new owner, but n3 is not in the same dc
as n2.

To fix, perform the dc check only for the network topology strategy.

Fixes #6564
2020-06-15 11:26:02 +03:00
Avi Kivity
d17b05e911 Merge 'Adding Optimized pseudo floating point estimated histogram' from Amnon
"
This series adds a pseudo-floating-point histogram implementation.
The histogram is used for time_estimated_histogram, a histogram for latency tracking, which is then used in storage_proxy as a more efficient, higher-resolution histogram.

A follow-up series will use the new histogram in other places in the system and will add an implementation that supports lower values.
Fixes #5815
Fixes #4746
"

* amnonh-quicker_estimated_histogram:
  storage_proxy: use time_estimated_histogram for latencies
  test/boost/estimated_histogram_test
  utils/histogram_metrics_helper Adding histogram converter
  utils/estimated_histogram: Adding approx_exponential_histogram
2020-06-15 10:19:36 +03:00
Avi Kivity
493d16e800 build: fix --enable-dpdk/--disable-dpdk configure switch
5ceb20c439 switched --enable-dpdk
to a tristate switch, but forgot that add_tristate() prepends
--enable and --disable itself; so now the switch looks like
--enable-enable-dpdk and --disable-enable-dpdk.

Fix by removing the "enable-" prefix.
2020-06-15 09:37:45 +03:00
Amnon Heiman
6e1f042b93 storage_proxy: use time_estimated_histogram for latencies
This patch changes storage_proxy to use time_estimated_histogram.

Besides the type, it changes how values are inserted and how the
histogram is used by the API.

An example of how a metric looks after the change:
scylla_storage_proxy_coordinator_write_latency_bucket{le="640.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="768.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="896.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1024.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1280.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1536.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1792.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2
scylla_storage_proxy_coordinator_write_latency_bucket{le="2048.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2
scylla_storage_proxy_coordinator_write_latency_bucket{le="2560.000000",scheduling_group_name="statement",shard="0",type="histogram"} 3
scylla_storage_proxy_coordinator_write_latency_bucket{le="3072.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5
scylla_storage_proxy_coordinator_write_latency_bucket{le="3584.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5
scylla_storage_proxy_coordinator_write_latency_bucket{le="4096.000000",scheduling_group_name="statement",shard="0",type="histogram"} 7
scylla_storage_proxy_coordinator_write_latency_bucket{le="5120.000000",scheduling_group_name="statement",shard="0",type="histogram"} 8
scylla_storage_proxy_coordinator_write_latency_bucket{le="6144.000000",scheduling_group_name="statement",shard="0",type="histogram"} 9
scylla_storage_proxy_coordinator_write_latency_bucket{le="7168.000000",scheduling_group_name="statement",shard="0",type="histogram"} 11
scylla_storage_proxy_coordinator_write_latency_bucket{le="8192.000000",scheduling_group_name="statement",shard="0",type="histogram"} 11
scylla_storage_proxy_coordinator_write_latency_bucket{le="10240.000000",scheduling_group_name="statement",shard="0",type="histogram"} 19
scylla_storage_proxy_coordinator_write_latency_bucket{le="12288.000000",scheduling_group_name="statement",shard="0",type="histogram"} 49
scylla_storage_proxy_coordinator_write_latency_bucket{le="14336.000000",scheduling_group_name="statement",shard="0",type="histogram"} 132
scylla_storage_proxy_coordinator_write_latency_bucket{le="16384.000000",scheduling_group_name="statement",shard="0",type="histogram"} 294
scylla_storage_proxy_coordinator_write_latency_bucket{le="20480.000000",scheduling_group_name="statement",shard="0",type="histogram"} 1035
scylla_storage_proxy_coordinator_write_latency_bucket{le="24576.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2790
scylla_storage_proxy_coordinator_write_latency_bucket{le="28672.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5788
scylla_storage_proxy_coordinator_write_latency_bucket{le="32768.000000",scheduling_group_name="statement",shard="0",type="histogram"} 9815
scylla_storage_proxy_coordinator_write_latency_bucket{le="40960.000000",scheduling_group_name="statement",shard="0",type="histogram"} 19821
scylla_storage_proxy_coordinator_write_latency_bucket{le="49152.000000",scheduling_group_name="statement",shard="0",type="histogram"} 30063
scylla_storage_proxy_coordinator_write_latency_bucket{le="57344.000000",scheduling_group_name="statement",shard="0",type="histogram"} 38642
scylla_storage_proxy_coordinator_write_latency_bucket{le="65536.000000",scheduling_group_name="statement",shard="0",type="histogram"} 44987
scylla_storage_proxy_coordinator_write_latency_bucket{le="81920.000000",scheduling_group_name="statement",shard="0",type="histogram"} 51821
scylla_storage_proxy_coordinator_write_latency_bucket{le="98304.000000",scheduling_group_name="statement",shard="0",type="histogram"} 54197
scylla_storage_proxy_coordinator_write_latency_bucket{le="114688.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55054
scylla_storage_proxy_coordinator_write_latency_bucket{le="131072.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55363
scylla_storage_proxy_coordinator_write_latency_bucket{le="163840.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55520
scylla_storage_proxy_coordinator_write_latency_bucket{le="196608.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55545
scylla_storage_proxy_coordinator_write_latency_bucket{le="229376.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="262144.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="327680.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="393216.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="458752.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="524288.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="655360.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="786432.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="917504.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1048576.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1310720.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1572864.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1835008.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="2097152.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="2621440.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="3145728.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="3670016.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="4194304.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="5242880.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="6291456.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="7340032.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="8388608.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="10485760.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="12582912.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="14680064.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="16777216.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="20971520.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="25165824.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="29360128.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="33554432.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="+Inf",scheduling_group_name="statement",shard="0",type="histogram"} 55549

Fixes #4746

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:23:02 +03:00
Amnon Heiman
1cbc2e3d3e test/boost/estimated_histogram_test
This patch adds basic testing for the approx_exponential_histogram
implementation.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:22:57 +03:00
Amnon Heiman
f30f926703 utils/histogram_metrics_helper Adding histogram converter
This patch adds a helper function to convert from an
approx_exponential_histogram to a seastar::metrics::histogram.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:22:49 +03:00
Amnon Heiman
3319756f36 utils/estimated_histogram: Adding approx_exponential_histogram
This patch adds an efficient histogram implementation.
The implementation chooses efficiency over flexibility;
that is why templates are used.

How the approx_exponential_histogram pseudo-floating-point histogram
works: it splits the range [MIN, MAX] into log2(MAX/MIN) power-of-two
ranges, and then splits each of those ranges linearly according to a
given resolution.

For example, using a resolution of 4 is similar to using an
exponentially growing histogram with a coefficient of 1.2.

All values are uint64. To avoid having to handle corner cases, it is not
allowed to set MIN lower than the resolution.

The approx_exponential_histogram will probably not be used directly;
the first user is time_estimated_histogram, a histogram for durations.

It should be compared to the existing estimated_histogram.
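The bucket-index scheme described above can be sketched as follows (a
hypothetical illustration, not the actual scylla code; HIST_MIN and
HIST_RES are assumed values): each power-of-two "octave" above MIN is
divided linearly into RES sub-buckets.

```cpp
#include <cstdint>

constexpr uint64_t HIST_MIN = 128;  // assumed; must not be below HIST_RES
constexpr unsigned HIST_RES = 4;    // linear sub-buckets per power-of-two range

inline unsigned bucket_index(uint64_t v) {
    if (v < HIST_MIN) {
        return 0;
    }
    // Which power-of-two range [MIN*2^k, MIN*2^(k+1)) does v fall into?
    unsigned octave = 63 - __builtin_clzll(v / HIST_MIN);
    uint64_t lo = HIST_MIN << octave;                       // range lower bound
    unsigned sub = (unsigned)(((v - lo) * HIST_RES) / lo);  // linear split within range
    return octave * HIST_RES + sub;
}
```

With a resolution of 4 each bucket boundary grows by a factor of 1.25 per
step, close to the coefficient-of-1.2 comparison in the commit message.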

Performance comparison:
Comparison was done by inserting 2^20 values into
time_estimated_histogram and estimated_histogram.

In debug mode on a local machine, an insert operation took an average of
26.0 nanoseconds vs 342.2 nanoseconds.

In release mode, an insert operation took an average of 1.90 vs 8.28 nanoseconds.

Fixes #5815

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:22:43 +03:00
Piotr Sarna
23c63ec19d Merge 'alternator: implement FilterExpression' from Nadav
The main goal of this series is to implement FilterExpression - the
newer syntax for filtering results of Query and Scan requests.

This feature itself is just one simple patch - it just needs to have the
already-existing filtering code call the already-existing expression
evaluation code. However, before we can do this, we need a patch to
refactor the expression-evaluation interface (this patch also fixes
pre-existing bugs). Then we need three additional patches to fix
pre-existing bugs in various corner cases of expressions (these bugs
already existed in ConditionExpression but became visible in
tests for FilterExpression). Finally, at the end of the series, we also
do a bit of code cleanup.

After this series, the FilterExpression feature is complete, and all
tests for this feature pass.

Tests: unit(dev)

* 'alternator-filterexpression' of git://github.com/nyh/scylla:
  alternator: avoid unnecessary conversion to string
  alternator: move some code out of executor.cc
  alternator: implement FilterExpression
  alternator: improve error path of attribute_type() function
  alternator: fix begins_with() error path
  alternator: fix corner case of contains() function in conditions
  alternator: refactor resolving of references in expressions
2020-06-14 19:42:46 +02:00
Avi Kivity
4220ed849b Merge "Use abseil's hash map in a couple places" from Rafael
"
This is part of the work for replacing global sstring variables with
constexpr std::string_view ones.

To have std::string_view values we have to convert a few APIs to take
std::string_view instead of sstring references.

The API conversions are complicated by the fact that
std::unordered_map doesn't support heterogeneous lookup, so we need
another hash map.

The one provided by abseil seems like a natural choice since it has an
API that looks like what is being proposed for c++
(http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1690r0.html)
but is also much faster.

A nice side effect is that this series is a 0.46% win in

perf_simple_query --duration 16 --smp 1  -m4G

Over 500 runs with randomized section layout and environment on each
run.
"

* 'espindola/absl-v10' of https://github.com/espindola/scylla:
  database: Use a flat_hash_map for _ks_cf_to_uuid
  database: Use flat_hash_map for _keyspaces
  Add absl wrapper headers
  build: Link with abseil
  cofigure: Don't overwrite seastar_cflags
  Add abseil as a submodule
2020-06-14 18:26:59 +03:00
Rafael Ávila de Espíndola
336d541f58 database: Use a flat_hash_map for _ks_cf_to_uuid
Given that the key is a std::pair, we have to explicitly mark the hash
and eq types as transparent for heterogeneous lookup to work.

With that, pass std::string_view to a few functions that just check if
a value is in the map.

This increases the .text section by 11 KiB (0.03%).
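abseil's flat_hash_map gets this via transparent hash/eq functors; the
same heterogeneous-lookup idea can be demonstrated with the standard
ordered map, whose std::less<> transparent comparator has supported it
since C++14 (std::unordered_map only gained the equivalent in C++20).
The string key below is an illustrative stand-in for the commit's
std::pair key:

```cpp
#include <map>
#include <string>
#include <string_view>

// std::less<> is a transparent comparator, so find() accepts any type
// comparable with the key: a std::string_view lookup does not allocate
// a temporary std::string.
inline int lookup(const std::map<std::string, int, std::less<>>& m,
                  std::string_view key) {
    auto it = m.find(key);  // heterogeneous lookup, no key copy
    return it == m.end() ? -1 : it->second;
}
```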

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
6da9eef25f database: Use flat_hash_map for _keyspaces
This changes the hash map used for _keyspaces. Using a flat_hash_map
allows using std::string_view in has_keyspace thanks to the
heterogeneous lookup support.

This adds 200 KiB to .text, since this is the first use of absl and
brings in object files from the .a archive.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
dd0d4ae217 Add absl wrapper headers
Using these instead of using the absl headers directly adds support
for heterogeneous lookup with sstring as key.

There is no gain from having the hash function inline, so this
implements it in a .cc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
7d1f6725dd build: Link with abseil
It is a pity we have to list so many libraries, but abseil doesn't
provide a .pc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
2ad09aefb6 cofigure: Don't overwrite seastar_cflags
The variable seastar_cflags was being used for flags passed to seastar
and for flags extracted from the seastar.pc file.

This introduces a new variable for the flags extracted from the
seastar.pc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
383a9c6da9 Add abseil as a submodule
This adds the https://abseil.io library as a submodule. The patch
series that follows needs a hash table that supports heterogeneous
lookup, and abseil has a really good hash table that supports that
(https://abseil.io/blog/20180927-swisstables).

The library is still not available in Fedora, but it is fairly easy to
use it directly from a submodule.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:37 -07:00
Avi Kivity
08313106ce Merge 'Repair use table id instead of table name' from Asias
"
Use table_id instead of table_name in row level repair to find a table. It
guarantees we repair the same table even if a table is dropped and a new
table is created with the same name.

Refs: #5942
"

* asias-repair_use_table_id_instead_of_table_name:
  repair: Do not pass table names to repair_info
  repair: Add table_id to row_level_repair
  repair: Use table id to find a table in get_sharder_for_tables
  repair: Add table_ids to repair_info
  repair: Make func in tracker::run run inside a thread
2020-06-14 14:58:46 +03:00
Raphael S. Carvalho
9983fa8766 compaction_manager: Export backlog metric
This backlog metric holds the sum of backlog for all the tables
in the system. This is very useful for understanding the behavior
of the backlog trackers. That's how we managed to fix most of the
backlog bugs, like #6054, #6021, etc.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200612194908.39909-1-raphaelsc@scylladb.com>
2020-06-14 14:07:53 +03:00
Avi Kivity
76d082c2b2 Merge "Decouple client services from storage_service" from Pavel E
"
The cql_server and thrift servers are "owned" by storage_service for
the sake of managing them, i.e. starting and stopping. Since
other services (still) need the storage_service, this creates
dependency loops.

This set makes the client services independent from the storage
service. As a consequence, the auth service is also removed
from storage_service and made standalone. This, in turn, frees
some tests from the need to start and stop auth and makes
one step towards NOT join_cluster()-ing in unit tests.

Also the set fixes a few weird races on scylla start and stop
that can trigger local_is_initialized() asserts, and one case of
an unclear aborted shutdown where client services remain running
until the scylla process exits.

Yet another benefit is localization of "isolating" functionality
that sits deeper in storage_service than it should.

One thing that's not completely clean after this change is that the
cql server still needs to reference the service_memory_limiter
semaphore from the storage_service, but this will go away with one of
the next sets.

tests: unit(debug), manual start-stop,
       nodetool check of cql/thrift start/stop
"

* 'br-split-transport-1' of https://github.com/xemul/scylla:
  storage_service: Isolate isolator
  auth: Move away from storage_service
  auth: Move start-stop code into main
  main: Don't forget to stop cql/thrift when start is aborted
  thrift_controller: Switch on standalone
  thrift_controller: Pass one through management API
  thrift_controller: Move the code into thrift/
  thrift_controller: Introduce own lock for management
  thrift: Wrap start/stop/is_running code into a class
  cql_controller: Switch on standalone
  cql_controller: Pass one through management API
  cql_controller: Move the code into transport/
  cql_controller: Introduce own lock for management
  cql: Wrap start/stop/is_running code into a class
  api: Tune reg/unreg of client services control endpoints
2020-06-14 13:49:23 +03:00
Takuya ASADA
863293576c scylla_setup: add swapfile setup
Add swapfile setup to scylla_setup.

Fixes #6539
2020-06-14 13:18:51 +03:00
Amnon Heiman
06510a4752 service/storage_service.cc: Make effective_ownership preemptable
A lot is going on when calculating effective ownership.
For each node in the cluster, we need to go over all the ranges
belonging to that node and check whether that node is the owner or not.

This patch uses futurized loops with do_for_each so that the
calculation can preempt when needed.

The patch replaces the current for-loops with do_for_each and do_with
but keeps the logic.

Fixes #6380
2020-06-14 12:56:07 +03:00
Nadav Har'El
493d7e6716 alternator: avoid unnecessary conversion to string
In a couple of places, where we already have a std::string_view, there
is no need to convert it to a std::string (which requires an allocation).

One cool observation (by Piotr Sarna) is that a map over
std::string_view is fine when the strings in the map are always string
constants.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:16:26 +03:00
Nadav Har'El
8c026b9f10 alternator: move some code out of executor.cc
The source file alternator/executor.cc has grown too much, reaching almost
4,000 lines. In this patch I move about 400 lines out of executor.cc:

1. Some functions related to serialization of sets and lists were moved to
   serialization.cc,
2. Functions related to evaluating parsed expressions were moved to
   expressions.cc.

The header file expressions_eval.hh was also removed - the calculate_value()
functions now live in expressions.cc, so we can just define them in
expressions.hh, with no need for a separate header file.

This patch just moves code around. It doesn't make any functional changes.

Refs #5783.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:16:26 +03:00
Nadav Har'El
0b9f25ab50 alternator: implement FilterExpression
This patch provides a complete implementation for the FilterExpression
parameter - the newer syntax for filtering the results of the Query or
Scan operations.

The implementation is pretty straightforward - we already added earlier
a result-filtering framework to Alternator, and used it for the older
filtering syntax - QueryFilter and ScanFilter. All we had to do now was
to run the FilterExpression (which has the same syntax as a
ConditionExpression) on each individual item. The previous cleanup
patches were important to reduce the friction of running these expressions
on the items.

After the previous patches fixing small esoteric bugs in a few expression
functions, with this patch *all* the tests in test_filter_expression.py
now pass, and so do the two FilterExpression tests in test_query.py and
test_scan.py. As far as I know (and of course minus any bugs we'll discover
later), this marks the FilterExpression feature complete.

Fixes #5038.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:16:26 +03:00
Nadav Har'El
f87259a762 alternator: improve error path of attribute_type() function
The attribute_type() function, which can be used in expressions like
ConditionExpression and FilterExpression, is supposed to generate an
error if its second parameter is not one of the known types. What we
did until now was to just report a failed check in this case.

We already had a reproducing test with FilterExpression, but in this patch
we also add a test with ConditionExpression - which fails before this
patch and passes afterwards (and of course, passes with DynamoDB).

Fixes #6641.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:16:20 +03:00
Nadav Har'El
11d86dfb06 alternator: fix begins_with() error path
The begins_with() function should report an error if a constant is
passed to it which isn't one of the supported types - string or bytes
(e.g., a number).

The code we had to check this had wrong logic, though. If the item
attribute was also a number, we silently returned false, and didn't
go on to detect that the second parameter - a constant - was a number
too and should generate an error - not be silent.

Fixed and added a reproducing test case and another test to validate
my understanding of the type of parameters that begins_with() accepts.

Fixes #6640.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:13:23 +03:00
Nadav Har'El
f79a4e0e78 alternator: fix corner case of contains() function in conditions
It turns out that the contains() functions in the new syntax of
conditions (ConditionExpression, FilterExpression) is not identical
to the CONTAINS operator in the old-syntax conditions (Expected).

In the new syntax, one can check whether *any* constant object is contained
in a list. In the old syntax, the constant object must be of specific
types.

So we need to move this type check out of the check_CONTAINS() function
that both implementations share, and into just the implementation of
the old syntax (in conditions.cc).

This bug broke one of the FilterExpression tests, but this patch also
adds new tests for the different behaviour of ConditionExpression and
Expected - tests which also reproduce this issue and verify its fix.

Fixes #6639.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:02:14 +03:00
Nadav Har'El
13ef31f38b alternator: refactor resolving of references in expressions
In the DynamoDB API, expressions (e.g., ConditionExpression and many more)
may contain references to column names ("#name") or to values (":val")
given in a separate part of the request - ExpressionAttributeNames and
ExpressionAttributeValues respectively.

Before this patch, we resolved these references as part of the expression's
evaluation. This approach had two downsides:

1. It often misdiagnosed (both false negatives and false positives) cases
   of unused names and values in expressions. We already had two xfailing
   tests with examples - which pass after this patch. This patch also
   adds two additional tests, which failed before this patch and pass
   with it.

2. In one of the following patches we will add support for FilterExpression,
   where the same expression is used repeatedly on many items. It is a waste
   (as well as makes the code uglier) to resolve the same references again
   and again each time the expression is evaluated. We should be able
   to do it just once.

So this patch introduces an intermediate step between parsing and evaluating
an expression - "resolving" the expression. The new resolve_*() functions
modify the already parsed expression, replacing references to attribute
names and constant values by the actual names and values taken from the
request. The resolve_*() functions also keep track which references were
used, making it very easy to check (as DynamoDB does) if there are any
unused names or values, before starting the evaluation.

The interface of the evaluate() functions becomes much simpler - they no
longer need to know the original request (which was previously needed for
ExpressionAttributeNames/Values) or the table's schema (which was previously
needed only for some error checking), nor keep track of which references
were used. This simplification is helpful for using the expressions in contexts
where these things (request and schema) are no longer conveniently available,
namely in FilterExpression.

A small side-benefit of this patch is that it moves a bit of code, which
handled resolving of references in expressions, from executor.cc to
expressions.cc. This is just the first step in a bigger effort to
reduce the size of executor.cc by moving code to smaller source files.
There is no attempt in this patch to move as much code as we can.
We will move more code in a separate patch in this series.

Fixes #6572.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 11:57:13 +03:00
Glauber Costa
b0a0c207c3 twcs: move implementations to its own file
LCS and STCS already have their own files, reducing the clutter in
compaction_strategy.cc. Do the same for TWCS. I am doing this in
preparation for adding more functions.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200611230906.409023-6-glauber@scylladb.com>
2020-06-14 11:50:08 +03:00
Pavel Emelyanov
514a1580da storage_service: Isolate isolator
There is code that isolates a node on disk error. After all the previous
changes, this code can be collected in one place (it would be better to
move it out of storage_service entirely, but still).

This simplifies stop_transport(): it no longer needs to reschedule
itself on shard 0 a second time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00