Compare commits

..

21 Commits

Author SHA1 Message Date
Avi Kivity
2edc1ca039 tools: toolchain: update to Fedora 33 with clang 11
Update the toolchain to Fedora 33 with clang 11 (note the
build still uses gcc).

The image now creates a /root/.m2/repository directory; without
this the tools/jmx build fails on aarch64.

Add java-1.8.0-openjdk-devel since that is where javac lives now.
Add a JAVA8_HOME environment variable; wihtout this ant is not
able to find javac.

The toolchain is enabled for x86_64 and aarch64.
2020-10-28 17:02:53 +02:00
Avi Kivity
5ff5d43c7a Update tools/java submodule
* tools/java e97c106047...ad48b44a26 (1):
  > build: Add generated Thrift sources to multi-Java build
2020-10-28 16:52:25 +02:00
Pavel Emelyanov
b2ce3b197e allocation_strategy: Fix standard_migrator initialization
This is the continuation of 30722b8c8e, so let me re-cite Rafael:

    The constructors of these global variables can allocate memory. Since
    the variables are thread_local, they are initialized at first use.

    There is nothing we can do if these allocations fail, so use
    disable_failure_guard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201028140553.21709-1-xemul@scylladb.com>
2020-10-28 16:22:23 +02:00
Asias He
289a08072a repair: Make repair_writer a shared pointer
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:

class repair_writer {
   _writer_done[node_idx] =
      mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
         ...
      }).handle_exception([this] {
         ...
      });
}

The fiber access repair_writer object in the error handling path. We
wait for the _writer_done to finish before we destroy repair_meta
object which contains the repair_writer object to avoid the fiber
accessing already freed repair_writer object.

To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.

Fixes #7406

Closes #7430
2020-10-28 16:22:23 +02:00
Avi Kivity
4b9206a180 install: abort if LD_PRELOAD is set when executing a relocatable binary
LD_PRELOAD libraries usually have dependencies in the host system,
which they will not have access to in a relocatable environment
since we use a different libc. Detect that LD_PRELOAD is in use and if
so, abort with an error.

Fixes #7493.

Closes #7494
2020-10-28 16:22:23 +02:00
Avi Kivity
2a42fc5cde build: supply linker flags only to the linker, not the compiler
Clang complains if it sees linker-only flags when called for compilation,
so move the compile-time flags from cxx_ld_flags to cxxflags, and remove
cxx_ld_flags from the compiler command line.

The linker flags are also passed to Seastar so that the build-id and
interpreter hacks still apply to iotune.

Closes #7466
2020-10-28 16:22:23 +02:00
Avi Kivity
fc15d0a4be build: relocatable package: exclude tools/python3
python3 has its own relocatable package, no need to include it
in scylla-package.tar.gz.

Python has its own relocatable package, so packaging it in scylla-package.ta

Closes #7467
2020-10-28 16:22:23 +02:00
Avi Kivity
6eb3ba74e4 Update tools/java submodule
* tools/java f2e8666d7e...e97c106047 (1):
  > Relocatable Package: create product prefixed relocatable archive
2020-10-28 08:47:49 +02:00
Juliusz Stasiewicz
e0176bccab create_table_statement: Disallow default TTL on counter tables
In such attempt `invalid_request_exception` is thrown.
Also, simple CQL test is added.

Fixes #6879
2020-10-27 22:44:02 +02:00
Nadav Har'El
92b741b4ff alternator test: more tests for disabled streams and closed shards
We already have a test for the behavior of a closed shard and how
iterators previously created for it are still valid. In this patch
we add to this also checking that the shard id itself, not just the
iterator, is still valid.

Additionally, although the aforementioned test used a disabled stream
to create a closed shard, it was not a complete test for the behavior
of a disabled stream, and this patch adds such a test. We check that
although the stream is disabled, it is still fully usable (for 24 hours) -
its original ARN is still listed on ListStreams, the ARN is still usable,
its shards can be listed, all are marked as closed but still fully readable.

Both tests pass on DynamoDB, and xfail on Alternator because of
issue #7239 - CDC drops the CDC log table as soon as CDC is disabled,
so the stream data is lost immediately instead of being retained for
24 hours.

Refs #7239

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006183915.434055-1-nyh@scylladb.com>
2020-10-27 22:44:02 +02:00
Nadav Har'El
a57d4c0092 docs: clean up format of docs/alternator/getting-started.md
In https://github.com/scylladb/scylla-docs/pull/3105 it was noted that
the Sphynx document parser doesn't like a horizontal line ("---") in
the beginning of a section. Since there is no real reason why we must
have this horizontal line, let's just remove it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201001151312.261825-1-nyh@scylladb.com>
2020-10-27 22:44:02 +02:00
Avi Kivity
e2a02f15c2 Merge 'transport/system_ks: Add more info to system.clients' from Juliusz Stasiewicz
This patch fills the following columns in `system.clients` table:
* `connection_stage`
* `driver_name`
* `driver_version`
* `protocol_version`

It also improves:
* `client_type` - distinguishes cql from thrift just in case
* `username` - now it displays correct username iff `PasswordAuthenticator` is configured.

What is still missing:
* SSL params (I'll happily get some advice here)
* `hostname` - I didn't find it in tested drivers

Refs #6946

Closes #7349

* github.com:scylladb/scylla:
  transport: Update `connection_stage` in `system.clients`
  transport: Retrieve driver's name and version from STARTUP message
  transport: Notify `system.clients` about "protocol_version"
  transport: On successful authentication add `username` to system.clients
2020-10-27 22:44:02 +02:00
Amnon Heiman
52db99f25f scyllatop/livedata.py: Safe iteration over metrics
This patch change the code that iterates over the metrics to use a copy
of the metrics names to make it safe to remove the metrics from the
metrics object.

Fixes #7488

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-10-27 22:44:02 +02:00
Calle Wilund
1bc96a5785 alternator::streams: Make describe_stream use actual log ttl as window
Allows QA to bypass the normal hardcoded 24h ttl of data and still
get "proper" behaviour w.r.t. available stream set/generations.
I.e. can manually change cdc ttl option for alternator table after
streams enabled. Should not be exposed, but perhaps useful for
testing.

Closes #7483
2020-10-26 12:16:36 +02:00
Calle Wilund
4b65d67a1a partition_version: Change range_tombstones() to return chunked_vector
Refs #7364

The number of tombstones can be large. As a stopgap measure to
just returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.

Closes #7433
2020-10-26 11:54:42 +02:00
Benny Halevy
82aabab054 table: get rid of reshuffle_sstables
It is unused since 7351db7cab

Refs #6950

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201026074914.34721-1-bhalevy@scylladb.com>
2020-10-26 09:50:21 +02:00
Calle Wilund
46ea8c9b8b cdc: Add an "end-of-record" column to
Fixes #7435

Adds an "eor" (end-of-record) column to cdc log. This is non-null only on
last-in-timestamp group rows, i.e. end of a singular source "event".

A client can use this as a shortcut to knowing whether or not he has a
full cdc "record" for a given source mutation (single row change).

Closes #7436
2020-10-26 09:39:27 +02:00
Juliusz Stasiewicz
0251cb9b31 transport: Update connection_stage in system.clients 2020-10-12 18:44:00 +02:00
Juliusz Stasiewicz
6abe1352ba transport: Retrieve driver's name and version from STARTUP message 2020-10-12 18:37:19 +02:00
Juliusz Stasiewicz
d2d162ece3 transport: Notify system.clients about "protocol_version" 2020-10-12 18:32:00 +02:00
Juliusz Stasiewicz
acf0341e9b transport: On successful authentication add username to system.clients
The username becomes known in the course of resolving challenges
from `PasswordAuthenticator`. That's why username is being set on
successful authentication; until then all users are "anonymous".
Meanwhile, `AllowAllAuthenticator` (the default) does not request
username, so users logged with it will remain as "anonymous" in
`system.clients`.

Shuffling of code was necessary to unify existing infrastructure
for INSERTing entries into `system.clients` with later UPDATEs.
2020-10-06 18:52:46 +02:00
144 changed files with 1066 additions and 3503 deletions

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../scylla-seastar
url = ../seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=4.3.7
VERSION=666.development
if test -f version
then

View File

@@ -123,7 +123,7 @@ struct rjson_engaged_ptr_comp {
// as internally they're stored in an array, and the order of elements is
// not important in set equality. See issue #5021
static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2) {
if (!set1.IsArray() || !set2.IsArray() || set1.Size() != set2.Size()) {
if (set1.Size() != set2.Size()) {
return false;
}
std::set<const rjson::value*, rjson_engaged_ptr_comp> set1_raw;
@@ -137,107 +137,45 @@ static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2
}
return true;
}
// Moreover, the JSON being compared can be a nested document with outer
// layers of lists and maps and some inner set - and we need to get to that
// inner set to compare it correctly with check_EQ_for_sets() (issue #8514).
static bool check_EQ(const rjson::value* v1, const rjson::value& v2);
static bool check_EQ_for_lists(const rjson::value& list1, const rjson::value& list2) {
if (!list1.IsArray() || !list2.IsArray() || list1.Size() != list2.Size()) {
return false;
}
auto it1 = list1.Begin();
auto it2 = list2.Begin();
while (it1 != list1.End()) {
// Note: Alternator limits an item's depth (rjson::parse() limits
// it to around 37 levels), so this recursion is safe.
if (!check_EQ(&*it1, *it2)) {
return false;
}
++it1;
++it2;
}
return true;
}
static bool check_EQ_for_maps(const rjson::value& list1, const rjson::value& list2) {
if (!list1.IsObject() || !list2.IsObject() || list1.MemberCount() != list2.MemberCount()) {
return false;
}
for (auto it1 = list1.MemberBegin(); it1 != list1.MemberEnd(); ++it1) {
auto it2 = list2.FindMember(it1->name);
if (it2 == list2.MemberEnd() || !check_EQ(&it1->value, it2->value)) {
return false;
}
}
return true;
}
// Check if two JSON-encoded values match with the EQ relation
static bool check_EQ(const rjson::value* v1, const rjson::value& v2) {
if (v1 && v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
if (it1->name == "SS" || it1->name == "NS" || it1->name == "BS") {
return check_EQ_for_sets(it1->value, it2->value);
} else if(it1->name == "L") {
return check_EQ_for_lists(it1->value, it2->value);
} else if(it1->name == "M") {
return check_EQ_for_maps(it1->value, it2->value);
} else {
// Other, non-nested types (number, string, etc.) can be compared
// literally, comparing their JSON representation.
return it1->value == it2->value;
}
} else {
// If v1 and/or v2 are missing (IsNull()) the result should be false.
// In the unlikely case that the object is malformed (issue #8070),
// let's also return false.
if (!v1) {
return false;
}
if (v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if ((it1->name == "SS" && it2->name == "SS") || (it1->name == "NS" && it2->name == "NS") || (it1->name == "BS" && it2->name == "BS")) {
return check_EQ_for_sets(it1->value, it2->value);
}
}
return *v1 == v2;
}
// Check if two JSON-encoded values match with the NE relation
static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
return !check_EQ(v1, v2);
return !v1 || *v1 != v2; // null is unequal to anything.
}
// Check if two JSON-encoded values match with the BEGINS_WITH relation
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
if (v1_from_query) {
throw api_error::validation(format("begins_with supports only string or binary type, got: {}", *v1));
} else {
bad = true;
}
}
static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
// BEGINS_WITH requires that its single operand (v2) be a string or
// binary - otherwise it's a validation error. However, problems with
// the stored attribute (v1) will just return false (no match).
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
if (v2_from_query) {
throw api_error::validation(format("begins_with() supports only string or binary type, got: {}", v2));
} else {
bad = true;
}
throw api_error::validation(format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
}
if (bad) {
auto it2 = v2.MemberBegin();
if (it2->name != "S" && it2->name != "B") {
throw api_error::validation(format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
return false;
}
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
@@ -341,40 +279,24 @@ static bool check_NOT_NULL(const rjson::value* val) {
return val != nullptr;
}
// Only types S, N or B (string, number or bytes) may be compared by the
// various comparion operators - lt, le, gt, ge, and between.
// Note that in particular, if the value is missing (v->IsNull()), this
// check returns false.
static bool check_comparable_type(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return false;
}
const rjson::value& type = v.MemberBegin()->name;
return type == "S" || type == "N" || type == "B";
}
// Check if two JSON-encoded values match with cmp.
template <typename Comparator>
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !check_comparable_type(*v1)) {
if (v1_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error::validation(
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
}
if (!check_comparable_type(v2)) {
if (v2_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error::validation(
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
}
if (bad) {
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name != kv2.name) {
return false;
}
@@ -388,8 +310,7 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
if (kv1.name == "B") {
return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
}
// cannot reach here, as check_comparable_type() verifies the type is one
// of the above options.
clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
return false;
}
@@ -420,71 +341,56 @@ struct cmp_gt {
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws or returns false
// (depending on bounds_from_query parameter) if lb > ub.
// True if v is between lb and ub, inclusive. Throws if lb > ub.
template <typename T>
static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
if (cmp_lt()(ub, lb)) {
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
bool v_from_query, bool lb_from_query, bool ub_from_query) {
if ((v && v_from_query && !check_comparable_type(*v)) ||
(lb_from_query && !check_comparable_type(lb)) ||
(ub_from_query && !check_comparable_type(ub))) {
throw api_error::validation("between allow only the types String, Number, or Binary");
}
if (!v || !v->IsObject() || v->MemberCount() != 1 ||
!lb.IsObject() || lb.MemberCount() != 1 ||
!ub.IsObject() || ub.MemberCount() != 1) {
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
bool bounds_from_query = lb_from_query && ub_from_query;
if (kv_lb.name != kv_ub.name) {
if (bounds_from_query) {
throw api_error::validation(
throw api_error::validation(
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
} else {
return false;
}
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
bounds_from_query);
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value), bounds_from_query);
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
}
if (v_from_query) {
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
} else {
return false;
}
}
// Verify one Expect condition on one attribute (whose content is "got")
@@ -531,19 +437,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NE(got, (*attribute_value_list)[0]);
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
case comparison_operator_type::IN:
verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
return check_IN(got, *attribute_value_list);
@@ -555,8 +461,7 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NOT_NULL(got);
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
false, true, true);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
case comparison_operator_type::CONTAINS:
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
@@ -668,8 +573,7 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
@@ -700,17 +604,13 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
cond._values[0].is_constant(), cond._values[1].is_constant());
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
cond._values[0].is_constant(), cond._values[1].is_constant());
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));

View File

@@ -52,7 +52,6 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,

View File

@@ -1881,8 +1881,7 @@ static std::string get_item_type_string(const rjson::value& v) {
// calculate_attrs_to_get() takes either AttributesToGet or
// ProjectionExpression parameters (having both is *not* allowed),
// and returns the list of cells we need to read, or an empty set when
// *all* attributes are to be returned.
// and returns the list of cells we need to read.
// In our current implementation, only top-level attributes are stored
// as cells, and nested documents are stored serialized as JSON.
// So this function currently returns only the the top-level attributes
@@ -2244,30 +2243,19 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value result;
// An ADD can be used to create a new attribute (when
// v1.IsNull()) or to add to a pre-existing attribute:
if (v1.IsNull()) {
std::string v2_type = get_item_type_string(v2);
if (v2_type == "N" || v2_type == "SS" || v2_type == "NS" || v2_type == "BS") {
result = v2;
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v2));
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v1));
}
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v1));
}
do_update(to_bytes(column_name), result);
},
@@ -2583,10 +2571,6 @@ public:
std::unordered_set<std::string>& used_attribute_values);
bool check(const rjson::value& item) const;
bool filters_on(std::string_view attribute) const;
// for_filters_on() runs the given function on the attributes that the
// filter works on. It may run for the same attribute more than once if
// used more than once in the filter.
void for_filters_on(const noncopyable_function<void(std::string_view)>& func) const;
operator bool() const { return bool(_imp); }
};
@@ -2667,26 +2651,10 @@ bool filter::filters_on(std::string_view attribute) const {
}, *_imp);
}
void filter::for_filters_on(const noncopyable_function<void(std::string_view)>& func) const {
if (_imp) {
std::visit(overloaded_functor {
[&] (const conditions_filter& f) -> void {
for (auto it = f.conditions.MemberBegin(); it != f.conditions.MemberEnd(); ++it) {
func(rjson::to_string_view(it->name));
}
},
[&] (const expression_filter& f) -> void {
return for_condition_expression_on(f.expression, func);
}
}, *_imp);
}
}
class describe_items_visitor {
typedef std::vector<const column_definition*> columns_t;
const columns_t& _columns;
const std::unordered_set<std::string>& _attrs_to_get;
std::unordered_set<std::string> _extra_filter_attrs;
const filter& _filter;
typename columns_t::const_iterator _column_it;
rjson::value _item;
@@ -2702,20 +2670,7 @@ public:
, _item(rjson::empty_object())
, _items(rjson::empty_array())
, _scanned_count(0)
{
// _filter.check() may need additional attributes not listed in
// _attrs_to_get (i.e., not requested as part of the output).
// We list those in _extra_filter_attrs. We will include them in
// the JSON but take them out before finally returning the JSON.
if (!_attrs_to_get.empty()) {
_filter.for_filters_on([&] (std::string_view attr) {
std::string a(attr); // no heterogenous maps searches :-(
if (!_attrs_to_get.contains(a)) {
_extra_filter_attrs.emplace(std::move(a));
}
});
}
}
{ }
void start_row() {
_column_it = _columns.begin();
@@ -2729,7 +2684,7 @@ public:
result_bytes_view->with_linearized([this] (bytes_view bv) {
std::string column_name = (*_column_it)->name_as_text();
if (column_name != executor::ATTRS_COLUMN_NAME) {
if (_attrs_to_get.empty() || _attrs_to_get.contains(column_name) || _extra_filter_attrs.contains(column_name)) {
if (_attrs_to_get.empty() || _attrs_to_get.contains(column_name)) {
if (!_item.HasMember(column_name.c_str())) {
rjson::set_with_string_name(_item, column_name, rjson::empty_object());
}
@@ -2741,7 +2696,7 @@ public:
auto keys_and_values = value_cast<map_type_impl::native_type>(deserialized);
for (auto entry : keys_and_values) {
std::string attr_name = value_cast<sstring>(entry.first);
if (_attrs_to_get.empty() || _attrs_to_get.contains(attr_name) || _extra_filter_attrs.contains(attr_name)) {
if (_attrs_to_get.empty() || _attrs_to_get.contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
rjson::set_with_string_name(_item, attr_name, deserialize_item(value));
}
@@ -2753,11 +2708,6 @@ public:
void end_row() {
if (_filter.check(_item)) {
// Remove the extra attributes _extra_filter_attrs which we had
// to add just for the filter, and not requested to be returned:
for (const auto& attr : _extra_filter_attrs) {
rjson::remove_member(_item, attr);
}
rjson::push_back(_items, std::move(_item));
}
_item = rjson::empty_object();
@@ -2792,7 +2742,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
for (const column_definition& cdef : schema.partition_key_columns()) {
rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_pk_it, cdef));
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_pk_it)));
++exploded_pk_it;
}
auto ck = paging_state.get_clustering_key();
@@ -2802,7 +2752,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
for (const column_definition& cdef : schema.clustering_key_columns()) {
rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_ck_it, cdef));
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_ck_it)));
++exploded_ck_it;
}
}

View File

@@ -348,39 +348,6 @@ bool condition_expression_on(const parsed::condition_expression& ce, std::string
}, ce._expression);
}
// for_condition_expression_on() runs a given function over all the attributes
// mentioned in the expression. If the same attribute is mentioned more than
// once, the function will be called more than once for the same attribute.
static void for_value_on(const parsed::value& v, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::constant& c) { },
[&] (const parsed::value::function_call& f) {
for (const parsed::value& value : f._parameters) {
for_value_on(value, func);
}
},
[&] (const parsed::path& p) {
func(p.root());
}
}, v._value);
}
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::primitive_condition& cond) {
for (const parsed::value& value : cond._values) {
for_value_on(value, func);
}
},
[&] (const parsed::condition_expression::condition_list& list) {
for (const parsed::condition_expression& cond : list.conditions) {
for_condition_expression_on(cond, func);
}
}
}, ce._expression);
}
// The following calculate_value() functions calculate, or evaluate, a parsed
// expression. The parsed expression is assumed to have been "resolved", with
// the matching resolve_* function.
@@ -603,8 +570,52 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1, v2,
f._parameters[0].is_constant(), f._parameters[1].is_constant()));
// TODO: There's duplication here with check_BEGINS_WITH().
// But unfortunately, the two functions differ a bit.
// If one of v1 or v2 is malformed or has an unsupported type
// (not B or S), what we do depends on whether it came from
// the user's query (is_constant()), or the item. Unsupported
// values in the query result in an error, but if they are in
// the item, we silently return false (no match).
bool bad = false;
if (!v1.IsObject() || v1.MemberCount() != 1) {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));
}
} else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));
}
}
bool ret = false;
if (!bad) {
auto it1 = v1.MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name == it2->name) {
if (it2->name == "S") {
std::string_view val1 = rjson::to_string_view(it1->value);
std::string_view val2 = rjson::to_string_view(it2->value);
ret = val1.starts_with(val2);
} else /* it2->name == "B" */ {
ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
}
}
}
return to_bool_json(ret);
}
},
{"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {

View File

@@ -27,8 +27,6 @@
#include <unordered_set>
#include <string_view>
#include <seastar/util/noncopyable_function.hh>
#include "expressions_types.hh"
#include "utils/rjson.hh"
@@ -61,11 +59,6 @@ void validate_value(const rjson::value& v, const char* caller);
bool condition_expression_on(const parsed::condition_expression& ce, std::string_view attribute);
// for_condition_expression_on() runs the given function on the attributes
// that the expression uses. It may run for the same attribute more than once
// if the same attribute is used more than once in the expression.
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func);
// calculate_value() behaves slightly different (especially, different
// functions supported) when used in different types of expressions, as
// enumerated in this enum:

View File

@@ -475,6 +475,8 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
status = "ENABLED";
}
}
auto ttl = std::chrono::seconds(opts.ttl());
rjson::set(stream_desc, "StreamStatus", rjson::from_string(status));
@@ -498,10 +500,10 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// cannot really "resume" query, must iterate all data. because we cannot query neither "time" (pk) > something,
// or on expired...
// TODO: maybe add secondary index to topology table to enable this?
return _sdks.cdc_get_versioned_streams({ tm.count_normal_token_owners() }).then([this, &db, schema, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)](std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
return _sdks.cdc_get_versioned_streams({ tm.count_normal_token_owners() }).then([this, &db, schema, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc), ttl](std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
// filter out cdc generations older than the table or now() - dynamodb_streams_max_window (24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - dynamodb_streams_max_window);
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
auto i = topologies.lower_bound(low_ts);
// need first gen _intersecting_ the timestamp.
@@ -849,7 +851,6 @@ future<executor::request_return_type> executor::get_records(client_state& client
static const bytes timestamp_column_name = cdc::log_meta_column_name_bytes("time");
static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");
static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");
auto key_names = boost::copy_range<std::unordered_set<std::string>>(
boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))
@@ -873,7 +874,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
std::transform(cks.begin(), cks.end(), std::back_inserter(columns), [](auto& c) { return &c; });
auto regular_columns = boost::copy_range<query::column_id_vector>(schema->regular_columns()
| boost::adaptors::filtered([](const column_definition& cdef) { return cdef.name() == op_column_name || cdef.name() == eor_column_name || !cdc::is_cdc_metacolumn_name(cdef.name_as_text()); })
| boost::adaptors::filtered([](const column_definition& cdef) { return cdef.name() == op_column_name || !cdc::is_cdc_metacolumn_name(cdef.name_as_text()); })
| boost::adaptors::transformed([&] (const column_definition& cdef) { columns.emplace_back(&cdef); return cdef.id; })
);
@@ -906,11 +907,6 @@ future<executor::request_return_type> executor::get_records(client_state& client
return cdef->name->name() == timestamp_column_name;
})
);
auto eor_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == eor_column_name;
})
);
std::optional<utils::UUID> timestamp;
auto dynamodb = rjson::empty_object();
@@ -936,7 +932,15 @@ future<executor::request_return_type> executor::get_records(client_state& client
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
if (timestamp && timestamp != ts) {
maybe_add_record();
if (limit == 0) {
break;
}
}
timestamp = ts;
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
@@ -989,13 +993,9 @@ future<executor::request_return_type> executor::get_records(client_state& client
rjson::set(record, "eventName", "REMOVE");
break;
}
if (eor) {
maybe_add_record();
timestamp = ts;
if (limit == 0) {
break;
}
}
}
if (limit > 0 && timestamp) {
maybe_add_record();
}
auto ret = rjson::empty_object();
@@ -1049,9 +1049,6 @@ void executor::add_stream_options(const rjson::value& stream_specification, sche
if (!db.features().cluster_supports_cdc()) {
throw api_error::validation("StreamSpecification: streams (CDC) feature not enabled in cluster.");
}
if (!db.features().cluster_supports_alternator_streams()) {
throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");
}
cdc::options opts;
opts.enabled(true);

View File

@@ -331,15 +331,15 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], 0, [](column_family& cf) {
return cf.active_memtable().partition_count();
}, std::plus<>());
}, std::plus<int>());
});
cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t{0}, [](column_family& cf) {
return map_reduce_cf(ctx, 0, [](column_family& cf) {
return cf.active_memtable().partition_count();
}, std::plus<>());
}, std::plus<int>());
});
cf::get_memtable_on_heap_size.set(r, [] (const_req req) {
@@ -656,7 +656,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
return sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -664,7 +664,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
return sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -672,7 +672,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
return sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -680,7 +680,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
return sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -688,7 +688,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
return sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
@@ -696,7 +696,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
return sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});

View File

@@ -20,16 +20,10 @@
#pragma once
#include <map>
#include <seastar/core/sstring.hh>
#include "bytes.hh"
#include "serializer.hh"
#include "db/extensions.hh"
#include "cdc/cdc_options.hh"
#include "schema.hh"
#include "serializer_impl.hh"
namespace cdc {

View File

@@ -23,7 +23,6 @@
#include <random>
#include <unordered_set>
#include <seastar/core/sleep.hh>
#include <algorithm>
#include "keys.hh"
#include "schema_builder.hh"
@@ -175,29 +174,10 @@ bool topology_description::operator==(const topology_description& o) const {
return _entries == o._entries;
}
const std::vector<token_range_description>& topology_description::entries() const& {
const std::vector<token_range_description>& topology_description::entries() const {
return _entries;
}
std::vector<token_range_description>&& topology_description::entries() && {
return std::move(_entries);
}
static std::vector<stream_id> create_stream_ids(
size_t index, dht::token start, dht::token end, size_t shard_count, uint8_t ignore_msb) {
std::vector<stream_id> result;
result.reserve(shard_count);
dht::sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
// compose the id from token and the "index" of the range end owning vnode
// as defined by token sort order. Basically grouping within this
// shard set.
result.emplace_back(stream_id(t, index));
}
return result;
}
class topology_description_generator final {
const db::config& _cfg;
const std::unordered_set<dht::token>& _bootstrap_tokens;
@@ -237,9 +217,18 @@ class topology_description_generator final {
desc.token_range_end = end;
auto [shard_count, ignore_msb] = get_sharding_info(end);
desc.streams = create_stream_ids(index, start, end, shard_count, ignore_msb);
desc.streams.reserve(shard_count);
desc.sharding_ignore_msb = ignore_msb;
dht::sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
// compose the id from token and the "index" of the range end owning vnode
// as defined by token sort order. Basically grouping within this
// shard set.
desc.streams.emplace_back(stream_id(t, index));
}
return desc;
}
public:
@@ -305,38 +294,6 @@ future<db_clock::time_point> get_local_streams_timestamp() {
});
}
// non-static for testing
size_t limit_of_streams_in_topology_description() {
// Each stream takes 16B and we don't want to exceed 4MB so we can have
// at most 262144 streams but not less than 1 per vnode.
return 4 * 1024 * 1024 / 16;
}
// non-static for testing
topology_description limit_number_of_streams_if_needed(topology_description&& desc) {
int64_t streams_count = 0;
for (auto& tr_desc : desc.entries()) {
streams_count += tr_desc.streams.size();
}
size_t limit = std::max(limit_of_streams_in_topology_description(), desc.entries().size());
if (limit >= size_t(streams_count)) {
return std::move(desc);
}
size_t streams_per_vnode_limit = limit / desc.entries().size();
auto entries = std::move(desc).entries();
auto start = entries.back().token_range_end;
for (size_t idx = 0; idx < entries.size(); ++idx) {
auto end = entries[idx].token_range_end;
if (entries[idx].streams.size() > streams_per_vnode_limit) {
entries[idx].streams =
create_stream_ids(idx, start, end, streams_per_vnode_limit, entries[idx].sharding_ignore_msb);
}
start = end;
}
return topology_description(std::move(entries));
}
// Run inside seastar::async context.
db_clock::time_point make_new_cdc_generation(
const db::config& cfg,
@@ -349,18 +306,6 @@ db_clock::time_point make_new_cdc_generation(
using namespace std::chrono;
auto gen = topology_description_generator(cfg, bootstrap_tokens, tm, g).generate();
// If the cluster is large we may end up with a generation that contains
// large number of streams. This is problematic because we store the
// generation in a single row. For a generation with large number of rows
// this will lead to a row that can be as big as 32MB. This is much more
// than the limit imposed by commitlog_segment_size_in_mb. If the size of
// the row that describes a new generation grows above
// commitlog_segment_size_in_mb, the write will fail and the new node won't
// be able to join. To avoid such problem we make sure that such row is
// always smaller than 4MB. We do that by removing some CDC streams from
// each vnode if the total number of streams is too large.
gen = limit_number_of_streams_if_needed(std::move(gen));
// Begin the race.
auto ts = db_clock::now() + (
(for_testing || ring_delay == milliseconds(0)) ? milliseconds(0) : (

View File

@@ -68,7 +68,6 @@ public:
stream_id() = default;
stream_id(bytes);
stream_id(dht::token, size_t);
bool is_set() const;
bool operator==(const stream_id&) const;
@@ -82,6 +81,9 @@ public:
partition_key to_partition_key(const schema& log_schema) const;
static int64_t token_from_bytes(bytes_view);
private:
friend class topology_description_generator;
stream_id(dht::token, size_t);
};
/* Describes a mapping of tokens to CDC streams in a token range.
@@ -114,8 +116,7 @@ public:
topology_description(std::vector<token_range_description> entries);
bool operator==(const topology_description&) const;
const std::vector<token_range_description>& entries() const&;
std::vector<token_range_description>&& entries() &&;
const std::vector<token_range_description>& entries() const;
};
/**
@@ -153,7 +154,7 @@ bool should_propose_first_generation(const gms::inet_address& me, const gms::gos
future<db_clock::time_point> get_local_streams_timestamp();
/* Generate a new set of CDC streams and insert it into the distributed cdc_generation_descriptions table.
* Returns the timestamp of this new generation
* Returns the timestamp of this new generation.
*
* Should be called when starting the node for the first time (i.e., joining the ring).
*

View File

@@ -980,9 +980,9 @@ static bytes get_bytes(const atomic_cell_view& acv) {
return acv.value().linearize();
}
static bytes_view get_bytes_view(const atomic_cell_view& acv, std::forward_list<bytes>& buf) {
static bytes_view get_bytes_view(const atomic_cell_view& acv, std::vector<bytes>& buf) {
return acv.value().is_fragmented()
? bytes_view{buf.emplace_front(acv.value().linearize())}
? bytes_view{buf.emplace_back(acv.value().linearize())}
: acv.value().first_fragment();
}
@@ -1137,9 +1137,9 @@ struct process_row_visitor {
struct udt_visitor : public collection_visitor {
std::vector<bytes_opt> _added_cells;
std::forward_list<bytes>& _buf;
std::vector<bytes>& _buf;
udt_visitor(ttl_opt& ttl_column, size_t num_keys, std::forward_list<bytes>& buf)
udt_visitor(ttl_opt& ttl_column, size_t num_keys, std::vector<bytes>& buf)
: collection_visitor(ttl_column), _added_cells(num_keys), _buf(buf) {}
void live_collection_cell(bytes_view key, const atomic_cell_view& cell) {
@@ -1148,7 +1148,7 @@ struct process_row_visitor {
}
};
std::forward_list<bytes> buf;
std::vector<bytes> buf;
udt_visitor v(_ttl_column, type.size(), buf);
visit_collection(v);
@@ -1167,9 +1167,9 @@ struct process_row_visitor {
struct map_or_list_visitor : public collection_visitor {
std::vector<std::pair<bytes_view, bytes_view>> _added_cells;
std::forward_list<bytes>& _buf;
std::vector<bytes>& _buf;
map_or_list_visitor(ttl_opt& ttl_column, std::forward_list<bytes>& buf)
map_or_list_visitor(ttl_opt& ttl_column, std::vector<bytes>& buf)
: collection_visitor(ttl_column), _buf(buf) {}
void live_collection_cell(bytes_view key, const atomic_cell_view& cell) {
@@ -1178,7 +1178,7 @@ struct process_row_visitor {
}
};
std::forward_list<bytes> buf;
std::vector<bytes> buf;
map_or_list_visitor v(_ttl_column, buf);
visit_collection(v);
@@ -1290,13 +1290,6 @@ struct process_change_visitor {
_clustering_row_states, _generate_delta_values);
visit_row_cells(v);
if (_enable_updating_state) {
// #7716: if there are no regular columns, our visitor would not have visited any cells,
// hence it would not have created a row_state for this row. In effect, postimage wouldn't be produced.
// Ensure that the row state exists.
_clustering_row_states.try_emplace(ckey);
}
_builder.set_operation(log_ck, v._cdc_op);
_builder.set_ttl(log_ck, v._ttl_column);
}

View File

@@ -51,8 +51,7 @@ static cdc::stream_id get_stream(
return entry.streams[shard_id];
}
// non-static for testing
cdc::stream_id get_stream(
static cdc::stream_id get_stream(
const std::vector<cdc::token_range_description>& entries,
dht::token tok) {
if (entries.empty()) {

View File

@@ -257,25 +257,24 @@ modes = {
'stack-usage-threshold': 1024*40,
},
'release': {
'cxxflags': '',
'cxx_ld_flags': '-O3 -ffunction-sections -fdata-sections -Wl,--gc-sections',
'cxxflags': '-O3 -ffunction-sections -fdata-sections ',
'cxx_ld_flags': '-Wl,--gc-sections',
'stack-usage-threshold': 1024*13,
},
'dev': {
'cxxflags': '-DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '-O1',
'cxxflags': '-O1 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*21,
},
'sanitize': {
'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '-Os',
'cxxflags': '-Os -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*50,
}
}
scylla_tests = set([
'test/boost/UUID_test',
'test/boost/cdc_generation_test',
'test/boost/aggregate_fcts_test',
'test/boost/allocation_strategy_test',
'test/boost/alternator_base64_test',
@@ -855,7 +854,6 @@ scylla_core = (['database.cc',
'utils/error_injection.cc',
'mutation_writer/timestamp_based_splitting_writer.cc',
'mutation_writer/shard_based_splitting_writer.cc',
'mutation_writer/feed_writers.cc',
'lua.cc',
] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')]
)
@@ -1169,11 +1167,11 @@ optimization_flags = [
optimization_flags = [o
for o in optimization_flags
if flag_supported(flag=o, compiler=args.cxx)]
modes['release']['cxx_ld_flags'] += ' ' + ' '.join(optimization_flags)
modes['release']['cxxflags'] += ' ' + ' '.join(optimization_flags)
if flag_supported(flag='-Wstack-usage=4096', compiler=args.cxx):
for mode in modes:
modes[mode]['cxx_ld_flags'] += f' -Wstack-usage={modes[mode]["stack-usage-threshold"]} -Wno-error=stack-usage='
modes[mode]['cxxflags'] += f' -Wstack-usage={modes[mode]["stack-usage-threshold"]} -Wno-error=stack-usage='
linker_flags = linker_flags(compiler=args.cxx)
@@ -1340,6 +1338,13 @@ libdeflate_cflags = seastar_cflags
MODE_TO_CMAKE_BUILD_TYPE = {'release' : 'RelWithDebInfo', 'debug' : 'Debug', 'dev' : 'Dev', 'sanitize' : 'Sanitize' }
# cmake likes to separate things with semicolons
def semicolon_separated(*flags):
# original flags may be space separated, so convert to string still
# using spaces
f = ' '.join(flags)
return re.sub(' +', ';', f)
def configure_seastar(build_dir, mode):
seastar_build_dir = os.path.join(build_dir, mode, 'seastar')
@@ -1348,8 +1353,8 @@ def configure_seastar(build_dir, mode):
'-DCMAKE_C_COMPILER={}'.format(args.cc),
'-DCMAKE_CXX_COMPILER={}'.format(args.cxx),
'-DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON',
'-DSeastar_CXX_FLAGS={}'.format((seastar_cflags + ' ' + modes[mode]['cxx_ld_flags']).replace(' ', ';')),
'-DSeastar_LD_FLAGS={}'.format(seastar_ldflags),
'-DSeastar_CXX_FLAGS={}'.format((seastar_cflags).replace(' ', ';')),
'-DSeastar_LD_FLAGS={}'.format(semicolon_separated(seastar_ldflags, modes[mode]['cxx_ld_flags'])),
'-DSeastar_CXX_DIALECT=gnu++20',
'-DSeastar_API_LEVEL=6',
'-DSeastar_UNUSED_RESULT_ERROR=ON',

View File

@@ -20,44 +20,47 @@
*/
#include "connection_notifier.hh"
#include "db/query_context.hh"
#include "cql3/constants.hh"
#include "database.hh"
#include "service/storage_proxy.hh"
#include <stdexcept>
namespace db::system_keyspace {
extern const char *const CLIENTS;
}
static sstring to_string(client_type ct) {
sstring to_string(client_type ct) {
switch (ct) {
case client_type::cql: return "cql";
case client_type::thrift: return "thrift";
case client_type::alternator: return "alternator";
default: throw std::runtime_error("Invalid client_type");
}
throw std::runtime_error("Invalid client_type");
}
static sstring to_string(client_connection_stage ccs) {
switch (ccs) {
case client_connection_stage::established: return connection_stage_literal<client_connection_stage::established>;
case client_connection_stage::authenticating: return connection_stage_literal<client_connection_stage::authenticating>;
case client_connection_stage::ready: return connection_stage_literal<client_connection_stage::ready>;
}
throw std::runtime_error("Invalid client_connection_stage");
}
future<> notify_new_client(client_data cd) {
// FIXME: consider prepared statement
const static sstring req
= format("INSERT INTO system.{} (address, port, client_type, shard_id, protocol_version, username) "
"VALUES (?, ?, ?, ?, ?, ?);", db::system_keyspace::CLIENTS);
= format("INSERT INTO system.{} (address, port, client_type, connection_stage, shard_id, protocol_version, username) "
"VALUES (?, ?, ?, ?, ?, ?, ?);", db::system_keyspace::CLIENTS);
return db::execute_cql(req,
std::move(cd.ip), cd.port, to_string(cd.ct), cd.shard_id,
std::move(cd.ip), cd.port, to_string(cd.ct), to_string(cd.connection_stage), cd.shard_id,
cd.protocol_version.has_value() ? data_value(*cd.protocol_version) : data_value::make_null(int32_type),
cd.username.value_or("anonymous")).discard_result();
}
future<> notify_disconnected_client(gms::inet_address addr, client_type ct, int port) {
future<> notify_disconnected_client(net::inet_address addr, int port, client_type ct) {
// FIXME: consider prepared statement
const static sstring req
= format("DELETE FROM system.{} where address=? AND port=? AND client_type=?;",
db::system_keyspace::CLIENTS);
return db::execute_cql(req, addr.addr(), port, to_string(ct)).discard_result();
return db::execute_cql(req, std::move(addr), port, to_string(ct)).discard_result();
}
future<> clear_clientlist() {

View File

@@ -20,27 +20,65 @@
*/
#pragma once
#include "gms/inet_address.hh"
#include "db/query_context.hh"
#include <seastar/net/inet_address.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include <optional>
namespace db::system_keyspace {
extern const char *const CLIENTS;
}
enum class client_type {
cql = 0,
thrift,
alternator,
};
sstring to_string(client_type ct);
enum class changed_column {
username = 0,
connection_stage,
driver_name,
driver_version,
hostname,
protocol_version,
};
template <changed_column column> constexpr const char* column_literal = "";
template <> constexpr const char* column_literal<changed_column::username> = "username";
template <> constexpr const char* column_literal<changed_column::connection_stage> = "connection_stage";
template <> constexpr const char* column_literal<changed_column::driver_name> = "driver_name";
template <> constexpr const char* column_literal<changed_column::driver_version> = "driver_version";
template <> constexpr const char* column_literal<changed_column::hostname> = "hostname";
template <> constexpr const char* column_literal<changed_column::protocol_version> = "protocol_version";
enum class client_connection_stage {
established = 0,
authenticating,
ready,
};
template <client_connection_stage ccs> constexpr const char* connection_stage_literal = "";
template <> constexpr const char* connection_stage_literal<client_connection_stage::established> = "ESTABLISHED";
template <> constexpr const char* connection_stage_literal<client_connection_stage::authenticating> = "AUTHENTICATING";
template <> constexpr const char* connection_stage_literal<client_connection_stage::ready> = "READY";
// Representation of a row in `system.clients'. std::optionals are for nullable cells.
struct client_data {
gms::inet_address ip;
net::inet_address ip;
int32_t port;
client_type ct;
client_connection_stage connection_stage = client_connection_stage::established;
int32_t shard_id; /// ID of server-side shard which is processing the connection.
// `optional' column means that it's nullable (possibly because it's
// unimplemented yet). If you want to fill ("implement") any of them,
// remember to update the query in `notify_new_client()'.
std::optional<sstring> connection_stage;
std::optional<sstring> driver_name;
std::optional<sstring> driver_version;
std::optional<sstring> hostname;
@@ -52,6 +90,17 @@ struct client_data {
};
future<> notify_new_client(client_data cd);
future<> notify_disconnected_client(gms::inet_address addr, client_type ct, int port);
future<> notify_disconnected_client(net::inet_address addr, int port, client_type ct);
future<> clear_clientlist();
template <changed_column column_enum_val>
struct notify_client_change {
template <typename T>
future<> operator()(net::inet_address addr, int port, client_type ct, T&& value) {
const static sstring req
= format("UPDATE system.{} SET {}=? WHERE address=? AND port=? AND client_type=?;",
db::system_keyspace::CLIENTS, column_literal<column_enum_val>);
return db::execute_cql(req, std::forward<T>(value), std::move(addr), port, to_string(ct)).discard_result();
}
};

View File

@@ -192,12 +192,9 @@ public:
virtual ::shared_ptr<terminal> bind(const query_options& options) override {
auto bytes = bind_and_get(options);
if (bytes.is_null()) {
if (!bytes) {
return ::shared_ptr<terminal>{};
}
if (bytes.is_unset_value()) {
return UNSET_VALUE;
}
return ::make_shared<constants::value>(std::move(cql3::raw_value::make_value(to_bytes(*bytes))));
}
};

View File

@@ -27,9 +27,7 @@
#include <fmt/ostream.h>
#include <unordered_map>
#include "cql3/constants.hh"
#include "cql3/lists.hh"
#include "cql3/statements/request_validations.hh"
#include "cql3/tuples.hh"
#include "index/secondary_index_manager.hh"
#include "types/list.hh"
@@ -419,8 +417,6 @@ bool is_one_of(const column_value& col, term& rhs, const column_value_eval_bag&
} else if (auto mkr = dynamic_cast<lists::marker*>(&rhs)) {
// This is `a IN ?`. RHS elements are values representable as bytes_opt.
const auto values = static_pointer_cast<lists::value>(mkr->bind(bag.options));
statements::request_validations::check_not_null(
values, "Invalid null value for column %s", col.col->name_as_text());
return boost::algorithm::any_of(values->get_elements(), [&] (const bytes_opt& b) {
return equal(b, col, bag);
});
@@ -572,8 +568,7 @@ const auto deref = boost::adaptors::transformed([] (const bytes_opt& b) { return
/// Returns possible values from t, which must be RHS of IN.
value_list get_IN_values(
const ::shared_ptr<term>& t, const query_options& options, const serialized_compare& comparator,
sstring_view column_name) {
const ::shared_ptr<term>& t, const query_options& options, const serialized_compare& comparator) {
// RHS is prepared differently for different CQL cases. Cast it dynamically to discern which case this is.
if (auto dv = dynamic_pointer_cast<lists::delayed_value>(t)) {
// Case `a IN (1,2,3)`.
@@ -583,12 +578,8 @@ value_list get_IN_values(
return to_sorted_vector(std::move(result_range), comparator);
} else if (auto mkr = dynamic_pointer_cast<lists::marker>(t)) {
// Case `a IN ?`. Collect all list-element values.
const auto val = mkr->bind(options);
if (val == constants::UNSET_VALUE) {
throw exceptions::invalid_request_exception(format("Invalid unset value for column {}", column_name));
}
statements::request_validations::check_not_null(val, "Invalid null value for IN tuple");
return to_sorted_vector(static_pointer_cast<lists::value>(val)->get_elements() | non_null | deref, comparator);
const auto val = static_pointer_cast<lists::value>(mkr->bind(options));
return to_sorted_vector(val->get_elements() | non_null | deref, comparator);
}
throw std::logic_error(format("get_IN_values(single column) on invalid term {}", *t));
}
@@ -695,7 +686,7 @@ value_set possible_lhs_values(const column_definition* cdef, const expression& e
return oper.op == oper_t::EQ ? value_set(value_list{*val})
: to_range(oper.op, *val);
} else if (oper.op == oper_t::IN) {
return get_IN_values(oper.rhs, options, type->as_less_comparator(), cdef->name_as_text());
return get_IN_values(oper.rhs, options, type->as_less_comparator());
}
throw std::logic_error(format("possible_lhs_values: unhandled operator {}", oper));
},

View File

@@ -305,12 +305,6 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
assert(column.type->is_multi_cell()); // "Attempted to set a value for a single key on a frozen map"m
auto key = _k->bind_and_get(params._options);
auto value = _t->bind_and_get(params._options);
if (value.is_unset_value()) {
return;
}
if (key.is_unset_value() || value.is_unset_value()) {
throw invalid_request_exception("Invalid unset map key");
}
if (!key) {
throw invalid_request_exception("Invalid null map key");
}

View File

@@ -315,7 +315,7 @@ sets::discarder::execute(mutation& m, const clustering_key_prefix& row_key, cons
assert(column.type->is_multi_cell()); // "Attempted to remove items from a frozen set";
auto&& value = _t->bind(params._options);
if (!value || value == constants::UNSET_VALUE) {
if (!value) {
return;
}

View File

@@ -306,13 +306,6 @@ create_index_statement::announce_migration(service::storage_proxy& proxy, bool i
format("Index {} is a duplicate of existing index {}", index.name(), existing_index.value().name()));
}
}
auto index_table_name = secondary_index::index_table_name(accepted_name);
if (db.has_schema(keyspace(), index_table_name)) {
return make_exception_future<::shared_ptr<cql_transport::event::schema_change>>(
exceptions::invalid_request_exception(format("Index {} cannot be created, because table {} already exists",
accepted_name, index_table_name))
);
}
++_cql_stats->secondary_index_creates;
schema_builder builder{schema};
builder.with_index(index);

View File

@@ -204,6 +204,7 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
}
_properties.validate(db, _properties.properties()->make_schema_extensions(db.extensions()));
const bool has_default_ttl = _properties.properties()->get_default_time_to_live() > 0;
auto stmt = ::make_shared<create_table_statement>(_cf_name, _properties.properties(), _if_not_exists, _static_columns, _properties.properties()->get_id());
@@ -211,6 +212,11 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
for (auto&& entry : _definitions) {
::shared_ptr<column_identifier> id = entry.first;
cql3_type pt = entry.second->prepare(db, keyspace());
if (has_default_ttl && pt.is_counter()) {
throw exceptions::invalid_request_exception("Cannot set default_time_to_live on a table with counters");
}
if (pt.get_type()->is_multi_cell()) {
if (pt.get_type()->is_user_type()) {
// check for multi-cell types (non-frozen UDTs or collections) inside a non-frozen UDT

View File

@@ -59,7 +59,6 @@
#include "db/timeout_clock.hh"
#include "db/consistency_level_validations.hh"
#include "database.hh"
#include "test/lib/select_statement_utils.hh"
#include <boost/algorithm/cxx11/any_of.hpp>
bool is_system_keyspace(const sstring& name);
@@ -68,8 +67,6 @@ namespace cql3 {
namespace statements {
static constexpr int DEFAULT_INTERNAL_PAGING_SIZE = select_statement::DEFAULT_COUNT_PAGE_SIZE;
thread_local int internal_paging_size = DEFAULT_INTERNAL_PAGING_SIZE;
thread_local const lw_shared_ptr<const select_statement::parameters> select_statement::_default_parameters = make_lw_shared<select_statement::parameters>();
select_statement::parameters::parameters()
@@ -341,7 +338,7 @@ select_statement::do_execute(service::storage_proxy& proxy,
const bool aggregate = _selection->is_aggregate() || has_group_by();
const bool nonpaged_filtering = restrictions_need_filtering && page_size <= 0;
if (aggregate || nonpaged_filtering) {
page_size = internal_paging_size;
page_size = DEFAULT_COUNT_PAGE_SIZE;
}
auto key_ranges = _restrictions->get_partition_key_ranges(options);
@@ -456,7 +453,7 @@ generate_base_key_from_index_pk(const partition_key& index_pk, const std::option
if (!view_col) {
throw std::runtime_error(format("Base key column not found in the view: {}", base_col.name_as_text()));
}
if (base_col.type->without_reversed() != *view_col->type) {
if (base_col.type != view_col->type) {
throw std::runtime_error(format("Mismatched types for base and view columns {}: {} and {}",
base_col.name_as_text(), base_col.type->cql3_type_name(), view_col->type->cql3_type_name()));
}
@@ -544,29 +541,13 @@ indexed_table_select_statement::do_execute_base_query(
if (old_paging_state && concurrency == 1) {
auto base_pk = generate_base_key_from_index_pk<partition_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
auto row_ranges = command->slice.default_row_ranges();
if (old_paging_state->get_clustering_key() && _schema->clustering_key_size() > 0) {
auto base_ck = generate_base_key_from_index_pk<clustering_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
query::trim_clustering_row_ranges_to(*_schema, row_ranges, base_ck, false);
command->slice.set_range(*_schema, base_pk, row_ranges);
command->slice.set_range(*_schema, base_pk,
std::vector<query::clustering_range>{query::clustering_range::make_starting_with(range_bound<clustering_key>(base_ck, false))});
} else {
// There is no clustering key in old_paging_state and/or no clustering key in
// _schema, therefore read an entire partition (whole clustering range).
//
// The only exception to applying no restrictions on clustering key
// is a case when we have a secondary index on the first column
// of clustering key. In such a case we should not read the
// entire clustering range - only a range in which first column
// of clustering key has the correct value.
//
// This means that we should not set a open_ended_both_sides
// clustering range on base_pk, instead intersect it with
// _row_ranges (which contains the restrictions neccessary for the
// case described above). The result of such intersection is just
// _row_ranges, which we explicity set on base_pk.
command->slice.set_range(*_schema, base_pk, row_ranges);
command->slice.set_range(*_schema, base_pk, std::vector<query::clustering_range>{query::clustering_range::make_open_ended_both_sides()});
}
}
concurrency *= 2;
@@ -1011,16 +992,12 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
const bool aggregate = _selection->is_aggregate() || has_group_by();
if (aggregate) {
const bool restrictions_need_filtering = _restrictions->need_filtering();
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format(), *_group_by_cell_indices), std::make_unique<cql3::query_options>(cql3::query_options(options)),
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format()), std::make_unique<cql3::query_options>(cql3::query_options(options)),
[this, &options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] (cql3::selection::result_set_builder& builder, std::unique_ptr<cql3::query_options>& internal_options) {
// page size is set to the internal count page size, regardless of the user-provided value
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), internal_paging_size));
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), DEFAULT_COUNT_PAGE_SIZE));
return repeat([this, &builder, &options, &internal_options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] () {
auto consume_results = [this, &builder, &options, &internal_options, &proxy, &state, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd, lw_shared_ptr<const service::pager::paging_state> paging_state) {
if (paging_state) {
paging_state = generate_view_paging_state_from_base_query_results(paging_state, results, proxy, state, options);
}
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
auto consume_results = [this, &builder, &options, &internal_options, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
if (restrictions_need_filtering) {
_stats.filtered_rows_read_total += *results->row_count();
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection,
@@ -1028,24 +1005,24 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
} else {
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection));
}
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
return stop_iteration(!has_more_pages);
};
if (whole_partitions || partition_slices) {
return find_index_partition_ranges(proxy, state, *internal_options).then_unpack(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (dht::partition_range_vector partition_ranges, lw_shared_ptr<const service::pager::paging_state> paging_state) {
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, paging_state)
.then_unpack([paging_state, consume_results = std::move(consume_results)](foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
return consume_results(std::move(results), std::move(cmd), std::move(paging_state));
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, std::move(paging_state)).then_unpack(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
});
});
} else {
return find_index_clustering_rows(proxy, state, *internal_options).then_unpack(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (std::vector<primary_key> primary_keys, lw_shared_ptr<const service::pager::paging_state> paging_state) {
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, paging_state)
.then_unpack([paging_state, consume_results = std::move(consume_results)](foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
return consume_results(std::move(results), std::move(cmd), std::move(paging_state));
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, std::move(paging_state)).then_unpack(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
});
});
}
@@ -1120,11 +1097,7 @@ query::partition_slice indexed_table_select_statement::get_partition_slice_for_g
if (single_ck_restrictions) {
auto prefix_restrictions = single_ck_restrictions->get_longest_prefix_restrictions();
auto clustering_restrictions_from_base = ::make_shared<restrictions::single_column_clustering_key_restrictions>(_view_schema, *prefix_restrictions);
const auto indexed_column = _view_schema->get_column_definition(to_bytes(_index.target_column()));
for (auto restriction_it : clustering_restrictions_from_base->restrictions()) {
if (restriction_it.first == indexed_column) {
continue; // In the index table, the indexed column is the partition (not clustering) key.
}
clustering_restrictions->merge_with(restriction_it.second);
}
}
@@ -1714,16 +1687,6 @@ std::vector<size_t> select_statement::prepare_group_by(const schema& schema, sel
}
future<> set_internal_paging_size(int paging_size) {
return seastar::smp::invoke_on_all([paging_size] {
internal_paging_size = paging_size;
});
}
future<> reset_internal_paging_size() {
return set_internal_paging_size(DEFAULT_INTERNAL_PAGING_SIZE);
}
}
namespace util {

View File

@@ -572,6 +572,9 @@ void database::set_format_by_config() {
}
database::~database() {
_read_concurrency_sem.clear_inactive_reads();
_streaming_concurrency_sem.clear_inactive_reads();
_system_read_concurrency_sem.clear_inactive_reads();
}
void database::update_version(const utils::UUID& version) {
@@ -659,22 +662,11 @@ future<> database::parse_system_tables(distributed<service::storage_proxy>& prox
});
}).then([&proxy, &mm, this] {
return do_parse_schema_tables(proxy, db::schema_tables::VIEWS, [this, &proxy, &mm] (schema_result_value_type &v) {
return create_views_from_schema_partition(proxy, v.second).then([this, &mm, &proxy] (std::vector<view_ptr> views) {
return parallel_for_each(views.begin(), views.end(), [this, &mm, &proxy] (auto&& v) {
// TODO: Remove once computed columns are guaranteed to be featured in the whole cluster.
// we fix here the schema in place in oreder to avoid races (write commands comming from other coordinators).
view_ptr fixed_v = maybe_fix_legacy_secondary_index_mv_schema(*this, v, nullptr, preserve_version::yes);
view_ptr v_to_add = fixed_v ? fixed_v : v;
future<> f = this->add_column_family_and_make_directory(v_to_add);
if (bool(fixed_v)) {
v_to_add = fixed_v;
auto&& keyspace = find_keyspace(v->ks_name()).metadata();
auto mutations = db::schema_tables::make_update_view_mutations(keyspace, view_ptr(v), fixed_v, api::new_timestamp(), true);
f = f.then([this, &proxy, mutations = std::move(mutations)] {
return db::schema_tables::merge_schema(proxy, _feat, std::move(mutations));
});
}
return f;
return create_views_from_schema_partition(proxy, v.second).then([this, &mm] (std::vector<view_ptr> views) {
return parallel_for_each(views.begin(), views.end(), [this, &mm] (auto&& v) {
return this->add_column_family_and_make_directory(v).then([this, &mm, v] {
return maybe_update_legacy_secondary_index_mv_schema(mm.local(), *this, v);
});
});
});
});
@@ -809,7 +801,7 @@ future<> database::drop_column_family(const sstring& ks_name, const sstring& cf_
remove(*cf);
cf->clear_views();
auto& ks = find_keyspace(ks_name);
return cf->await_pending_ops().then([this, &ks, cf, tsf = std::move(tsf), snapshot] {
return when_all_succeed(cf->await_pending_writes(), cf->await_pending_reads()).then_unpack([this, &ks, cf, tsf = std::move(tsf), snapshot] {
return truncate(ks, *cf, std::move(tsf), snapshot).finally([this, cf] {
return cf->stop();
});
@@ -1751,11 +1743,7 @@ sstring database::get_available_index_name(const sstring &ks_name, const sstring
auto base_name = index_metadata::get_default_index_name(cf_name, index_name_root);
sstring accepted_name = base_name;
int i = 0;
auto name_accepted = [&] {
auto index_table_name = secondary_index::index_table_name(accepted_name);
return !has_schema(ks_name, index_table_name) && !existing_names.contains(accepted_name);
};
while (!name_accepted()) {
while (existing_names.contains(accepted_name)) {
accepted_name = base_name + "_" + std::to_string(++i);
}
return accepted_name;
@@ -1820,13 +1808,6 @@ future<>
database::stop() {
assert(!_large_data_handler->running());
// Inactive reads might hold on to sstables, blocking the
// `sstables_manager::close()` calls below. No one will come back for these
// reads at this point so clear them before proceeding with the shutdown.
_read_concurrency_sem.clear_inactive_reads();
_streaming_concurrency_sem.clear_inactive_reads();
_system_read_concurrency_sem.clear_inactive_reads();
// try to ensure that CL has done disk flushing
future<> maybe_shutdown_commitlog = _commitlog != nullptr ? _commitlog->shutdown() : make_ready_future<>();
return maybe_shutdown_commitlog.then([this] {
@@ -1878,28 +1859,26 @@ future<> database::truncate(const keyspace& ks, column_family& cf, timestamp_fun
return cf.run_with_compaction_disabled([this, &cf, should_flush, auto_snapshot, tsf = std::move(tsf), low_mark]() mutable {
future<> f = make_ready_future<>();
bool did_flush = false;
if (should_flush && cf.can_flush()) {
if (should_flush) {
// TODO:
// this is not really a guarantee at all that we've actually
// gotten all things to disk. Again, need queue-ish or something.
f = cf.flush();
did_flush = true;
} else {
f = cf.clear();
}
return f.then([this, &cf, auto_snapshot, tsf = std::move(tsf), low_mark, should_flush, did_flush] {
return f.then([this, &cf, auto_snapshot, tsf = std::move(tsf), low_mark, should_flush] {
dblog.debug("Discarding sstable data for truncated CF + indexes");
// TODO: notify truncation
return tsf().then([this, &cf, auto_snapshot, low_mark, should_flush, did_flush](db_clock::time_point truncated_at) {
return tsf().then([this, &cf, auto_snapshot, low_mark, should_flush](db_clock::time_point truncated_at) {
future<> f = make_ready_future<>();
if (auto_snapshot) {
auto name = format("{:d}-{}", truncated_at.time_since_epoch().count(), cf.schema()->cf_name());
f = cf.snapshot(*this, name);
}
return f.then([this, &cf, truncated_at, low_mark, should_flush, did_flush] {
return cf.discard_sstables(truncated_at).then([this, &cf, truncated_at, low_mark, should_flush, did_flush](db::replay_position rp) {
return f.then([this, &cf, truncated_at, low_mark, should_flush] {
return cf.discard_sstables(truncated_at).then([this, &cf, truncated_at, low_mark, should_flush](db::replay_position rp) {
// TODO: indexes.
// Note: since discard_sstables was changed to only count tables owned by this shard,
// we can get zero rp back. Changed assert, and ensure we save at least low_mark.
@@ -1907,7 +1886,7 @@ future<> database::truncate(const keyspace& ks, column_family& cf, timestamp_fun
// We nowadays do not flush tables with sstables but autosnapshot=false. This means
// the low_mark assertion does not hold, because we maybe/probably never got around to
// creating the sstables that would create them.
assert(!did_flush || low_mark <= rp || rp == db::replay_position());
assert(!should_flush || low_mark <= rp || rp == db::replay_position());
rp = std::max(low_mark, rp);
return truncate_views(cf, truncated_at, should_flush).then([&cf, truncated_at, rp] {
// save_truncation_record() may actually fail after we cached the truncation time

View File

@@ -224,10 +224,6 @@ public:
return bool(_seal_immediate_fn);
}
bool can_flush() const {
return may_flush() && !empty();
}
bool empty() const {
for (auto& m : _memtables) {
if (!m->empty()) {
@@ -509,8 +505,6 @@ private:
utils::phased_barrier _pending_reads_phaser;
// Corresponding phaser for in-progress streams
utils::phased_barrier _pending_streams_phaser;
// Corresponding phaser for in-progress flushes
utils::phased_barrier _pending_flushes_phaser;
// This field cashes the last truncation time for the table.
// The master resides in system.truncated table
@@ -771,23 +765,6 @@ public:
future<> clear(); // discards memtable(s) without flushing them to disk.
future<db::replay_position> discard_sstables(db_clock::time_point);
// Make sure the generation numbers are sequential, starting from "start".
// Generations before "start" are left untouched.
//
// Return the highest generation number seen so far
//
// Word of warning: although this function will reshuffle anything over "start", it is
// very dangerous to do that with live SSTables. This is meant to be used with SSTables
// that are not yet managed by the system.
//
// Parameter all_generations stores the generation of all SSTables in the system, so it
// will be easy to determine which SSTable is new.
// An example usage would query all shards asking what is the highest SSTable number known
// to them, and then pass that + 1 as "start".
future<std::vector<sstables::entry_descriptor>> reshuffle_sstables(std::set<int64_t> all_generations, int64_t start);
bool can_flush() const;
// FIXME: this is just an example, should be changed to something more
// general. compact_all_sstables() starts a compaction of all sstables.
// It doesn't flush the current memtable first. It's just a ad-hoc method,
@@ -940,14 +917,6 @@ public:
return _pending_streams_phaser.advance_and_await();
}
future<> await_pending_flushes() {
return _pending_flushes_phaser.advance_and_await();
}
future<> await_pending_ops() {
return when_all(await_pending_reads(), await_pending_writes(), await_pending_streams(), await_pending_flushes()).discard_result();
}
void add_or_update_view(view_ptr v);
void remove_view(view_ptr v);
void clear_views();

View File

@@ -31,7 +31,6 @@
#include <seastar/core/print.hh>
#include <seastar/util/log.hh>
#include "cdc/cdc_extension.hh"
#include "config.hh"
#include "extensions.hh"
#include "log.hh"
@@ -695,7 +694,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, replace_address(this, "replace_address", value_status::Used, "", "The listen_address or broadcast_address of the dead node to replace. Same as -Dcassandra.replace_address.")
, replace_address_first_boot(this, "replace_address_first_boot", value_status::Used, "", "Like replace_address option, but if the node has been bootstrapped successfully it will be ignored. Same as -Dcassandra.replace_address_first_boot.")
, override_decommission(this, "override_decommission", value_status::Used, false, "Set true to force a decommissioned node to join the cluster")
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, false, "Set true to use enable repair based node operations instead of streaming based")
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, true, "Set true to use enable repair based node operations instead of streaming based")
, ring_delay_ms(this, "ring_delay_ms", value_status::Used, 30 * 1000, "Time a node waits to hear from other nodes before joining the ring in milliseconds. Same as -Dcassandra.ring_delay_ms in cassandra.")
, shadow_round_ms(this, "shadow_round_ms", value_status::Used, 300 * 1000, "The maximum gossip shadow round time. Can be used to reduce the gossip feature check time during node boot up.")
, fd_max_interval_ms(this, "fd_max_interval_ms", value_status::Used, 2 * 1000, "The maximum failure_detector interval time in milliseconds. Interval larger than the maximum will be ignored. Larger cluster may need to increase the default.")
@@ -793,10 +792,6 @@ db::config::config()
db::config::~config()
{}
void db::config::add_cdc_extension() {
_extensions->add_schema_extension<cdc::cdc_extension>(cdc::cdc_extension::NAME);
}
void db::config::setup_directories() {
maybe_in_workdir(commitlog_directory, "commitlog");
maybe_in_workdir(data_file_directories, "data");
@@ -879,7 +874,7 @@ db::fs::path db::config::get_conf_sub(db::fs::path sub) {
}
bool db::config::check_experimental(experimental_features_t::feature f) const {
if (experimental() && f != experimental_features_t::UNUSED && f != experimental_features_t::UNUSED_CDC) {
if (experimental() && f != experimental_features_t::UNUSED) {
return true;
}
const auto& optval = experimental_features();
@@ -933,13 +928,11 @@ std::unordered_map<sstring, db::experimental_features_t::feature> db::experiment
// https://github.com/scylladb/scylla/pull/5369#discussion_r353614807
// Lightweight transactions are no longer experimental. Map them
// to UNUSED switch for a while, then remove altogether.
// Change Data Capture is no longer experimental. Map it
// to UNUSED_CDC switch for a while, then remove altogether.
return {{"lwt", UNUSED}, {"udf", UDF}, {"cdc", UNUSED_CDC}, {"alternator-streams", ALTERNATOR_STREAMS}};
return {{"lwt", UNUSED}, {"udf", UDF}, {"cdc", CDC}};
}
std::vector<enum_option<db::experimental_features_t>> db::experimental_features_t::all() {
return {UDF, ALTERNATOR_STREAMS};
return {UDF, CDC};
}
template struct utils::config_file::named_value<seastar::log_level>;

View File

@@ -81,7 +81,7 @@ namespace db {
/// Enumeration of all valid values for the `experimental` config entry.
struct experimental_features_t {
enum feature { UNUSED, UDF, UNUSED_CDC, ALTERNATOR_STREAMS };
enum feature { UNUSED, UDF, CDC };
static std::unordered_map<sstring, feature> map(); // See enum_option.
static std::vector<enum_option<experimental_features_t>> all();
};
@@ -92,9 +92,6 @@ public:
config(std::shared_ptr<db::extensions>);
~config();
// For testing only
void add_cdc_extension();
/// True iff the feature is enabled.
bool check_experimental(experimental_features_t::feature f) const;

View File

@@ -113,7 +113,7 @@ future<> cql_table_large_data_handler::record_large_cells(const sstables::sstabl
auto ck_str = key_to_str(*clustering_key, s);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, format("{} {}", ck_str, column_name), extra_fields, ck_str, column_name);
} else {
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, column_name, extra_fields, data_value::make_null(utf8_type), column_name);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, column_name, extra_fields, nullptr, column_name);
}
}
@@ -125,7 +125,7 @@ future<> cql_table_large_data_handler::record_large_rows(const sstables::sstable
std::string ck_str = key_to_str(*clustering_key, s);
return try_record("row", sst, partition_key, int64_t(row_size), "row", ck_str, extra_fields, ck_str);
} else {
return try_record("row", sst, partition_key, int64_t(row_size), "static row", "", extra_fields, data_value::make_null(utf8_type));
return try_record("row", sst, partition_key, int64_t(row_size), "static row", "", extra_fields, nullptr);
}
}

View File

@@ -111,12 +111,27 @@ public:
return make_ready_future<>();
}
future<> maybe_delete_large_data_entries(const schema& /*s*/, sstring /*filename*/, uint64_t /*data_size*/) {
future<> maybe_delete_large_data_entries(const schema& s, sstring filename, uint64_t data_size) {
assert(running());
// Deletion of large data entries is disabled due to #7668
// They will evetually expire based on the 30 days TTL.
return make_ready_future<>();
future<> large_partitions = make_ready_future<>();
if (__builtin_expect(data_size > _partition_threshold_bytes, false)) {
large_partitions = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_PARTITIONS);
});
}
future<> large_rows = make_ready_future<>();
if (__builtin_expect(data_size > _row_threshold_bytes, false)) {
large_rows = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_ROWS);
});
}
future<> large_cells = make_ready_future<>();
if (__builtin_expect(data_size > _cell_threshold_bytes, false)) {
large_cells = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_CELLS);
});
}
return when_all(std::move(large_partitions), std::move(large_rows), std::move(large_cells)).discard_result();
}
const large_data_handler::stats& stats() const { return _stats; }

View File

@@ -58,7 +58,6 @@
#include "schema_registry.hh"
#include "mutation_query.hh"
#include "system_keyspace.hh"
#include "system_distributed_keyspace.hh"
#include "cql3/cql3_type.hh"
#include "cql3/functions/functions.hh"
#include "cql3/util.hh"
@@ -105,11 +104,6 @@ using namespace std::chrono_literals;
static logging::logger diff_logger("schema_diff");
static bool is_extra_durable(const sstring& ks_name, const sstring& cf_name) {
return (is_system_keyspace(ks_name) && db::system_keyspace::is_extra_durable(cf_name))
|| (ks_name == db::system_distributed_keyspace::NAME && db::system_distributed_keyspace::is_extra_durable(cf_name));
}
/** system.schema_* tables used to store keyspace/table/type attributes prior to C* 3.0 */
namespace db {
@@ -1208,42 +1202,7 @@ static void merge_tables_and_views(distributed<service::storage_proxy>& proxy,
return create_table_from_mutations(proxy, std::move(sm));
});
auto views_diff = diff_table_or_view(proxy, std::move(views_before), std::move(views_after), [&] (schema_mutations sm) {
// The view schema mutation should be created with reference to the base table schema because we definitely know it by now.
// If we don't do it we are leaving a window where write commands to this schema are illegal.
// There are 3 possibilities:
// 1. The table was altered - in this case we want the view to correspond to this new table schema.
// 2. The table was just created - the table is guarantied to be published with the view in that case.
// 3. The view itself was altered - in that case we already know the base table so we can take it from
// the database object.
view_ptr vp = create_view_from_mutations(proxy, std::move(sm));
schema_ptr base_schema;
for (auto&& s : tables_diff.altered) {
if (s.new_schema.get()->ks_name() == vp->ks_name() && s.new_schema.get()->cf_name() == vp->view_info()->base_name() ) {
base_schema = s.new_schema;
break;
}
}
if (!base_schema) {
for (auto&& s : tables_diff.created) {
if (s.get()->ks_name() == vp->ks_name() && s.get()->cf_name() == vp->view_info()->base_name() ) {
base_schema = s;
break;
}
}
}
if (!base_schema) {
base_schema = proxy.local().local_db().find_schema(vp->ks_name(), vp->view_info()->base_name());
}
// Now when we have a referenced base - just in case we are registering an old view (this can happen in a mixed cluster)
// lets make it write enabled by updating it's compute columns.
view_ptr fixed_vp = maybe_fix_legacy_secondary_index_mv_schema(proxy.local().get_db().local(), vp, base_schema, preserve_version::yes);
if(fixed_vp) {
vp = fixed_vp;
}
vp->view_info()->set_base_info(vp->view_info()->make_base_dependent_view_info(*base_schema));
return vp;
return create_view_from_mutations(proxy, std::move(sm));
});
proxy.local().get_db().invoke_on_all([&] (database& db) {
@@ -2540,7 +2499,7 @@ schema_ptr create_table_from_mutations(const schema_ctxt& ctxt, schema_mutations
builder.with_sharder(smp::count, ctxt.murmur3_partitioner_ignore_msb_bits());
}
if (is_extra_durable(ks_name, cf_name)) {
if (is_system_keyspace(ks_name) && is_extra_durable(cf_name)) {
builder.set_wait_for_sync_to_commitlog(true);
}
@@ -3068,40 +3027,39 @@ std::vector<sstring> all_table_names(schema_features features) {
boost::adaptors::transformed([] (auto schema) { return schema->cf_name(); }));
}
view_ptr maybe_fix_legacy_secondary_index_mv_schema(database& db, const view_ptr& v, schema_ptr base_schema, preserve_version preserve_version) {
future<> maybe_update_legacy_secondary_index_mv_schema(service::migration_manager& mm, database& db, view_ptr v) {
// TODO(sarna): Remove once computed columns are guaranteed to be featured in the whole cluster.
// Legacy format for a secondary index used a hardcoded "token" column, which ensured a proper
// order for indexed queries. This "token" column is now implemented as a computed column,
// but for the sake of compatibility we assume that there might be indexes created in the legacy
// format, where "token" is not marked as computed. Once we're sure that all indexes have their
// columns marked as computed (because they were either created on a node that supports computed
// columns or were fixed by this utility function), it's safe to remove this function altogether.
if (!db.features().cluster_supports_computed_columns()) {
return make_ready_future<>();
}
if (v->clustering_key_size() == 0) {
return view_ptr(nullptr);
return make_ready_future<>();
}
const column_definition& first_view_ck = v->clustering_key_columns().front();
if (first_view_ck.is_computed()) {
return view_ptr(nullptr);
}
if (!base_schema) {
base_schema = db.find_schema(v->view_info()->base_id());
return make_ready_future<>();
}
table& base = db.find_column_family(v->view_info()->base_id());
schema_ptr base_schema = base.schema();
// If the first clustering key part of a view is a column with name not found in base schema,
// it implies it might be backing an index created before computed columns were introduced,
// and as such it must be recreated properly.
if (!base_schema->columns_by_name().contains(first_view_ck.name())) {
schema_builder builder{schema_ptr(v)};
builder.mark_column_computed(first_view_ck.name(), std::make_unique<token_column_computation>());
if (preserve_version) {
builder.with_version(v->version());
}
return view_ptr(builder.build());
return mm.announce_view_update(view_ptr(builder.build()), true);
}
return view_ptr(nullptr);
return make_ready_future<>();
}
namespace legacy {
table_schema_version schema_mutations::digest() const {

View File

@@ -238,9 +238,7 @@ std::vector<mutation> make_update_view_mutations(lw_shared_ptr<keyspace_metadata
std::vector<mutation> make_drop_view_mutations(lw_shared_ptr<keyspace_metadata> keyspace, view_ptr view, api::timestamp_type timestamp);
class preserve_version_tag {};
using preserve_version = bool_class<preserve_version_tag>;
view_ptr maybe_fix_legacy_secondary_index_mv_schema(database& db, const view_ptr& v, schema_ptr base_schema, preserve_version preserve_version);
future<> maybe_update_legacy_secondary_index_mv_schema(service::migration_manager& mm, database& db, view_ptr v);
sstring serialize_kind(column_kind kind);
column_kind deserialize_kind(sstring kind);

View File

@@ -201,10 +201,10 @@ static future<std::vector<token_range>> get_local_ranges(database& db) {
// All queries will be on that table, where all entries are text and there's no notion of
// token ranges form the CQL point of view.
auto left_inf = boost::find_if(ranges, [] (auto&& r) {
return r.end() && (!r.start() || r.start()->value() == dht::minimum_token());
return !r.start() || r.start()->value() == dht::minimum_token();
});
auto right_inf = boost::find_if(ranges, [] (auto&& r) {
return r.start() && (!r.end() || r.end()->value() == dht::maximum_token());
return !r.end() || r.start()->value() == dht::maximum_token();
});
if (left_inf != right_inf && left_inf != ranges.end() && right_inf != ranges.end()) {
local_ranges.push_back(token_range{to_bytes(right_inf->start()), to_bytes(left_inf->end())});

View File

@@ -43,13 +43,9 @@
namespace db {
future<> snapshot_ctl::check_snapshot_not_exist(sstring ks_name, sstring name, std::optional<std::vector<sstring>> filter) {
future<> snapshot_ctl::check_snapshot_not_exist(sstring ks_name, sstring name) {
auto& ks = _db.local().find_keyspace(ks_name);
return parallel_for_each(ks.metadata()->cf_meta_data(), [this, ks_name = std::move(ks_name), name = std::move(name), filter = std::move(filter)] (auto& pair) {
auto& cf_name = pair.first;
if (filter && std::find(filter->begin(), filter->end(), cf_name) == filter->end()) {
return make_ready_future<>();
}
return parallel_for_each(ks.metadata()->cf_meta_data(), [this, ks_name = std::move(ks_name), name = std::move(name)] (auto& pair) {
auto& cf = _db.local().find_column_family(pair.second);
return cf.snapshot_exists(name).then([ks_name = std::move(ks_name), name] (bool exists) {
if (exists) {
@@ -115,7 +111,7 @@ future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<
}
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag)] {
return check_snapshot_not_exist(ks_name, tag, tables).then([this, ks_name, tables, tag] {
return check_snapshot_not_exist(ks_name, tag).then([this, ks_name, tables = std::move(tables), tag] {
return do_with(std::vector<sstring>(std::move(tables)),[this, ks_name, tag](const std::vector<sstring>& tables) {
return do_for_each(tables, [ks_name, tag, this] (const sstring& table_name) {
if (table_name.find(".") != sstring::npos) {

View File

@@ -40,8 +40,6 @@
#pragma once
#include <vector>
#include <seastar/core/sharded.hh>
#include <seastar/core/future.hh>
#include "database.hh"
@@ -114,7 +112,7 @@ private:
seastar::rwlock _lock;
seastar::gate _ops;
future<> check_snapshot_not_exist(sstring ks_name, sstring name, std::optional<std::vector<sstring>> filter = {});
future<> check_snapshot_not_exist(sstring ks_name, sstring name);
template <typename Func>
std::result_of_t<Func()> run_snapshot_modify_operation(Func&&);

View File

@@ -113,10 +113,6 @@ static std::vector<schema_ptr> all_tables() {
};
}
bool system_distributed_keyspace::is_extra_durable(const sstring& cf_name) {
return cf_name == CDC_TOPOLOGY_DESCRIPTION;
}
system_distributed_keyspace::system_distributed_keyspace(cql3::query_processor& qp, service::migration_manager& mm)
: _qp(qp)
, _mm(mm) {

View File

@@ -64,10 +64,6 @@ private:
service::migration_manager& _mm;
public:
/* Should writes to the given table always be synchronized by commitlog (flushed to disk)
* before being acknowledged? */
static bool is_extra_durable(const sstring& cf_name);
system_distributed_keyspace(cql3::query_processor&, service::migration_manager&);
future<> start();

View File

@@ -1241,14 +1241,6 @@ future<> mutate_MV(
}
}
}
// It's still possible that a target endpoint is dupliated in the remote endpoints list,
// so let's get rid of the duplicate if it exists
if (target_endpoint) {
auto remote_it = std::find(remote_endpoints.begin(), remote_endpoints.end(), *target_endpoint);
if (remote_it != remote_endpoints.end()) {
remote_endpoints.erase(remote_it);
}
}
if (target_endpoint && *target_endpoint == my_address) {
++stats.view_updates_pushed_local;

View File

@@ -58,8 +58,7 @@ public:
template<typename T, typename... Args>
void feed_hash(const T& value, Args&&... args) {
// FIXME uncomment the noexcept marking once clang bug 50994 is fixed or gcc compilation is turned on
std::visit([&] (auto& hasher) /* noexcept(noexcept(::feed_hash(hasher, value, args...))) */ -> void {
std::visit([&] (auto& hasher) noexcept -> void {
::feed_hash(hasher, value, std::forward<Args>(args)...);
}, _impl);
};

View File

@@ -24,8 +24,6 @@ import os
import sys
import tempfile
import tarfile
import shutil
import glob
from scylla_util import *
import argparse
@@ -63,9 +61,6 @@ if __name__ == '__main__':
f.write(data)
with tarfile.open('/var/tmp/node_exporter-{version}.linux-amd64.tar.gz'.format(version=VERSION)) as tf:
tf.extractall(INSTALL_DIR)
shutil.chown(f'{INSTALL_DIR}/node_exporter-{VERSION}.linux-amd64', 'root', 'root')
for f in glob.glob(f'{INSTALL_DIR}/node_exporter-{VERSION}.linux-amd64/*'):
shutil.chown(f, 'root', 'root')
os.remove('/var/tmp/node_exporter-{version}.linux-amd64.tar.gz'.format(version=VERSION))
if node_exporter_p.exists():
node_exporter_p.unlink()

View File

@@ -87,8 +87,7 @@ WantedBy=multi-user.target
run('sysctl -p /etc/sysctl.d/99-scylla-coredump.conf')
fp = tempfile.NamedTemporaryFile()
fp.write(b'ulimit -c unlimited\n')
fp.write(b'kill -SEGV $$\n')
fp.write(b'kill -SEGV $$')
fp.flush()
p = subprocess.Popen(['/bin/bash', fp.name], stdout=subprocess.PIPE)
pid = p.pid

View File

@@ -22,7 +22,6 @@
import os
import sys
import argparse
import shlex
import distro
from scylla_util import *
@@ -34,22 +33,12 @@ if __name__ == '__main__':
if os.getuid() > 0:
print('Requires root permission.')
sys.exit(1)
parser = argparse.ArgumentParser(description='CPU scaling setup script for Scylla.')
parser.add_argument('--force', dest='force', action='store_true',
help='force running setup even CPU scaling unsupported')
args = parser.parse_args()
if not args.force and not os.path.exists('/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'):
if not os.path.exists('/sys/devices/system/cpu/cpufreq/policy0/scaling_governor'):
print('This computer doesn\'t supported CPU scaling configuration.')
sys.exit(0)
if is_debian_variant():
if not shutil.which('cpufreq-set'):
apt_install('cpufrequtils')
try:
ondemand = systemd_unit('ondemand')
ondemand.disable()
except:
pass
cfg = sysconfig_parser('/etc/default/cpufrequtils')
cfg.set('GOVERNOR', 'performance')
cfg.commit()

View File

@@ -244,12 +244,12 @@ if __name__ == "__main__":
# and https://cloud.google.com/compute/docs/disks/local-ssd#nvme
# note that scylla iotune might measure more, this is GCP recommended
mbs=1024*1024
if nr_disks >= 1 and nr_disks < 4:
if nr_disks >= 1 & nr_disks < 4:
disk_properties["read_iops"] = 170000 * nr_disks
disk_properties["read_bandwidth"] = 660 * mbs * nr_disks
disk_properties["write_iops"] = 90000 * nr_disks
disk_properties["write_bandwidth"] = 350 * mbs * nr_disks
elif nr_disks >= 4 and nr_disks <= 8:
elif nr_disks >= 4 & nr_disks <= 8:
disk_properties["read_iops"] = 680000
disk_properties["read_bandwidth"] = 2650 * mbs
disk_properties["write_iops"] = 360000

View File

@@ -91,12 +91,12 @@ if __name__ == '__main__':
with open('/etc/ntp.conf') as f:
conf = f.read()
if args.subdomain:
conf2 = re.sub(r'(server|pool)\s+([0-9]+)\.(\S+)\.pool\.ntp\.org', '\\1 \\2.{}.pool.ntp.org'.format(args.subdomain), conf, flags=re.MULTILINE)
conf2 = re.sub(r'server\s+([0-9]+)\.(\S+)\.pool\.ntp\.org', 'server \\1.{}.pool.ntp.org'.format(args.subdomain), conf, flags=re.MULTILINE)
with open('/etc/ntp.conf', 'w') as f:
f.write(conf2)
conf = conf2
match = re.search(r'^(server|pool)\s+(\S*)(\s+\S+)?', conf, flags=re.MULTILINE)
server = match.group(2)
match = re.search(r'^server\s+(\S*)(\s+\S+)?', conf, flags=re.MULTILINE)
server = match.group(1)
ntpd = systemd_unit('ntpd.service')
ntpd.stop()
# ignore error, ntpd may able to adjust clock later

View File

@@ -143,3 +143,4 @@ if __name__ == '__main__':
print(f'Exception occurred while creating perftune.yaml: {e}')
print('To fix the error, please re-run scylla_setup.')
sys.exit(1)

View File

@@ -36,7 +36,7 @@ if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Configure RAID volume for Scylla.')
parser.add_argument('--disks', required=True,
help='specify disks for RAID')
parser.add_argument('--raiddev',
parser.add_argument('--raiddev', default='/dev/md0',
help='MD device name for RAID')
parser.add_argument('--enable-on-nextboot', '--update-fstab', action='store_true', default=False,
help='mount RAID on next boot')
@@ -73,25 +73,9 @@ if __name__ == '__main__':
print('{} is busy'.format(disk))
sys.exit(1)
if len(disks) == 1 and not args.force_raid:
raid = False
fsdev = disks[0]
else:
raid = True
if args.raiddev is None:
raiddevs_to_try = [f'/dev/md{i}' for i in range(10)]
else:
raiddevs_to_try = [args.raiddev, ]
for fsdev in raiddevs_to_try:
raiddevname = os.path.basename(fsdev)
if not os.path.exists(f'/sys/block/{raiddevname}/md/array_state'):
break
print(f'{fsdev} is already using')
else:
if args.raiddev is None:
print("Can't find unused /dev/mdX")
sys.exit(1)
print(f'{fsdev} will be used to setup a RAID')
if os.path.exists(args.raiddev):
print('{} is already using'.format(args.raiddev))
sys.exit(1)
if os.path.ismount(mount_at):
print('{} is already mounted'.format(mount_at))
@@ -110,6 +94,13 @@ if __name__ == '__main__':
except SystemdException:
md_service = systemd_unit('mdadm.service')
if len(disks) == 1 and not args.force_raid:
raid = False
fsdev = disks[0]
else:
raid = True
fsdev = args.raiddev
print('Creating {type} for scylla using {nr_disk} disk(s): {disks}'.format(type='RAID0' if raid else 'XFS volume', nr_disk=len(disks), disks=args.disks))
if distro.name() == 'Ubuntu' and distro.version() == '14.04':
if raid:
@@ -160,7 +151,7 @@ Before=scylla-server.service
After={after}
[Mount]
What=/dev/disk/by-uuid/{uuid}
What=UUID={uuid}
Where={mount_at}
Type=xfs
Options=noatime

View File

@@ -34,7 +34,6 @@ from pathlib import Path
import distro
from multiprocessing import cpu_count
def scriptsdir_p():
p = Path(sys.argv[0]).resolve()
@@ -93,7 +92,7 @@ def scyllabindir():
# @param headers dict of k:v
def curl(url, headers=None, byte=False, timeout=3, max_retries=5, retry_interval=5):
def curl(url, headers=None, byte=False, timeout=3, max_retries=5):
retries = 0
while True:
try:
@@ -103,8 +102,9 @@ def curl(url, headers=None, byte=False, timeout=3, max_retries=5, retry_interval
return res.read()
else:
return res.read().decode('utf-8')
except urllib.error.URLError:
time.sleep(retry_interval)
except urllib.error.HTTPError:
logging.warning("Failed to grab %s..." % url)
time.sleep(5)
retries += 1
if retries >= max_retries:
raise
@@ -188,7 +188,7 @@ class gcp_instance:
"""get list of nvme disks from metadata server"""
import json
try:
disksREST=self.__instance_metadata("disks", True)
disksREST=self.__instance_metadata("disks")
disksobj=json.loads(disksREST)
nvmedisks=list(filter(self.isNVME, disksobj))
except Exception as e:
@@ -236,8 +236,7 @@ class gcp_instance:
def instance_size(self):
"""Returns the size of the instance we are running in. i.e.: 2"""
instancetypesplit = self.instancetype.split("-")
return instancetypesplit[2] if len(instancetypesplit)>2 else 0
return self.instancetype.split("-")[2]
def instance_class(self):
"""Returns the class of the instance we are running in. i.e.: n2"""
@@ -299,31 +298,22 @@ class gcp_instance:
return self.__firstNvmeSize
def is_recommended_instance(self):
if not self.is_unsupported_instance_class() and self.is_supported_instance_class() and self.is_recommended_instance_size():
if self.is_recommended_instance_size() and not self.is_unsupported_instance_class() and self.is_supported_instance_class():
# at least 1:2GB cpu:ram ratio , GCP is at 1:4, so this should be fine
if self.cpu/self.memoryGB < 0.5:
diskCount = self.nvmeDiskCount
# to reach max performance for > 16 disks we mandate 32 or more vcpus
# https://cloud.google.com/compute/docs/disks/local-ssd#performance
if diskCount >= 16 and self.cpu < 32:
logging.warning(
"This machine doesn't have enough CPUs for allocated number of NVMEs (at least 32 cpus for >=16 disks). Performance will suffer.")
return False
if diskCount < 1:
logging.warning("No ephemeral disks were found.")
return False
diskSize = self.firstNvmeSize
max_disktoramratio = 105
# 30:1 Disk/RAM ratio must be kept at least(AWS), we relax this a little bit
# on GCP we are OK with {max_disktoramratio}:1 , n1-standard-2 can cope with 1 disk, not more
disktoramratio = (diskCount * diskSize) / self.memoryGB
if (disktoramratio > max_disktoramratio):
logging.warning(
f"Instance disk-to-RAM ratio is {disktoramratio}, which is higher than the recommended ratio {max_disktoramratio}. Performance may suffer.")
return False
return True
else:
logging.warning("At least 2G of RAM per CPU is needed. Performance will suffer.")
# 30:1 Disk/RAM ratio must be kept at least(AWS), we relax this a little bit
# on GCP we are OK with 50:1 , n1-standard-2 can cope with 1 disk, not more
diskCount = self.nvmeDiskCount
# to reach max performance for > 16 disks we mandate 32 or more vcpus
# https://cloud.google.com/compute/docs/disks/local-ssd#performance
if diskCount >= 16 and self.cpu < 32:
return False
diskSize= self.firstNvmeSize
if diskCount < 1:
return False
disktoramratio = (diskCount*diskSize)/self.memoryGB
if (disktoramratio <= 50) and (disktoramratio > 0):
return True
return False
def private_ipv4(self):
@@ -375,8 +365,6 @@ class aws_instance:
raise Exception("found more than one disk mounted at root'".format(root_dev_candidates))
root_dev = root_dev_candidates[0].device
if root_dev == '/dev/root':
root_dev = run('findmnt -n -o SOURCE /', shell=True, check=True, capture_output=True, encoding='utf-8').stdout.strip()
nvmes_present = list(filter(nvme_re.match, os.listdir("/dev")))
return {"root": [ root_dev ], "ephemeral": [ x for x in nvmes_present if not root_dev.startswith(os.path.join("/dev/", x)) ] }
@@ -410,7 +398,7 @@ class aws_instance:
def is_aws_instance(cls):
"""Check if it's AWS instance via query to metadata server."""
try:
curl(cls.META_DATA_BASE_URL, max_retries=2, retry_interval=1)
curl(cls.META_DATA_BASE_URL, max_retries=2)
return True
except (urllib.error.URLError, urllib.error.HTTPError):
return False
@@ -474,7 +462,7 @@ class aws_instance:
def ebs_disks(self):
"""Returns all EBS disks"""
return set(self._disks["ebs"])
return set(self._disks["ephemeral"])
def public_ipv4(self):
"""Returns the public IPv4 address of this instance"""
@@ -502,7 +490,9 @@ class aws_instance:
return curl(self.META_DATA_BASE_URL + "user-data")
# When a CLI tool is not installed, use relocatable CLI tool provided by Scylla
scylla_env = os.environ.copy()
scylla_env['PATH'] = '{}:{}'.format(scyllabindir(), scylla_env['PATH'])
scylla_env['DEBIAN_FRONTEND'] = 'noninteractive'
def run(cmd, shell=False, silent=False, exception=True):

View File

@@ -1,2 +1,2 @@
# Raise max AIO events
fs.aio-max-nr = 5578536
fs.aio-max-nr = 1048576

View File

@@ -1,4 +0,0 @@
# allocate enough inotify instances for large machines
# each tls instance needs 1 inotify instance, and there can be
# multiple tls instances per shard.
fs.inotify.max_user_instances = 1200

View File

@@ -1,5 +1,7 @@
[Unit]
Description=Run Scylla fstrim daily
After=scylla-server.service
BindsTo=scylla-server.service
[Timer]
OnCalendar=Sat *-*-* 00:00:00

View File

@@ -9,9 +9,8 @@ if [[ $KVER =~ 3\.13\.0\-([0-9]+)-generic ]]; then
else
# expect failures in virtualized environments
sysctl -p/usr/lib/sysctl.d/99-scylla-sched.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-vm.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-inotify.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-aio.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-vm.conf || :
fi
#DEBHELPER#

View File

@@ -12,6 +12,8 @@ case "$1" in
if [ "$1" = "purge" ]; then
rm -rf /etc/systemd/system/scylla-server.service.d/
fi
rm -f /etc/systemd/system/var-lib-systemd-coredump.mount
rm -f /etc/systemd/system/var-lib-scylla.mount
;;
esac

View File

@@ -5,8 +5,8 @@ MAINTAINER Avi Kivity <avi@cloudius-systems.com>
ENV container docker
# The SCYLLA_REPO_URL argument specifies the URL to the RPM repository this Docker image uses to install Scylla. The default value is the Scylla's unstable RPM repository, which contains the daily build.
ARG SCYLLA_REPO_URL=http://downloads.scylladb.com/rpm/unstable/centos/scylla-4.3/latest/scylla.repo
ARG VERSION=4.3.rc0
ARG SCYLLA_REPO_URL=http://downloads.scylladb.com/rpm/unstable/centos/master/latest/scylla.repo
ARG VERSION=666.development
ADD scylla_bashrc /scylla_bashrc

View File

@@ -7,7 +7,7 @@ Group: Applications/Databases
License: AGPLv3
URL: http://www.scylladb.com/
Source0: %{reloc_pkg}
Requires: %{product}-server = %{version} %{product}-conf = %{version} %{product}-python3 = %{version} %{product}-kernel-conf = %{version} %{product}-jmx = %{version} %{product}-tools = %{version} %{product}-tools-core = %{version}
Requires: %{product}-server = %{version} %{product}-conf = %{version} %{product}-kernel-conf = %{version} %{product}-jmx = %{version} %{product}-tools = %{version} %{product}-tools-core = %{version}
Obsoletes: scylla-server < 1.1
%global _debugsource_template %{nil}
@@ -52,7 +52,7 @@ Summary: The Scylla database server
License: AGPLv3
URL: http://www.scylladb.com/
Requires: kernel >= 3.10.0-514
Requires: %{product}-conf = %{version} %{product}-python3 = %{version}
Requires: %{product}-conf %{product}-python3
Conflicts: abrt
AutoReqProv: no
@@ -76,18 +76,13 @@ getent passwd scylla || /usr/sbin/useradd -g scylla -s /sbin/nologin -r -d %{_sh
%post server
/opt/scylladb/scripts/scylla_post_install.sh
if [ $1 -eq 1 ] ; then
/usr/bin/systemctl preset scylla-server.service ||:
fi
%systemd_post scylla-server.service
%preun server
if [ $1 -eq 0 ] ; then
/usr/bin/systemctl --no-reload disable scylla-server.service ||:
/usr/bin/systemctl stop scylla-server.service ||:
fi
%systemd_preun scylla-server.service
%postun server
/usr/bin/systemctl daemon-reload ||:
%systemd_postun scylla-server.service
%posttrans server
if [ -d /tmp/%{name}-%{version}-%{release} ]; then
@@ -134,12 +129,13 @@ rm -rf $RPM_BUILD_ROOT
%attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla-housekeeping
%ghost /etc/systemd/system/scylla-helper.slice.d/
%ghost /etc/systemd/system/scylla-helper.slice.d/memory.conf
%ghost /etc/systemd/system/scylla-server.service.d/
%ghost /etc/systemd/system/scylla-server.service.d/capabilities.conf
%ghost /etc/systemd/system/scylla-server.service.d/mounts.conf
/etc/systemd/system/scylla-server.service.d/dependencies.conf
%ghost %config /etc/systemd/system/var-lib-systemd-coredump.mount
%ghost /etc/systemd/system/scylla-server.service.d/dependencies.conf
%ghost /etc/systemd/system/var-lib-systemd-coredump.mount
%ghost /etc/systemd/system/scylla-cpupower.service
%ghost %config /etc/systemd/system/var-lib-scylla.mount
%ghost /etc/systemd/system/var-lib-scylla.mount
%package conf
Group: Applications/Databases
@@ -194,8 +190,6 @@ Summary: Scylla configuration package for the Linux kernel
License: AGPLv3
URL: http://www.scylladb.com/
Requires: kmod
# tuned overwrites our sysctl settings
Obsoletes: tuned
%description kernel-conf
This package contains Linux kernel configuration changes for the Scylla database. Install this package
@@ -205,9 +199,8 @@ if Scylla is the main application on your server and you wish to optimize its la
# We cannot use the sysctl_apply rpm macro because it is not present in 7.0
# following is a "manual" expansion
/usr/lib/systemd/systemd-sysctl 99-scylla-sched.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-vm.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-inotify.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-aio.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-vm.conf >/dev/null 2>&1 || :
%files kernel-conf
%defattr(-,root,root)

View File

@@ -1,5 +1,5 @@
# Getting Started With ScyllaDB Alternator
---
## Installing Scylla
Before you can start using ScyllaDB Alternator, you will have to have an up
and running scylla cluster configured to expose the alternator port.

View File

@@ -143,7 +143,6 @@ extern const std::string_view LWT;
extern const std::string_view PER_TABLE_PARTITIONERS;
extern const std::string_view PER_TABLE_CACHING;
extern const std::string_view DIGEST_FOR_NULL_VALUES;
extern const std::string_view ALTERNATOR_STREAMS;
}

View File

@@ -62,7 +62,6 @@ constexpr std::string_view features::LWT = "LWT";
constexpr std::string_view features::PER_TABLE_PARTITIONERS = "PER_TABLE_PARTITIONERS";
constexpr std::string_view features::PER_TABLE_CACHING = "PER_TABLE_CACHING";
constexpr std::string_view features::DIGEST_FOR_NULL_VALUES = "DIGEST_FOR_NULL_VALUES";
constexpr std::string_view features::ALTERNATOR_STREAMS = "ALTERNATOR_STREAMS";
static logging::logger logger("features");
@@ -87,7 +86,6 @@ feature_service::feature_service(feature_config cfg) : _config(cfg)
, _per_table_partitioners_feature(*this, features::PER_TABLE_PARTITIONERS)
, _per_table_caching_feature(*this, features::PER_TABLE_CACHING)
, _digest_for_null_values_feature(*this, features::DIGEST_FOR_NULL_VALUES)
, _alternator_streams_feature(*this, features::ALTERNATOR_STREAMS)
{}
feature_config feature_config_from_db_config(db::config& cfg, std::set<sstring> disabled) {
@@ -118,8 +116,8 @@ feature_config feature_config_from_db_config(db::config& cfg, std::set<sstring>
}
}
if (!cfg.check_experimental(db::experimental_features_t::ALTERNATOR_STREAMS)) {
fcfg._disabled_features.insert(sstring(gms::features::ALTERNATOR_STREAMS));
if (!cfg.check_experimental(db::experimental_features_t::CDC)) {
fcfg._disabled_features.insert(sstring(gms::features::CDC));
}
return fcfg;
@@ -189,7 +187,6 @@ std::set<std::string_view> feature_service::known_feature_set() {
gms::features::UDF,
gms::features::CDC,
gms::features::DIGEST_FOR_NULL_VALUES,
gms::features::ALTERNATOR_STREAMS,
};
for (const sstring& s : _config._disabled_features) {
@@ -269,7 +266,6 @@ void feature_service::enable(const std::set<std::string_view>& list) {
std::ref(_per_table_partitioners_feature),
std::ref(_per_table_caching_feature),
std::ref(_digest_for_null_values_feature),
std::ref(_alternator_streams_feature),
})
{
if (list.contains(f.name())) {

View File

@@ -92,7 +92,6 @@ private:
gms::feature _per_table_partitioners_feature;
gms::feature _per_table_caching_feature;
gms::feature _digest_for_null_values_feature;
gms::feature _alternator_streams_feature;
public:
bool cluster_supports_user_defined_functions() const {
@@ -161,10 +160,6 @@ public:
bool cluster_supports_lwt() const {
return bool(_lwt_feature);
}
bool cluster_supports_alternator_streams() const {
return bool(_alternator_streams_feature);
}
};
} // namespace gms

View File

@@ -1774,8 +1774,6 @@ future<> gossiper::do_shadow_round(std::unordered_set<gms::inet_address> nodes,
}).handle_exception_type([node, &fall_back_to_syn_msg] (seastar::rpc::unknown_verb_error&) {
logger.warn("Node {} does not support get_endpoint_states verb", node);
fall_back_to_syn_msg = true;
}).handle_exception_type([node, &nodes_down] (seastar::rpc::timeout_error&) {
logger.warn("The get_endpoint_states verb to node {} was timeout", node);
}).handle_exception_type([node, &nodes_down] (seastar::rpc::closed_error&) {
nodes_down++;
logger.warn("Node {} is down for get_endpoint_states verb", node);

View File

@@ -62,7 +62,7 @@ struct appending_hash;
template<typename H, typename T, typename... Args>
requires Hasher<H>
inline
void feed_hash(H& h, const T& value, Args&&... args) noexcept(noexcept(std::declval<appending_hash<T>>()(h, value, args...))) {
void feed_hash(H& h, const T& value, Args&&... args) noexcept {
appending_hash<T>()(h, value, std::forward<Args>(args)...);
};

View File

@@ -28,6 +28,7 @@ else
fi
debian_base_packages=(
clang
liblua5.3-dev
python3-pyparsing
python3-colorama
@@ -44,6 +45,7 @@ debian_base_packages=(
)
fedora_packages=(
clang
lua-devel
yaml-cpp-devel
thrift-devel
@@ -57,6 +59,7 @@ fedora_packages=(
python
sudo
java-1.8.0-openjdk-headless
java-1.8.0-openjdk-devel
ant
ant-junit
maven

View File

@@ -106,6 +106,7 @@ adjust_bin() {
"$root/$prefix/libexec/$bin"
cat > "$root/$prefix/bin/$bin" <<EOF
#!/bin/bash -e
[[ -z "\$LD_PRELOAD" ]] || { echo "\$0: not compatible with LD_PRELOAD" >&2; exit 110; }
export GNUTLS_SYSTEM_PRIORITY_FILE="\${GNUTLS_SYSTEM_PRIORITY_FILE-$prefix/libreloc/gnutls.config}"
export LD_LIBRARY_PATH="$prefix/libreloc"
exec -a "\$0" "$prefix/libexec/$bin" "\$@"
@@ -130,6 +131,7 @@ relocate_python3() {
cp "$script" "$relocateddir"
cat > "$install"<<EOF
#!/usr/bin/env bash
[[ -z "\$LD_PRELOAD" ]] || { echo "\$0: not compatible with LD_PRELOAD" >&2; exit 110; }
export LC_ALL=en_US.UTF-8
x="\$(readlink -f "\$0")"
b="\$(basename "\$x")"
@@ -142,15 +144,11 @@ DEBIAN_SSL_CERT_FILE="/etc/ssl/certs/ca-certificates.crt"
if [ -f "\${DEBIAN_SSL_CERT_FILE}" ]; then
c=\${DEBIAN_SSL_CERT_FILE}
fi
PYTHONPATH="\${d}:\${d}/libexec:\$PYTHONPATH" PATH="\${d}/../bin:\${d}/$pythonpath:\${PATH}" SSL_CERT_FILE="\${c}" exec -a "\$0" "\${d}/libexec/\${b}" "\$@"
PYTHONPATH="\${d}:\${d}/libexec:\$PYTHONPATH" PATH="\${d}/$pythonpath:\${PATH}" SSL_CERT_FILE="\${c}" exec -a "\$0" "\${d}/libexec/\${b}" "\$@"
EOF
chmod +x "$install"
}
install() {
command install -Z "$@"
}
installconfig() {
local perm="$1"
local src="$2"
@@ -201,13 +199,13 @@ if [ -z "$python3" ]; then
fi
rpython3=$(realpath -m "$root/$python3")
if ! $nonroot; then
retc=$(realpath -m "$root/etc")
rsysconfdir=$(realpath -m "$root/$sysconfdir")
rusr=$(realpath -m "$root/usr")
rsystemd=$(realpath -m "$rusr/lib/systemd/system")
retc="$root/etc"
rsysconfdir="$root/$sysconfdir"
rusr="$root/usr"
rsystemd="$rusr/lib/systemd/system"
rdoc="$rprefix/share/doc"
rdata=$(realpath -m "$root/var/lib/scylla")
rhkdata=$(realpath -m "$root/var/lib/scylla-housekeeping")
rdata="$root/var/lib/scylla"
rhkdata="$root/var/lib/scylla-housekeeping"
else
retc="$rprefix/etc"
rsysconfdir="$rprefix/$sysconfdir"
@@ -416,10 +414,6 @@ elif ! $packaging; then
chown -R scylla:scylla $rdata
chown -R scylla:scylla $rhkdata
for file in dist/common/sysctl.d/*.conf; do
bn=$(basename "$file")
sysctl -p "$rusr"/lib/sysctl.d/"$bn"
done
$rprefix/scripts/scylla_post_install.sh
echo "Scylla offline install completed."
fi

View File

@@ -1023,7 +1023,8 @@ int main(int ac, char** av) {
proxy.invoke_on_all([] (service::storage_proxy& local_proxy) {
auto& ss = service::get_local_storage_service();
ss.register_subscriber(&local_proxy);
return local_proxy.start_hints_manager(gms::get_local_gossiper().shared_from_this(), ss.shared_from_this());
//FIXME: discarded future
(void)local_proxy.start_hints_manager(gms::get_local_gossiper().shared_from_this(), ss.shared_from_this());
}).get();
supervisor::notify("starting messaging service");

View File

@@ -1151,9 +1151,6 @@ flat_mutation_reader evictable_reader::recreate_reader() {
_range_override.reset();
_slice_override.reset();
_drop_partition_start = false;
_drop_static_row = false;
if (_last_pkey) {
bool partition_range_is_inclusive = true;
@@ -1239,25 +1236,13 @@ void evictable_reader::maybe_validate_partition_start(const flat_mutation_reader
// is in range.
if (_last_pkey) {
const auto cmp_res = tri_cmp(*_last_pkey, ps.key());
if (_drop_partition_start) { // we expect to continue from the same partition
// We cannot assume the partition we stopped the read at is still alive
// when we recreate the reader. It might have been compacted away in the
// meanwhile, so allow for a larger partition too.
if (_drop_partition_start) { // should be the same partition
require(
cmp_res <= 0,
"{}(): validation failed, expected partition with key larger or equal to _last_pkey {} due to _drop_partition_start being set, but got {}",
cmp_res == 0,
"{}(): validation failed, expected partition with key equal to _last_pkey {} due to _drop_partition_start being set, but got {}",
__FUNCTION__,
*_last_pkey,
ps.key());
// Reset drop flags and next pos if we are not continuing from the same partition
if (cmp_res < 0) {
// Close previous partition, we are not going to continue it.
push_mutation_fragment(*_schema, _permit, partition_end{});
_drop_partition_start = false;
_drop_static_row = false;
_next_position_in_partition = position_in_partition::for_partition_start();
_trim_range_tombstones = false;
}
} else { // should be a larger partition
require(
cmp_res < 0,
@@ -1308,14 +1293,9 @@ bool evictable_reader::should_drop_fragment(const mutation_fragment& mf) {
_drop_partition_start = false;
return true;
}
// Unlike partition-start above, a partition is not guaranteed to have a
// static row fragment. So reset the flag regardless of whether we could
// drop one or not.
// We are guaranteed to get here only right after dropping a partition-start,
// so if we are not seeing a static row here, the partition doesn't have one.
if (_drop_static_row) {
_drop_static_row = false;
return mf.is_static_row();
if (_drop_static_row && mf.is_static_row()) {
_drop_static_row = false;
return true;
}
return false;
}
@@ -2064,13 +2044,11 @@ public:
}
}
void abort(std::exception_ptr ep) {
_end_of_stream = true;
_ex = std::move(ep);
if (_full) {
_full->set_exception(_ex);
_full.reset();
} else if (_not_full) {
_not_full->set_exception(_ex);
_not_full.reset();
}
}
};

View File

@@ -1,52 +0,0 @@
/*
* Copyright (C) 2021 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "feed_writers.hh"
namespace mutation_writer {
bucket_writer::bucket_writer(schema_ptr schema, std::pair<flat_mutation_reader, queue_reader_handle> queue_reader, reader_consumer& consumer)
: _schema(schema)
, _handle(std::move(queue_reader.second))
, _consume_fut(consumer(std::move(queue_reader.first)))
{ }
bucket_writer::bucket_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer)
: bucket_writer(schema, make_queue_reader(schema, std::move(permit)), consumer)
{ }
future<> bucket_writer::consume(mutation_fragment mf) {
return _handle.push(std::move(mf));
}
void bucket_writer::consume_end_of_stream() {
_handle.push_end_of_stream();
}
void bucket_writer::abort(std::exception_ptr ep) noexcept {
_handle.abort(std::move(ep));
}
future<> bucket_writer::close() noexcept {
return std::move(_consume_fut);
}
} // mutation_writer

View File

@@ -22,31 +22,10 @@
#pragma once
#include "flat_mutation_reader.hh"
#include "mutation_reader.hh"
namespace mutation_writer {
using reader_consumer = noncopyable_function<future<> (flat_mutation_reader)>;
class bucket_writer {
schema_ptr _schema;
queue_reader_handle _handle;
future<> _consume_fut;
private:
bucket_writer(schema_ptr schema, std::pair<flat_mutation_reader, queue_reader_handle> queue_reader, reader_consumer& consumer);
public:
bucket_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer);
future<> consume(mutation_fragment mf);
void consume_end_of_stream();
void abort(std::exception_ptr ep) noexcept;
future<> close() noexcept;
};
template <typename Writer>
requires MutationFragmentConsumer<Writer, future<>>
future<> feed_writer(flat_mutation_reader&& rd, Writer&& wr) {
@@ -57,22 +36,8 @@ future<> feed_writer(flat_mutation_reader&& rd, Writer&& wr) {
auto f2 = rd.is_buffer_empty() ? rd.fill_buffer(db::no_timeout) : make_ready_future<>();
return when_all_succeed(std::move(f1), std::move(f2)).discard_result();
});
}).then_wrapped([&wr] (future<> f) {
if (f.failed()) {
auto ex = f.get_exception();
wr.abort(ex);
return wr.close().then_wrapped([ex = std::move(ex)] (future<> f) mutable {
if (f.failed()) {
// The consumer is expected to fail when aborted,
// so just ignore any exception.
(void)f.get_exception();
}
return make_exception_future<>(std::move(ex));
});
} else {
wr.consume_end_of_stream();
return wr.close();
}
}).finally([&wr] {
return wr.consume_end_of_stream();
});
});
}

View File

@@ -31,7 +31,33 @@
namespace mutation_writer {
class shard_based_splitting_mutation_writer {
using shard_writer = bucket_writer;
class shard_writer {
queue_reader_handle _handle;
future<> _consume_fut;
private:
shard_writer(schema_ptr schema, std::pair<flat_mutation_reader, queue_reader_handle> queue_reader, reader_consumer& consumer)
: _handle(std::move(queue_reader.second))
, _consume_fut(consumer(std::move(queue_reader.first))) {
}
public:
shard_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer)
: shard_writer(schema, make_queue_reader(schema, std::move(permit)), consumer) {
}
future<> consume(mutation_fragment mf) {
return _handle.push(std::move(mf));
}
future<> consume_end_of_stream() {
// consume_end_of_stream is always called from a finally block,
// and that's because we wait for _consume_fut to return. We
// don't want to generate another exception here if the read was
// aborted.
if (!_handle.is_terminated()) {
_handle.push_end_of_stream();
}
return std::move(_consume_fut);
}
};
private:
schema_ptr _schema;
@@ -76,23 +102,12 @@ public:
return write_to_shard(mutation_fragment(*_schema, _permit, std::move(pe)));
}
void consume_end_of_stream() {
for (auto& shard : _shards) {
if (shard) {
shard->consume_end_of_stream();
}
}
}
void abort(std::exception_ptr ep) {
for (auto&& shard : _shards) {
if (shard) {
shard->abort(ep);
}
}
}
future<> close() noexcept {
future<> consume_end_of_stream() {
return parallel_for_each(_shards, [] (std::optional<shard_writer>& shard) {
return shard ? shard->close() : make_ready_future<>();
if (!shard) {
return make_ready_future<>();
}
return shard->consume_end_of_stream();
});
}
};

View File

@@ -109,12 +109,22 @@ small_flat_map<Key, Value, Size>::find(const key_type& k) {
class timestamp_based_splitting_mutation_writer {
using bucket_id = int64_t;
class timestamp_bucket_writer : public bucket_writer {
class bucket_writer {
schema_ptr _schema;
queue_reader_handle _handle;
future<> _consume_fut;
bool _has_current_partition = false;
private:
bucket_writer(schema_ptr schema, std::pair<flat_mutation_reader, queue_reader_handle> queue_reader, reader_consumer& consumer)
: _schema(std::move(schema))
, _handle(std::move(queue_reader.second))
, _consume_fut(consumer(std::move(queue_reader.first))) {
}
public:
timestamp_bucket_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer)
: bucket_writer(schema, std::move(permit), consumer) {
bucket_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer)
: bucket_writer(schema, make_queue_reader(schema, std::move(permit)), consumer) {
}
void set_has_current_partition() {
_has_current_partition = true;
@@ -125,6 +135,15 @@ class timestamp_based_splitting_mutation_writer {
bool has_current_partition() const {
return _has_current_partition;
}
future<> consume(mutation_fragment mf) {
return _handle.push(std::move(mf));
}
future<> consume_end_of_stream() {
if (!_handle.is_terminated()) {
_handle.push_end_of_stream();
}
return std::move(_consume_fut);
}
};
private:
@@ -133,7 +152,7 @@ private:
classify_by_timestamp _classifier;
reader_consumer _consumer;
partition_start _current_partition_start;
std::unordered_map<bucket_id, timestamp_bucket_writer> _buckets;
std::unordered_map<bucket_id, bucket_writer> _buckets;
std::vector<bucket_id> _buckets_used_for_current_partition;
private:
@@ -164,19 +183,9 @@ public:
future<> consume(range_tombstone&& rt);
future<> consume(partition_end&& pe);
void consume_end_of_stream() {
for (auto& b : _buckets) {
b.second.consume_end_of_stream();
}
}
void abort(std::exception_ptr ep) {
for (auto&& b : _buckets) {
b.second.abort(ep);
}
}
future<> close() noexcept {
return parallel_for_each(_buckets, [] (std::pair<const bucket_id, timestamp_bucket_writer>& b) {
return b.second.close();
future<> consume_end_of_stream() {
return parallel_for_each(_buckets, [] (std::pair<const bucket_id, bucket_writer>& bucket) {
return bucket.second.consume_end_of_stream();
});
}
};

View File

@@ -205,10 +205,6 @@ public:
auto to_block = std::min(_used_memory - _blocked_bytes, n);
_blocked_bytes += to_block;
stop = (_limiter->update_and_check(to_block) && _stop_on_global_limit) || stop;
if (stop && !_short_read_allowed) {
// If we are here we stopped because of the global limit.
throw std::runtime_error("Maximum amount of memory for building query results is exhausted, unpaged query cannot be finished");
}
}
return stop;
}

View File

@@ -75,7 +75,7 @@ class reader_permit::impl : public boost::intrusive::list_base_hook<boost::intru
sstring _op_name;
std::string_view _op_name_view;
reader_resources _resources;
reader_permit::state _state = reader_permit::state::active;
reader_permit::state _state = reader_permit::state::registered;
public:
struct value_tag {};
@@ -123,17 +123,22 @@ public:
}
void on_admission() {
_state = reader_permit::state::active;
_state = reader_permit::state::admitted;
_semaphore.consume(_resources);
}
void consume(reader_resources res) {
_resources += res;
_semaphore.consume(res);
if (_state == reader_permit::state::admitted) {
_semaphore.consume(res);
}
}
void signal(reader_resources res) {
_resources -= res;
_semaphore.signal(res);
if (_state == reader_permit::state::admitted) {
_semaphore.signal(res);
}
}
reader_resources resources() const {
@@ -200,11 +205,14 @@ reader_resources reader_permit::consumed_resources() const {
std::ostream& operator<<(std::ostream& os, reader_permit::state s) {
switch (s) {
case reader_permit::state::registered:
os << "registered";
break;
case reader_permit::state::waiting:
os << "waiting";
break;
case reader_permit::state::active:
os << "active";
case reader_permit::state::admitted:
os << "admitted";
break;
}
return os;
@@ -241,7 +249,7 @@ struct permit_group_key_hash {
using permit_groups = std::unordered_map<permit_group_key, permit_stats, permit_group_key_hash>;
static permit_stats do_dump_reader_permit_diagnostics(std::ostream& os, const permit_groups& permits, reader_permit::state state) {
static permit_stats do_dump_reader_permit_diagnostics(std::ostream& os, const permit_groups& permits, reader_permit::state state, bool sort_by_memory) {
struct permit_summary {
const schema* s;
std::string_view op_name;
@@ -257,17 +265,25 @@ static permit_stats do_dump_reader_permit_diagnostics(std::ostream& os, const pe
}
}
std::ranges::sort(permit_summaries, [] (const permit_summary& a, const permit_summary& b) {
return a.memory < b.memory;
std::ranges::sort(permit_summaries, [sort_by_memory] (const permit_summary& a, const permit_summary& b) {
if (sort_by_memory) {
return a.memory < b.memory;
} else {
return a.count < b.count;
}
});
permit_stats total;
auto print_line = [&os] (auto col1, auto col2, auto col3) {
fmt::print(os, "{}\t{}\t{}\n", col2, col1, col3);
auto print_line = [&os, sort_by_memory] (auto col1, auto col2, auto col3) {
if (sort_by_memory) {
fmt::print(os, "{}\t{}\t{}\n", col2, col1, col3);
} else {
fmt::print(os, "{}\t{}\t{}\n", col1, col2, col3);
}
};
fmt::print(os, "Permits with state {}\n", state);
fmt::print(os, "Permits with state {}, sorted by {}\n", state, sort_by_memory ? "memory" : "count");
print_line("count", "memory", "name");
for (const auto& summary : permit_summaries) {
total.count += summary.count;
@@ -293,9 +309,11 @@ static void do_dump_reader_permit_diagnostics(std::ostream& os, const reader_con
permit_stats total;
fmt::print(os, "Semaphore {}: {}, dumping permit diagnostics:\n", semaphore.name(), problem);
total += do_dump_reader_permit_diagnostics(os, permits, reader_permit::state::active);
total += do_dump_reader_permit_diagnostics(os, permits, reader_permit::state::admitted, true);
fmt::print(os, "\n");
total += do_dump_reader_permit_diagnostics(os, permits, reader_permit::state::waiting);
total += do_dump_reader_permit_diagnostics(os, permits, reader_permit::state::waiting, false);
fmt::print(os, "\n");
total += do_dump_reader_permit_diagnostics(os, permits, reader_permit::state::registered, false);
fmt::print(os, "\n");
fmt::print(os, "Total: permits: {}, memory: {}\n", total.count, utils::to_hr_size(total.memory));
}
@@ -356,7 +374,7 @@ reader_concurrency_semaphore::~reader_concurrency_semaphore() {
reader_concurrency_semaphore::inactive_read_handle reader_concurrency_semaphore::register_inactive_read(std::unique_ptr<inactive_read> ir) {
// Implies _inactive_reads.empty(), we don't queue new readers before
// evicting all inactive reads.
if (_wait_list.empty() && _resources.memory > 0) {
if (_wait_list.empty()) {
const auto [it, _] = _inactive_reads.emplace(_next_id++, std::move(ir));
(void)_;
++_stats.inactive_reads;
@@ -406,13 +424,13 @@ bool reader_concurrency_semaphore::try_evict_one_inactive_read() {
}
bool reader_concurrency_semaphore::has_available_units(const resources& r) const {
// Special case: when there is no active reader (based on count) admit one
// regardless of availability of memory.
return (bool(_resources) && _resources >= r) || _resources.count == _initial_resources.count;
return bool(_resources) && _resources >= r;
}
bool reader_concurrency_semaphore::may_proceed(const resources& r) const {
return _wait_list.empty() && has_available_units(r);
// Special case: when there is no active reader (based on count) admit one
// regardless of availability of memory.
return _wait_list.empty() && (has_available_units(r) || _resources.count == _initial_resources.count);
}
future<reader_permit::resource_units> reader_concurrency_semaphore::do_wait_admission(reader_permit permit, size_t memory,
@@ -462,12 +480,6 @@ void reader_concurrency_semaphore::broken(std::exception_ptr ex) {
}
}
std::string reader_concurrency_semaphore::dump_diagnostics() const {
std::ostringstream os;
do_dump_reader_permit_diagnostics(os, *this, *_permit_list, "user request");
return os.str();
}
// A file that tracks the memory usage of buffers resulting from read
// operations.
class tracking_file_impl : public file_impl {

View File

@@ -231,6 +231,4 @@ public:
}
void broken(std::exception_ptr ex);
std::string dump_diagnostics() const;
};

View File

@@ -91,8 +91,9 @@ public:
class resource_units;
enum class state {
registered, // read is registered, but didn't attempt admission yet
waiting, // waiting for admission
active,
admitted,
};
class impl;

View File

@@ -309,7 +309,7 @@ float node_ops_metrics::repair_finished_percentage() {
tracker::tracker(size_t nr_shards, size_t max_repair_memory)
: _shutdown(false)
, _repairs(nr_shards) {
auto nr = std::max(size_t(1), size_t(max_repair_memory / max_repair_memory_per_range() / 4));
auto nr = std::max(size_t(1), size_t(max_repair_memory / max_repair_memory_per_range()));
rlogger.info("Setting max_repair_memory={}, max_repair_memory_per_range={}, max_repair_ranges_in_parallel={}",
max_repair_memory, max_repair_memory_per_range(), nr);
_range_parallelism_semaphores.reserve(nr_shards);

View File

@@ -571,7 +571,7 @@ public:
_mq[node_idx] = std::move(queue_handle);
auto writer = shared_from_this();
_writer_done[node_idx] = mutation_writer::distribute_reader_and_consume_on_shards(_schema, std::move(queue_reader),
[&db, reason = this->_reason, estimated_partitions = this->_estimated_partitions] (flat_mutation_reader reader) {
[&db, reason = this->_reason, estimated_partitions = this->_estimated_partitions, writer] (flat_mutation_reader reader) {
auto& t = db.local().find_column_family(reader.schema());
return db::view::check_needs_view_update_path(_sys_dist_ks->local(), t, reason).then([t = t.shared_from_this(), estimated_partitions, reader = std::move(reader)] (bool use_view_update_path) mutable {
//FIXME: for better estimations this should be transmitted from remote

View File

@@ -1263,9 +1263,7 @@ flat_mutation_reader cache_entry::read(row_cache& rc, read_context& reader, row_
// Assumes reader is in the corresponding partition
flat_mutation_reader cache_entry::do_read(row_cache& rc, read_context& reader) {
auto snp = _pe.read(rc._tracker.region(), rc._tracker.cleaner(), _schema, &rc._tracker, reader.phase());
auto ckr = with_linearized_managed_bytes([&] {
return query::clustering_key_filter_ranges::get_ranges(*_schema, reader.slice(), _key.key());
});
auto ckr = query::clustering_key_filter_ranges::get_ranges(*_schema, reader.slice(), _key.key());
auto r = make_cache_flat_mutation_reader(_schema, _key, std::move(ckr), rc, reader.shared_from_this(), std::move(snp));
r.upgrade_schema(rc.schema());
r.upgrade_schema(reader.schema());

View File

@@ -456,9 +456,6 @@ schema::schema(const schema& o)
rebuild();
if (o.is_view()) {
_view_info = std::make_unique<::view_info>(*this, o.view_info()->raw());
if (o.view_info()->base_info()) {
_view_info->set_base_info(o.view_info()->base_info());
}
}
}
@@ -862,7 +859,7 @@ std::ostream& schema::describe(database& db, std::ostream& os) const {
os << "}";
os << "\n AND comment = '" << comment()<< "'";
os << "\n AND compaction = {'class': '" << sstables::compaction_strategy::name(compaction_strategy()) << "'";
map_as_cql_param(os, compaction_strategy_options(), false) << "}";
map_as_cql_param(os, compaction_strategy_options()) << "}";
os << "\n AND compression = {";
map_as_cql_param(os, get_compressor_params().get_options());
os << "}";

View File

@@ -24,7 +24,6 @@
#include "schema_registry.hh"
#include "log.hh"
#include "db/schema_tables.hh"
#include "view_info.hh"
static logging::logger slogger("schema_registry");
@@ -275,43 +274,22 @@ global_schema_ptr::global_schema_ptr(global_schema_ptr&& o) noexcept {
assert(o._cpu_of_origin == current);
_ptr = std::move(o._ptr);
_cpu_of_origin = current;
_base_schema = std::move(o._base_schema);
}
schema_ptr global_schema_ptr::get() const {
if (this_shard_id() == _cpu_of_origin) {
return _ptr;
} else {
auto registered_schema = [](const schema_registry_entry& e) {
schema_ptr ret = local_schema_registry().get_or_null(e.version());
if (!ret) {
ret = local_schema_registry().get_or_load(e.version(), [&e](table_schema_version) {
return e.frozen();
});
}
return ret;
};
schema_ptr registered_bs;
// the following code contains registry entry dereference of a foreign shard
// however, it is guarantied to succeed since we made sure in the constructor
// that _bs_schema and _ptr will have a registry on the foreign shard where this
// object originated so as long as this object lives the registry entries lives too
// and it is safe to reference them on foreign shards.
if (_base_schema) {
registered_bs = registered_schema(*_base_schema->registry_entry());
if (_base_schema->registry_entry()->is_synced()) {
registered_bs->registry_entry()->mark_synced();
}
// 'e' points to a foreign entry, but we know it won't be evicted
// because _ptr is preventing this.
const schema_registry_entry& e = *_ptr->registry_entry();
schema_ptr s = local_schema_registry().get_or_null(e.version());
if (!s) {
s = local_schema_registry().get_or_load(e.version(), [&e](table_schema_version) {
return e.frozen();
});
}
schema_ptr s = registered_schema(*_ptr->registry_entry());
if (s->is_view()) {
if (!s->view_info()->base_info()) {
// we know that registered_bs is valid here because we make sure of it in the constructors.
s->view_info()->set_base_info(s->view_info()->make_base_dependent_view_info(*registered_bs));
}
}
if (_ptr->registry_entry()->is_synced()) {
if (e.is_synced()) {
s->registry_entry()->mark_synced();
}
return s;
@@ -319,33 +297,16 @@ schema_ptr global_schema_ptr::get() const {
}
global_schema_ptr::global_schema_ptr(const schema_ptr& ptr)
: _cpu_of_origin(this_shard_id()) {
// _ptr must always have an associated registry entry,
// if ptr doesn't, we need to load it into the registry.
auto ensure_registry_entry = [] (const schema_ptr& s) {
schema_registry_entry* e = s->registry_entry();
: _ptr([&ptr]() {
// _ptr must always have an associated registry entry,
// if ptr doesn't, we need to load it into the registry.
schema_registry_entry* e = ptr->registry_entry();
if (e) {
return s;
} else {
return local_schema_registry().get_or_load(s->version(), [&s] (table_schema_version) {
return frozen_schema(s);
return ptr;
}
return local_schema_registry().get_or_load(ptr->version(), [&ptr] (table_schema_version) {
return frozen_schema(ptr);
});
}
};
schema_ptr s = ensure_registry_entry(ptr);
if (s->is_view()) {
if (s->view_info()->base_info()) {
_base_schema = ensure_registry_entry(s->view_info()->base_info()->base_schema());
} else if (ptr->view_info()->base_info()) {
_base_schema = ensure_registry_entry(ptr->view_info()->base_info()->base_schema());
} else {
on_internal_error(slogger, format("Tried to build a global schema for view {}.{} with an uninitialized base info", s->ks_name(), s->cf_name()));
}
if (!s->view_info()->base_info() || !s->view_info()->base_info()->base_schema()->registry_entry()) {
s->view_info()->set_base_info(s->view_info()->make_base_dependent_view_info(*_base_schema));
}
}
_ptr = s;
}
}())
, _cpu_of_origin(this_shard_id())
{ }

View File

@@ -165,7 +165,6 @@ schema_registry& local_schema_registry();
// chain will last.
class global_schema_ptr {
schema_ptr _ptr;
schema_ptr _base_schema;
unsigned _cpu_of_origin;
public:
// Note: the schema_ptr must come from the current shard and can't be nullptr.

View File

@@ -154,7 +154,9 @@ ar.reloc_add('licenses')
ar.reloc_add('swagger-ui')
ar.reloc_add('api')
def exclude_submodules(tarinfo):
if tarinfo.name in ('scylla/tools/jmx', 'scylla/tools/java'):
if tarinfo.name in ('scylla/tools/jmx',
'scylla/tools/java',
'scylla/tools/python3'):
return None
return tarinfo
ar.reloc_add('tools', filter=exclude_submodules)

Submodule seastar updated: 5ef45afa4d...6973080cd1

View File

@@ -118,6 +118,7 @@ private:
private volatile String keyspace;
#endif
std::optional<auth::authenticated_user> _user;
std::optional<sstring> _driver_name, _driver_version;
auth_state _auth_state = auth_state::UNINITIALIZED;
@@ -147,6 +148,20 @@ public:
_auth_state = new_state;
}
std::optional<sstring> get_driver_name() const {
return _driver_name;
}
void set_driver_name(sstring driver_name) {
_driver_name = std::move(driver_name);
}
std::optional<sstring> get_driver_version() const {
return _driver_version;
}
void set_driver_version(sstring driver_version) {
_driver_version = std::move(driver_version);
}
client_state(external_tag, auth::service& auth_service, const socket_address& remote_address = socket_address(), bool thrift = false)
: _is_internal(false)
, _is_thrift(thrift)

View File

@@ -53,7 +53,6 @@
#include "database.hh"
#include "db/schema_tables.hh"
#include "types/user.hh"
#include "db/schema_tables.hh"
namespace service {
@@ -1097,19 +1096,8 @@ future<schema_ptr> get_schema_definition(table_schema_version v, netw::messaging
// referenced by the incoming request.
// That means the column mapping for the schema should always be inserted
// with TTL (refresh TTL in case column mapping already existed prior to that).
auto us = s.unfreeze(db::schema_ctxt(proxy));
// if this is a view - we might need to fix it's schema before registering it.
if (us->is_view()) {
auto& db = proxy.local().local_db();
schema_ptr base_schema = db.find_schema(us->view_info()->base_id());
auto fixed_view = db::schema_tables::maybe_fix_legacy_secondary_index_mv_schema(db, view_ptr(us), base_schema,
db::schema_tables::preserve_version::yes);
if (fixed_view) {
us = fixed_view;
}
}
return db::schema_tables::store_column_mapping(proxy, us, true).then([us] {
return frozen_schema{us};
return db::schema_tables::store_column_mapping(proxy, s.unfreeze(db::schema_ctxt(proxy)), true).then([s] {
return s;
});
});
}).then([] (schema_ptr s) {
@@ -1117,7 +1105,7 @@ future<schema_ptr> get_schema_definition(table_schema_version v, netw::messaging
// table.
if (s->is_view()) {
if (!s->view_info()->base_info()) {
auto& db = service::get_local_storage_proxy().local_db();
auto& db = service::get_local_storage_proxy().get_db().local();
// This line might throw a no_such_column_family
// It should be fine since if we tried to register a view for which
// we don't know the base table, our registry is broken.

View File

@@ -3624,11 +3624,6 @@ protected:
public:
virtual future<foreign_ptr<lw_shared_ptr<query::result>>> execute(storage_proxy::clock_type::time_point timeout) {
if (_targets.empty()) {
// We may have no targets to read from if a DC with zero replication is queried with LOCACL_QUORUM.
// Return an empty result in this case
return make_ready_future<foreign_ptr<lw_shared_ptr<query::result>>>(make_foreign(make_lw_shared(query::result())));
}
digest_resolver_ptr digest_resolver = ::make_shared<digest_read_resolver>(_schema, _cl, _block_for,
db::is_datacenter_local(_cl) ? db::count_local_endpoints(_targets): _targets.size(), timeout);
auto exec = shared_from_this();
@@ -4938,12 +4933,10 @@ void storage_proxy::init_messaging_service() {
tracing::trace(trace_state_ptr, "read_data: message received from /{}", src_addr.addr);
}
auto da = oda.value_or(query::digest_algorithm::MD5);
auto sp = get_local_shared_storage_proxy();
if (!cmd.max_result_size) {
auto& cfg = sp->_db.local().get_config();
cmd.max_result_size.emplace(cfg.max_memory_for_unlimited_query_soft_limit(), cfg.max_memory_for_unlimited_query_hard_limit());
cmd.max_result_size.emplace(cinfo.retrieve_auxiliary<uint64_t>("max_result_size"));
}
return do_with(std::move(pr), std::move(sp), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr), da, t] (::compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
return do_with(std::move(pr), get_local_shared_storage_proxy(), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr), da, t] (::compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
p->get_stats().replica_data_reads++;
auto src_ip = src_addr.addr;
return get_schema_for_read(cmd->schema_version, std::move(src_addr), p->_messaging).then([cmd, da, &pr, &p, &trace_state_ptr, t] (schema_ptr s) {

View File

@@ -446,12 +446,6 @@ public:
distributed<database>& get_db() {
return _db;
}
const database& local_db() const noexcept {
return _db.local();
}
database& local_db() noexcept {
return _db.local();
}
void set_cdc_service(cdc::cdc_service* cdc) {
_cdc = cdc;

View File

@@ -298,7 +298,7 @@ void storage_service::prepare_to_join(
_token_metadata.update_normal_tokens(my_tokens, get_broadcast_address());
_cdc_streams_ts = db::system_keyspace::get_saved_cdc_streams_timestamp().get0();
if (!_cdc_streams_ts) {
if (!_cdc_streams_ts && db().local().get_config().check_experimental(db::experimental_features_t::CDC)) {
// We could not have completed joining if we didn't generate and persist a CDC streams timestamp,
// unless we are restarting after upgrading from non-CDC supported version.
// In that case we won't begin a CDC generation: it should be done by one of the nodes
@@ -550,7 +550,7 @@ void storage_service::join_token_ring(int delay) {
assert(should_bootstrap() || db().local().is_replacing() || !_cdc_streams_ts);
}
if (!_cdc_streams_ts) {
if (!_cdc_streams_ts && db().local().get_config().check_experimental(db::experimental_features_t::CDC)) {
// If we didn't choose a CDC streams timestamp at this point, then either
// 1. we're replacing a node which didn't gossip a CDC streams timestamp for whatever reason,
// 2. we've already bootstrapped, but are upgrading from a non-CDC version,
@@ -570,15 +570,10 @@ void storage_service::join_token_ring(int delay) {
if (!db().local().is_replacing()
&& (!db::system_keyspace::bootstrap_complete()
|| cdc::should_propose_first_generation(get_broadcast_address(), _gossiper))) {
try {
_cdc_streams_ts = cdc::make_new_cdc_generation(db().local().get_config(),
_bootstrap_tokens, _token_metadata, _gossiper,
_sys_dist_ks.local(), get_ring_delay(), _for_testing);
} catch (...) {
cdc_log.warn(
"Could not create a new CDC generation: {}. This may make it impossible to use CDC. Use nodetool checkAndRepairCdcStreams to fix CDC generation",
std::current_exception());
}
_cdc_streams_ts = cdc::make_new_cdc_generation(db().local().get_config(),
_bootstrap_tokens, _token_metadata, _gossiper,
_sys_dist_ks.local(), get_ring_delay(), _for_testing);
}
}
@@ -898,18 +893,24 @@ void storage_service::bootstrap() {
// It doesn't hurt: other nodes will (potentially) just do more generation switches.
// We do this because with this new attempt at bootstrapping we picked a different set of tokens.
// Update pending ranges now, so we correctly count ourselves as a pending replica
// when inserting the new CDC generation.
_token_metadata.add_bootstrap_tokens(_bootstrap_tokens, get_broadcast_address());
update_pending_ranges().get();
if (db().local().get_config().check_experimental(db::experimental_features_t::CDC)) {
// Update pending ranges now, so we correctly count ourselves as a pending replica
// when inserting the new CDC generation.
_token_metadata.add_bootstrap_tokens(_bootstrap_tokens, get_broadcast_address());
update_pending_ranges().get();
// After we pick a generation timestamp, we start gossiping it, and we stick with it.
// We don't do any other generation switches (unless we crash before complecting bootstrap).
assert(!_cdc_streams_ts);
// After we pick a generation timestamp, we start gossiping it, and we stick with it.
// We don't do any other generation switches (unless we crash before complecting bootstrap).
assert(!_cdc_streams_ts);
_cdc_streams_ts = cdc::make_new_cdc_generation(db().local().get_config(),
_bootstrap_tokens, _token_metadata, _gossiper,
_sys_dist_ks.local(), get_ring_delay(), _for_testing);
_cdc_streams_ts = cdc::make_new_cdc_generation(db().local().get_config(),
_bootstrap_tokens, _token_metadata, _gossiper,
_sys_dist_ks.local(), get_ring_delay(), _for_testing);
} else {
// We should not be able to join the cluster if other nodes support CDC but we don't.
// The check should have been made somewhere in prepare_to_join (`check_knows_remote_features`).
assert(!_feature_service.cluster_supports_cdc());
}
_gossiper.add_local_application_state({
// Order is important: both the CDC streams timestamp and tokens must be known when a node handles our status.
@@ -2035,8 +2036,9 @@ future<> storage_service::start_gossiping(bind_messaging_port do_bind) {
return seastar::async([&ss, do_bind] {
if (!ss._initialized) {
slogger.warn("Starting gossip by operator request");
bool cdc_enabled = ss.db().local().get_config().check_experimental(db::experimental_features_t::CDC);
ss.set_gossip_tokens(db::system_keyspace::get_local_tokens().get0(),
std::make_optional(cdc::get_local_streams_timestamp().get0()));
cdc_enabled ? std::make_optional(cdc::get_local_streams_timestamp().get0()) : std::nullopt);
ss._gossiper.force_newer_generation();
ss._gossiper.start_gossiping(utils::get_generation_number(), gms::bind_messaging_port(bool(do_bind))).then([&ss] {
ss._initialized = true;
@@ -2336,7 +2338,7 @@ future<> storage_service::rebuild(sstring source_dc) {
slogger.info("Streaming for rebuild successful");
}).handle_exception([] (auto ep) {
// This is used exclusively through JMX, so log the full trace but only throw a simple RTE
slogger.warn("Error while rebuilding node: {}", ep);
slogger.warn("Error while rebuilding node: {}", std::current_exception());
return make_exception_future<>(std::move(ep));
});
});

View File

@@ -212,18 +212,16 @@ public:
};
struct compaction_writer {
shared_sstable sst;
// We use a ptr for pointer stability and so that it can be null
// when using a noop monitor.
sstable_writer writer;
// The order in here is important. A monitor must be destroyed before the writer it is monitoring since it has a
// periodic timer that checks the writer.
// The writer must be destroyed before the shared_sstable since the it may depend on the sstable
// (as in the mx::writer over compressed_file_data_sink_impl case that depends on sstables::compression).
std::unique_ptr<compaction_write_monitor> monitor;
shared_sstable sst;
compaction_writer(std::unique_ptr<compaction_write_monitor> monitor, sstable_writer writer, shared_sstable sst)
: sst(std::move(sst)), writer(std::move(writer)), monitor(std::move(monitor)) {}
: writer(std::move(writer)), monitor(std::move(monitor)), sst(std::move(sst)) {}
compaction_writer(sstable_writer writer, shared_sstable sst)
: compaction_writer(nullptr, std::move(writer), std::move(sst)) {}
};
@@ -438,6 +436,7 @@ protected:
mutation_source_metadata _ms_metadata = {};
garbage_collected_sstable_writer::data _gc_sstable_writer_data;
compaction_sstable_replacer_fn _replacer;
std::optional<compaction_weight_registration> _weight_registration;
utils::UUID _run_identifier;
::io_priority_class _io_priority;
// optional clone of sstable set to be used for expiration purposes, so it will be set if expiration is enabled.
@@ -456,6 +455,7 @@ protected:
, _sstable_level(descriptor.level)
, _gc_sstable_writer_data(*this)
, _replacer(std::move(descriptor.replacer))
, _weight_registration(std::move(descriptor.weight_registration))
, _run_identifier(descriptor.run_identifier)
, _io_priority(descriptor.io_priority)
, _sstable_set(std::move(descriptor.all_sstables_snapshot))
@@ -609,12 +609,10 @@ private:
std::move(gc_consumer));
return seastar::async([cfc = std::move(cfc), reader = std::move(reader), this] () mutable {
reader.consume_in_thread(std::move(cfc), db::no_timeout);
reader.consume_in_thread(std::move(cfc), make_partition_filter(), db::no_timeout);
});
});
// producer will filter out a partition before it reaches the consumer(s)
auto producer = make_filtering_reader(make_sstable_reader(), make_partition_filter());
return consumer(std::move(producer));
return consumer(make_sstable_reader());
}
virtual reader_consumer make_interposer_consumer(reader_consumer end_consumer) {
@@ -927,6 +925,9 @@ public:
}
virtual void on_end_of_compaction() override {
if (_weight_registration) {
_cf.get_compaction_manager().on_compaction_complete(*_weight_registration);
}
replace_remaining_exhausted_sstables();
}
private:

View File

@@ -134,6 +134,8 @@ struct compaction_descriptor {
uint64_t max_sstable_bytes;
// Run identifier of output sstables.
utils::UUID run_identifier;
// Holds ownership of a weight assigned to this compaction iff it's a regular one.
std::optional<compaction_weight_registration> weight_registration;
// Calls compaction manager's task for this compaction to release reference to exhausted sstables.
std::function<void(const std::vector<shared_sstable>& exhausted_sstables)> release_exhausted;
// The options passed down to the compaction code.

View File

@@ -311,7 +311,6 @@ future<> compaction_manager::run_custom_job(column_family* cf, sstring name, non
cmlog.info("{} was abruptly stopped, reason: {}", name, e.what());
} catch (...) {
cmlog.error("{} failed: {}", name, std::current_exception());
throw;
}
});
return task->compaction_done.get_future().then([task] {});
@@ -436,7 +435,7 @@ void compaction_manager::reevaluate_postponed_compactions() {
}
void compaction_manager::postpone_compaction_for_column_family(column_family* cf) {
_postponed.insert(cf);
_postponed.push_back(cf);
}
future<> compaction_manager::stop_ongoing_compactions(sstring reason) {
@@ -576,7 +575,7 @@ void compaction_manager::submit(column_family* cf) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
auto compacting = make_lw_shared<compacting_sstable_registration>(this, descriptor.sstables);
auto weight_r = compaction_weight_registration(this, weight);
descriptor.weight_registration = compaction_weight_registration(this, weight);
descriptor.release_exhausted = [compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
compacting->release_compacting(exhausted_sstables);
};
@@ -586,7 +585,7 @@ void compaction_manager::submit(column_family* cf) {
_stats.pending_tasks--;
_stats.active_tasks++;
task->compaction_running = true;
return cf.run_compaction(std::move(descriptor)).then_wrapped([this, task, compacting = std::move(compacting), weight_r = std::move(weight_r)] (future<> f) mutable {
return cf.run_compaction(std::move(descriptor)).then_wrapped([this, task, compacting = std::move(compacting)] (future<> f) mutable {
_stats.active_tasks--;
task->compaction_running = false;
@@ -630,11 +629,10 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
_tasks.push_back(task);
auto sstables = std::make_unique<std::vector<sstables::shared_sstable>>(get_func(*cf));
auto compacting = make_lw_shared<compacting_sstable_registration>(this, *sstables);
auto sstables_ptr = sstables.get();
_stats.pending_tasks += sstables->size();
task->compaction_done = do_until([sstables_ptr] { return sstables_ptr->empty(); }, [this, task, options, sstables_ptr, compacting] () mutable {
task->compaction_done = do_until([sstables_ptr] { return sstables_ptr->empty(); }, [this, task, options, sstables_ptr] () mutable {
// FIXME: lock cf here
if (!can_proceed(task)) {
@@ -644,7 +642,7 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
auto sst = sstables_ptr->back();
sstables_ptr->pop_back();
return repeat([this, task, options, sst = std::move(sst), compacting] () mutable {
return repeat([this, task, options, sst = std::move(sst)] () mutable {
column_family& cf = *task->compacting_cf;
auto sstable_level = sst->get_sstable_level();
auto run_identifier = sst->run_identifier();
@@ -652,22 +650,21 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
auto descriptor = sstables::compaction_descriptor({ sst }, cf.get_sstable_set(), service::get_local_compaction_priority(),
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, options);
auto compacting = make_lw_shared<compacting_sstable_registration>(this, descriptor.sstables);
// Releases reference to cleaned sstable such that respective used disk space can be freed.
descriptor.release_exhausted = [compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
compacting->release_compacting(exhausted_sstables);
};
return with_semaphore(_rewrite_sstables_sem, 1, [this, task, &cf, descriptor = std::move(descriptor)] () mutable {
_stats.pending_tasks--;
_stats.active_tasks++;
task->compaction_running = true;
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor)] (compaction_backlog_tracker& bt) mutable {
return with_scheduling_group(_scheduling_group, [this, &cf, descriptor = std::move(descriptor)]() mutable {
return cf.run_compaction(std::move(descriptor));
});
_stats.pending_tasks--;
_stats.active_tasks++;
task->compaction_running = true;
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor)] (compaction_backlog_tracker& bt) mutable {
return with_scheduling_group(_scheduling_group, [this, &cf, descriptor = std::move(descriptor)] () mutable {
return cf.run_compaction(std::move(descriptor));
});
}).then_wrapped([this, task, compacting] (future<> f) mutable {
}).then_wrapped([this, task, compacting = std::move(compacting)] (future<> f) mutable {
task->compaction_running = false;
_stats.active_tasks--;
if (!can_proceed(task)) {
@@ -799,7 +796,7 @@ future<> compaction_manager::remove(column_family* cf) {
task->stopping = true;
}
}
_postponed.erase(cf);
_postponed.erase(boost::remove(_postponed, cf), _postponed.end());
// Wait for the termination of an ongoing compaction on cf, if any.
return do_for_each(*tasks_to_stop, [this, cf] (auto& task) {
@@ -835,6 +832,11 @@ void compaction_manager::stop_compaction(sstring type) {
}
}
void compaction_manager::on_compaction_complete(compaction_weight_registration& weight_registration) {
weight_registration.deregister();
reevaluate_postponed_compactions();
}
void compaction_manager::propagate_replacement(column_family* cf,
const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added) {
for (auto& info : _compactions) {

View File

@@ -99,7 +99,7 @@ private:
future<> _waiting_reevalution = make_ready_future<>();
condition_variable _postponed_reevaluation;
// column families that wait for compaction but had its submission postponed due to ongoing compaction.
std::unordered_set<column_family*> _postponed;
std::vector<column_family*> _postponed;
// tracks taken weights of ongoing compactions, only one compaction per weight is allowed.
// weight is value assigned to a compaction job that is log base N of total size of all input sstables.
std::unordered_set<int> _weight_tracker;
@@ -111,7 +111,6 @@ private:
std::unordered_map<column_family*, rwlock> _compaction_locks;
semaphore _custom_job_sem{1};
seastar::named_semaphore _rewrite_sstables_sem = {1, named_semaphore_exception_factory{"rewrite sstables"}};
std::function<void()> compaction_submission_callback();
// all registered column families are submitted for compaction at a constant interval.
@@ -256,6 +255,11 @@ public:
// Stops ongoing compaction of a given type.
void stop_compaction(sstring type);
// Called by compaction procedure to release the weight lock assigned to it, such that
// another compaction waiting on same weight can start as soon as possible. That's usually
// called before compaction seals sstable and such and after all compaction work is done.
void on_compaction_complete(compaction_weight_registration& weight_registration);
double backlog() {
return _backlog_manager.backlog();
}

View File

@@ -367,7 +367,6 @@ class index_reader {
const io_priority_class& _pc;
tracing::trace_state_ptr _trace_state;
shared_index_lists _index_lists;
future<> _background_closes = make_ready_future<>();
struct reader {
index_consumer _consumer;
@@ -473,16 +472,6 @@ private:
};
return _index_lists.get_or_load(summary_idx, loader).then([this, &bound, summary_idx] (shared_index_lists::list_ptr ref) {
// to make sure list is not closed when another bound is still using it, index list will only be closed when there's only one owner holding it
if (bound.current_list && bound.current_list.use_count() == 1) {
// a new background close will only be initiated when previous ones terminate, so as to limit the concurrency.
_background_closes = _background_closes.then_wrapped([current_list = std::move(bound.current_list)] (future<>&& f) mutable {
f.ignore_ready_future();
return do_with(std::move(current_list), [] (shared_index_lists::list_ptr& current_list) mutable {
return close_index_list(current_list);
});
});
}
bound.current_list = std::move(ref);
bound.current_summary_idx = summary_idx;
bound.current_index_idx = 0;
@@ -852,8 +841,6 @@ public:
return close_index_list(_upper_bound->current_list);
}
return make_ready_future<>();
}).then([this] () mutable {
return std::move(_background_closes);
});
}
};

View File

@@ -315,8 +315,8 @@ void sstable_writer_k_l::write_collection(file_writer& out, const composite& clu
void sstable_writer_k_l::write_clustered_row(file_writer& out, const schema& schema, const clustering_row& clustered_row) {
auto clustering_key = composite::from_clustering_element(schema, clustered_row.key());
maybe_write_row_tombstone(out, clustering_key, clustered_row);
maybe_write_row_marker(out, schema, clustered_row.marker(), clustering_key);
maybe_write_row_tombstone(out, clustering_key, clustered_row);
_collector.update_min_max_components(clustered_row.key());

View File

@@ -147,7 +147,7 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
unsigned overlapping_sstables = 0;
auto prev_last = dht::ring_position::min();
for (auto& sst : sstables) {
if (dht::ring_position(sst->get_first_decorated_key()).tri_compare(*schema, prev_last) <= 0) {
if (dht::ring_position(sst->get_first_decorated_key()).less_compare(*schema, prev_last)) {
overlapping_sstables++;
}
prev_last = dht::ring_position(sst->get_last_decorated_key());
@@ -178,7 +178,7 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
unsigned max_filled_level = 0;
size_t offstrategy_threshold = (mode == reshape_mode::strict) ? std::max(schema->min_compaction_threshold(), 4) : std::max(schema->max_compaction_threshold(), 32);
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));
auto tolerance = [mode] (unsigned level) -> unsigned {
if (mode == reshape_mode::strict) {
@@ -189,8 +189,10 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
};
if (level_info[0].size() > offstrategy_threshold) {
size_tiered_compaction_strategy stcs(_stcs_options);
return stcs.get_reshaping_job(std::move(level_info[0]), schema, iop, mode);
level_info[0].resize(std::min(level_info[0].size(), max_sstables));
compaction_descriptor desc(std::move(level_info[0]), std::optional<sstables::sstable_set>(), iop);
desc.options = compaction_options::make_reshape();
return desc;
}
for (unsigned level = leveled_manifest::MAX_LEVELS - 1; level > 0; --level) {

View File

@@ -378,7 +378,6 @@ private:
_fwd_end = _fwd ? position_in_partition::before_all_clustered_rows() : position_in_partition::after_all_clustered_rows();
_out_of_range = false;
_range_tombstones.reset();
_ready = {};
_first_row_encountered = false;
}
public:
@@ -1145,11 +1144,7 @@ public:
setup_for_partition(pk);
auto dk = dht::decorate_key(*_schema, pk);
_reader->on_next_partition(std::move(dk), tombstone(deltime));
// Only partition start will be consumed if processing a large run of partition tombstones,
// so let's stop the consumer if buffer is full.
// Otherwise, partition tombstones will keep accumulating in memory till other fragment type
// is found which can stop the consumer (perhaps there's none if sstable is full of tombstones).
return proceed(!_reader->is_buffer_full());
return proceed::yes;
}
virtual consumer_m::row_processing_result consume_row_start(const std::vector<temporary_buffer<char>>& ecp) override {

View File

@@ -256,7 +256,6 @@ size_tiered_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
bucket.resize(std::min(max_sstables, bucket.size()));
compaction_descriptor desc(std::move(bucket), std::optional<sstables::sstable_set>(), iop);
desc.options = compaction_options::make_reshape();
return desc;
}
}

View File

@@ -162,7 +162,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
for (auto& pair : all_buckets.first) {
auto ssts = std::move(pair.second);
if (ssts.size() > offstrategy_threshold) {
ssts.resize(std::min(ssts.size(), max_sstables));
ssts.resize(std::min(multi_window.size(), max_sstables));
compaction_descriptor desc(std::move(ssts), std::optional<sstables::sstable_set>(), iop);
desc.options = compaction_options::make_reshape();
return desc;

View File

@@ -101,8 +101,7 @@ class time_window_compaction_strategy : public compaction_strategy_impl {
time_window_compaction_strategy_options _options;
int64_t _estimated_remaining_tasks = 0;
db_clock::time_point _last_expired_check;
// As timestamp_type is an int64_t, a primitive type, it must be initialized here.
timestamp_type _highest_window_seen = 0;
timestamp_type _highest_window_seen;
// Keep track of all recent active windows that still need to be compacted into a single SSTable
std::unordered_set<timestamp_type> _recent_active_windows;
size_tiered_compaction_strategy_options _stcs_options;

View File

@@ -403,7 +403,7 @@ future<prepare_message> stream_session::prepare(std::vector<stream_request> requ
try {
db.find_column_family(ks, cf);
} catch (no_such_column_family&) {
auto err = format("[Stream #{{}}] prepare requested ks={{}} cf={{}} does not exist", plan_id, ks, cf);
auto err = format("[Stream #{{}}] prepare requested ks={{}} cf={{}} does not exist", ks, cf);
sslog.warn(err.c_str());
throw std::runtime_error(err);
}

View File

@@ -832,7 +832,7 @@ table::stop() {
return make_ready_future<>();
}
return _async_gate.close().then([this] {
return await_pending_ops().finally([this] {
return when_all(await_pending_writes(), await_pending_reads(), await_pending_streams()).discard_result().finally([this] {
return _memtables->request_flush().finally([this] {
return _compaction_manager.remove(this).then([this] {
// Nest, instead of using when_all, so we don't lose any exceptions.
@@ -852,33 +852,6 @@ table::stop() {
});
}
future<std::vector<sstables::entry_descriptor>>
table::reshuffle_sstables(std::set<int64_t> all_generations, int64_t start) {
struct work {
std::set<int64_t> all_generations; // Stores generation of all live sstables in the system.
work(int64_t start, std::set<int64_t> gens)
: all_generations(gens) {}
};
return do_with(work(start, std::move(all_generations)), [this] (work& work) {
tlogger.info("Reshuffling SSTables in {}...", _config.datadir);
return lister::scan_dir(_config.datadir, { directory_entry_type::regular }, [this, &work] (fs::path parent_dir, directory_entry de) {
auto comps = sstables::entry_descriptor::make_descriptor(parent_dir.native(), de.name);
if (comps.component != component_type::TOC) {
return make_ready_future<>();
}
// Skip generations that were already loaded by Scylla at a previous stage.
if (work.all_generations.contains(comps.generation)) {
return make_ready_future<>();
}
return make_exception_future<>(std::runtime_error("Loading SSTables from the main SSTable directory is unsafe and no longer supported."
" You will find a directory called upload/ inside the table directory that can be used to load new SSTables into the system"));
}, &sstables::manifest_json_filter).then([&work] {
return make_ready_future<std::vector<sstables::entry_descriptor>>();
});
});
}
void table::set_metrics() {
auto cf = column_family_label(_schema->cf_name());
auto ks = keyspace_label(_schema->ks_name());
@@ -1532,8 +1505,7 @@ future<std::unordered_map<sstring, table::snapshot_details>> table::get_snapshot
}
future<> table::flush() {
auto op = _pending_flushes_phaser.start();
return _memtables->request_flush().then([op = std::move(op)] {});
return _memtables->request_flush();
}
// FIXME: We can do much better than this in terms of cache management. Right
@@ -1551,10 +1523,6 @@ future<> table::flush_streaming_mutations(utils::UUID plan_id, dht::partition_ra
});
}
bool table::can_flush() const {
return _memtables->can_flush();
}
future<> table::clear() {
if (_commitlog) {
_commitlog->discard_completed_segments(_schema->id());

View File

@@ -80,7 +80,7 @@ def dynamodb(request):
verify = not request.config.getoption('https')
return boto3.resource('dynamodb', endpoint_url=local_url, verify=verify,
region_name='us-east-1', aws_access_key_id='alternator', aws_secret_access_key='secret_pass',
config=botocore.client.Config(retries={"max_attempts": 0}, read_timeout=300))
config=botocore.client.Config(retries={"max_attempts": 3}))
@pytest.fixture(scope="session")
def dynamodbstreams(request):

View File

@@ -86,7 +86,7 @@ ln -s "$SCYLLA" "$SCYLLA_LINK"
--alternator-write-isolation=always_use_lwt \
--alternator-streams-time-window-s=0 \
--developer-mode=1 \
--experimental-features=alternator-streams \
--experimental-features=cdc \
--ring-delay-ms 0 --collectd 0 \
--smp 2 -m 1G \
--overprovisioned --unsafe-bypass-fsync 1 \

Some files were not shown because too many files have changed in this diff Show More