Compare commits

...

152 Commits

Author SHA1 Message Date
Hagit Segev
b0d122f9c5 release: prepare for 3.1.3 2020-01-28 14:09:57 +02:00
Asias He
9a10e4a245 repair: Avoid duplicated partition_end write
Consider this:

1) Write partition_start of p1
2) Write clustering_row of p1
3) Write partition_end of p1
4) Repair is stopped due to error before writing partition_start of p2
5) Repair calls repair_row_level_stop() to tear down, which calls
   wait_for_writer_done(), and a duplicate partition_end is written.

To fix, track the partition_start and partition_end that were written and
avoid unpaired writes.
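The tracking described above can be sketched in a few lines (hypothetical names; the real repair writer is considerably more involved):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the fix: remember whether a partition_start was emitted and
// make the partition_end write idempotent, so a teardown path such as
// wait_for_writer_done() cannot append a duplicate partition_end.
struct repair_writer_sketch {
    std::vector<std::string> log;
    bool partition_open = false;

    void write_partition_start() {
        assert(!partition_open);        // no unpaired partition_start
        partition_open = true;
        log.push_back("partition_start");
    }
    void write_clustering_row() {
        assert(partition_open);
        log.push_back("clustering_row");
    }
    void maybe_write_partition_end() {  // safe to call from teardown
        if (partition_open) {
            log.push_back("partition_end");
            partition_open = false;
        }
    }
};

// Replays the scenario above: p1 fully written, then repair stops with an
// error and teardown calls maybe_write_partition_end() a second time.
inline std::size_t interrupted_repair_stream_length() {
    repair_writer_sketch w;
    w.write_partition_start();
    w.write_clustering_row();
    w.maybe_write_partition_end();      // end of p1
    w.maybe_write_partition_end();      // teardown: no duplicate written
    return w.log.size();
}
```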

Backports: 3.1 and 3.2
Fixes: #5527
(cherry picked from commit 401854dbaf)
2020-01-21 13:39:19 +02:00
Piotr Sarna
871d1ebdd5 view: ignore duplicated key entries in progress virtual reader
The build progress virtual reader uses the Scylla-specific
scylla_views_builds_in_progress table in order to represent
legacy views_builds_in_progress rows. The Scylla-specific table contains
an additional cpu_id clustering key part, which is trimmed before returning
it to the user. That may cause duplicated clustering row fragments to be
emitted by the reader, which may cause undefined behaviour in consumers.
The solution is to keep track of previous clustering keys for each
partition and drop fragments that would cause duplication. That way if
any shard is still building a view, its progress will be returned,
and if many shards are still building, the returned value will indicate
the progress of a single arbitrary shard.
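A minimal sketch of the dedup idea, with hypothetical types standing in for the real fragment stream:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Sketch: after trimming the cpu_id clustering key part, consecutive rows
// can share the same remaining key. Track the previously emitted key and
// drop fragments that would duplicate it, so one arbitrary shard's
// progress is returned for each key.
inline std::vector<std::string>
trim_and_dedup(const std::vector<std::pair<std::string, int>>& rows) {
    std::vector<std::string> out;
    std::optional<std::string> last_key;
    for (const auto& [view_name, cpu_id] : rows) {
        (void)cpu_id;  // cpu_id is trimmed before returning to the user
        if (last_key && *last_key == view_name) {
            continue;  // would emit a duplicated clustering row fragment
        }
        last_key = view_name;
        out.push_back(view_name);
    }
    return out;
}
```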

Fixes #4524
Tests:
unit(dev) + custom monotonicity checks from <tgrabiec@scylladb.com>

(cherry picked from commit 85a3a4b458)
2020-01-16 12:07:40 +01:00
Tomasz Grabiec
bff996959d cql: alter type: Format field name as text instead of hex
Fixes #4841

Message-Id: <1565702635-26214-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 64ff1b6405)
2020-01-05 18:51:53 +02:00
Gleb Natapov
1bdc83540b cache_hitrate_calculator: do not ignore a future returned from gossiper::add_local_application_state
We should wait for the future returned from add_local_application_state()
to resolve before issuing a new calculation; otherwise, two
add_local_application_state() calls may run simultaneously for the same state.
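The serialization this asks for can be modeled without Seastar as a simple in-flight guard (plain C++ sketch, not the actual gossiper API):

```cpp
#include <atomic>
#include <cassert>

// Sketch: refuse to start a new calculation while the previous call's
// future has not resolved, so two calls cannot overlap for the same state.
class calculation_guard {
    std::atomic<bool> _in_flight{false};
public:
    bool try_start() {                  // false if a call is still pending
        bool expected = false;
        return _in_flight.compare_exchange_strong(expected, true);
    }
    void finish() { _in_flight.store(false); }
};

inline bool overlapping_start_allowed() {
    calculation_guard g;
    g.try_start();                      // first calculation in flight
    return g.try_start();               // must be rejected
}

inline bool start_after_finish_allowed() {
    calculation_guard g;
    g.try_start();
    g.finish();                         // previous future resolved
    return g.try_start();               // now a new calculation may start
}
```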

Fixes #4838.

Message-Id: <20190812082158.GE17984@scylladb.com>
(cherry picked from commit 00c4078af3)
2020-01-05 18:50:13 +02:00
Takuya ASADA
478c35e07a dist/debian: fix missing scyllatop files
The Debian package build script runs relocate_python_scripts.py for scyllatop,
but mistakenly forgets to install tools/scyllatop/*.py.
We need to install them using scylla-server.install.

Fixes #5518

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191227025750.434407-1-syuu@scylladb.com>
2019-12-30 19:38:34 +02:00
Benny Halevy
ba968ab9ec tracing: one_session_records: keep local tracing ptr
Similar to trace_state, keep a shared_ptr<tracing> _local_tracing_ptr
in one_session_records when it is constructed, so it can be used
during shutdown.

Fixes #5243

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 7aef39e400)
2019-12-24 18:42:21 +02:00
Avi Kivity
883b5e8395 database: fix schema use-after-move in make_multishard_streaming_reader
On aarch64, asan detected a use-after-move. It doesn't happen on x86_64,
likely due to different argument evaluation order.

Fix by evaluating full_slice before moving the schema.

Note: I used "auto&&" and "std::move()" even though full_slice()
returns a reference. I think this is safer in case full_slice()
changes, and works just as well with a reference.

Fixes #5419.

(cherry picked from commit 85822c7786)
2019-12-24 18:35:01 +02:00
Rafael Ávila de Espíndola
b47033676a types: recreate dependent user types.
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.

At runtime a user type has a direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when an
entry in system.types changes.

Fixes #5049

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
(cherry picked from commit 5af8b1e4a3)
2019-12-23 17:58:26 +02:00
Tomasz Grabiec
67e45b73f0 types: Fix abort on type alter which affects a compact storage table with no regular columns
Fixes #4837

Message-Id: <1565702247-23800-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 34cff6ed6b)
2019-12-23 17:34:06 +02:00
Rafael Ávila de Espíndola
37eac75b6f cql: Fix use of UDT in reversed columns
We were missing calls to underlying_type in a few locations, so the
insert would consider the given literal invalid and the select would
refuse to fetch a UDT field.

Fixes #4672

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190708200516.59841-1-espindola@scylladb.com>
(cherry picked from commit 4e7ffb80c0)
2019-12-23 15:54:36 +02:00
Piotr Sarna
e8431a3474 table: Reduce read amplification in view update generation
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index:
     CREATE INDEX index1 ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates:
     cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
       'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
       keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors \
       skip-read-validation -node 127.0.0.1;

Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.

Refs #5409
Fixes #4615
Fixes #5418

(cherry picked from commit 79c3a508f4)
2019-12-05 22:36:20 +02:00
Yaron Kaikov
9d78d848e6 release: prepare for 3.1.2 2019-11-27 10:24:43 +02:00
Rafael Ávila de Espíndola
32aa6ddd7e commitlog: make sure a file is closed
If allocate or truncate throws, we have to close the file.

Fixes #4877

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191114174810.49004-1-espindola@scylladb.com>
(cherry picked from commit 6160b9017d)
2019-11-24 17:48:24 +02:00
Tomasz Grabiec
74cc9477af row_cache: Fix abort on bad_alloc during cache update
Since 90d6c0b, the cache will abort when trying to detach partition
entries while they're being updated. This should never happen. It can
happen, though, when the update fails on bad_alloc, because the cleanup
guard invalidates the cache before it releases the partition snapshots
(held by the "update" coroutine).

Fix by destroying the coroutine first.

Fixes #5327.

Tests:
  - row_cache_test (dev)

Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e3d025d014)
2019-11-24 17:44:07 +02:00
Nadav Har'El
95acf71680 merge: row_marker: correct row expiry condition
Merged patch set by Piotr Dulikowski:

This change corrects condition on which a row was considered expired by its
TTL.

The logic that decides when a row becomes expired was inconsistent with the
logic that decides if a single cell is expired. A single cell becomes expired
when expiry_timestamp <= now, while a row became expired when
expiry_timestamp < now (notice the strict inequality). For rows inserted
with TTL, this caused non-key cells to expire (change their values to null)
one second before the row disappeared. Now, row expiry logic uses non-strict
inequality.
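The two conditions side by side, as a sketch:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: cells and rows must agree on when TTL'd data dies. A cell is
// dead once expiry <= now; before the fix a row used the strict
// expiry < now, so at expiry == now the non-key cells already read as
// null while the row itself survived for one more second.
inline bool cell_is_expired(int64_t expiry, int64_t now)      { return expiry <= now; }
inline bool row_is_expired_old(int64_t expiry, int64_t now)   { return expiry <  now; }
inline bool row_is_expired_fixed(int64_t expiry, int64_t now) { return expiry <= now; }
```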

Fixes #4263,
Fixes #5290.

Tests:

    unit(dev)
    python test described in issue #5290

(cherry picked from commit 9b9609c65b)
2019-11-20 21:40:11 +02:00
Asias He
921f8baf00 gossip: Fix max generation drift measure
Assume n1 and n2 are in a cluster with generation numbers g1 and g2. The
cluster runs for more than 1 year (MAX_GENERATION_DIFFERENCE). When n1
reboots with generation g1', which is time-based, n2 will see
g1' > g2 + MAX_GENERATION_DIFFERENCE and reject n1's gossip update.

To fix, check the generation drift against the generation value this node
would get if it were restarted.
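A sketch of the changed check (the constant's value is an assumption based on the 1-year figure above):

```cpp
#include <cassert>
#include <cstdint>

// Assumed value: one year of seconds, matching the commit's description.
constexpr int64_t MAX_GENERATION_DIFFERENCE = 365LL * 24 * 3600;

// Buggy check: compares against the peer's last stored generation, which
// goes stale once the cluster outlives MAX_GENERATION_DIFFERENCE.
inline bool reject_generation_old(int64_t incoming, int64_t stored_peer_gen) {
    return incoming > stored_peer_gen + MAX_GENERATION_DIFFERENCE;
}

// Fixed check: compares against the generation this node would get if it
// restarted now, which is also time-based and therefore stays close.
inline bool reject_generation_fixed(int64_t incoming, int64_t local_restart_gen) {
    return incoming > local_restart_gen + MAX_GENERATION_DIFFERENCE;
}
```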

This is a backport of CASSANDRA-10969.

Fixes #5164

(cherry picked from commit 0a52ecb6df)
2019-11-20 11:39:16 +02:00
Avi Kivity
071d7d9210 reloc: do not install dependencies when building the relocatable package
The dependencies are provided by the frozen toolchain. If a dependency
is missing, we must update the toolchain rather than rely on build-time
installation, which is not reproducible (as different package versions
are available at different times).

Luckily "dnf install" does not update an already-installed package. Had
that been the case, none of our builds would have been reproducible, since
packages would be updated to the latest version as of the build time rather
than the version selected by the frozen toolchain.

So, to prevent missing packages in the frozen toolchain translating to
an unreproducible build, remove the support for installing dependencies
from reloc/build_reloc.sh. We still parse the --nodeps option in case some
script uses it.

Fixes #5222.

Tests: reloc/build_reloc.sh.
(cherry picked from commit cd075e9132)
2019-11-18 14:58:24 +02:00
Kamil Braun
769b9bbe59 view: fix bug in virtual columns.
When creating a virtual column of non-frozen map type,
the wrong type was used for the map's keys.

Fixes #5165.

(cherry picked from commit ef9d5750c8)
2019-11-18 14:55:17 +02:00
Benny Halevy
d4e553c153 sstables: delete_atomically: fix misplaced parenthesis in pending_delete_log warning message
Fixes #4861.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190818064637.9207-1-bhalevy@scylladb.com>
(cherry picked from commit 20083be9f6)
2019-11-17 17:57:00 +02:00
Avi Kivity
d983411488 build: adjust libthread_db file name to match gdb expectations
gdb searches for libthread_db.so using its canonical name of libthread_db.so.1 rather
than the file name of libthread_db-1.0.so, so use that name to store the file in the
archive.

Fixes #4996.

(cherry picked from commit d77171e10e)
2019-11-17 17:57:00 +02:00
Avi Kivity
27de1bb8e6 reconcilable_result: use chunked_vector to hold partitions
Usually, a reconcilable_result holds very few partitions (1 is common),
since the page size is limited to 1MB. But if we have paging disabled or
if we are reconciling a range full of tombstones, we may see many more.
This can cause large allocations.

Change to chunked_vector to prevent those large allocations, as they
can be quite expensive.

Fixes #4780.

(cherry picked from commit 093d2cd7e5)
2019-11-17 17:57:00 +02:00
Avi Kivity
854f8ccb40 utils::chunked_vector: add rbegin() and related iterators
Needed as an std::vector replacement.

(cherry picked from commit eaa9a5b0d7)

Prerequisite for #4780.
2019-11-17 17:57:00 +02:00
Avi Kivity
a68170c9a3 utils: chunked_vector: make begin()/end() const correct
begin() of a const vector should return a const_iterator, to avoid
giving the caller the ability to mutate it.

This slipped through since iterator's constructor does a const_cast.

Noticed by code inspection.

(cherry picked from commit df6faae980)

Prerequisite for #4780.
2019-11-17 17:57:00 +02:00
Glauber Costa
7e4bcf2c0f do not crash in user-defined operations if the controller is disabled
Scylla currently crashes if we run manual operations like nodetool
compact with the controller disabled. While we neither like nor
recommend running with the controller disabled, due to some corner cases
in the controller algorithm we are not yet at the point in which we can
deprecate this and are sometimes forced to disable it.

The reason for the crash is that manual operations will invoke
_backlog_of_shares, which returns what is the backlog needed to
create a certain number of shares. That scan the existing control
points, but when we run without the controller there are no control
points and we crash.

Backlog doesn't matter if the controller is disabled, and the return
value of this function will be immaterial in this case. So to avoid the
crash, we return something right away if the controller is disabled.

Fixes #5016

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit c9f2d1d105)
2019-11-17 12:33:23 +02:00
Avi Kivity
a74b3a182e Merge "Add proper aggregation for paged indexing" from Piotr
"
Fixes #4540

This series adds proper handling of aggregation for paged indexed queries.
Before this series, returned results were presented to the user in a
partial per-page manner, while they should have been returned as a single
aggregated value.

Tests: unit(dev)
"

* 'add_proper_aggregation_for_paged_indexing_for_3.1' of https://github.com/psarna/scylla:
  test: add 'eventually' block to index paging test
  tests: add indexing+paging test case for clustering keys
  tests: add indexing + paging + aggregation test case
  tests: add query_options to cquery_nofail
  cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
  cql3: add proper aggregation to paged indexing
  cql3: add a query options constructor with explicit page size
  cql3: enable explicit copying of query_options
  cql3: split execute_base_query implementation
2019-11-17 12:23:30 +02:00
Piotr Sarna
e9bc579565 test: add 'eventually' block to index paging test
Without 'eventually', the test is flaky because the index may still
be out of date while its conditions are checked.

Fixes #4670

(cherry picked from commit ebbe038d19)
2019-11-15 09:12:40 +01:00
Piotr Sarna
ad46bf06a7 tests: add indexing+paging test case for clustering keys
Indexing a non-prefix part of the clustering key has a separate
code path (see issue #3405), so it deserves a separate test case.
2019-11-14 10:26:49 +01:00
Piotr Sarna
1ff21a28b7 tests: add indexing + paging + aggregation test case
Indexed queries used to erroneously return partial per-page results
for aggregation queries. This test case used to reproduce the problem
and now guards against regressions.

Refs #4540
2019-11-14 10:26:49 +01:00
Piotr Sarna
fb3dfaa736 tests: add query_options to cquery_nofail
The cquery_nofail utility is extended, so it can accept custom
query options, just like execute_cql does.
2019-11-14 10:26:49 +01:00
Piotr Sarna
5a02e6976f cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
The constant will be later used in test scenarios.
2019-11-14 10:26:49 +01:00
Piotr Sarna
5202eea7a7 cql3: add proper aggregation to paged indexing
Aggregated and paged filtering needs to aggregate the results
from all pages in order to avoid returning partial per-page
results. It's a little bit more complicated than regular aggregation,
because each paging state needs to be translated between the base
table and the underlying view. The routine keeps fetching pages
from the underlying view, which are then used to fetch base rows,
which go straight to the result set builder.

Fixes #4540
2019-11-14 10:26:48 +01:00
Gleb Natapov
038733f1a5 storage_proxy: do not release mutation if not all replies were received
The MV backpressure code frees the mutation before delayed client replies
arrive in order to save memory. Commit 2d7c026d6e, which
introduced the logic, claimed to do so only when all replies are received,
but this is not the case. Fix the code to free the mutation only when all
replies have actually been received.

Fixes #5242

Message-Id: <20191113142117.GA14484@scylladb.com>
(cherry picked from commit 552c56633e)
2019-11-14 11:04:27 +02:00
Piotr Sarna
0ed2e90925 cql3: add a query options constructor with explicit page size
For internal use, there already exists a query_options constructor
that copies data from another query_options with overwritten paging
state. This commit adds an option to overwrite page size as well.
2019-11-14 09:58:35 +01:00
Piotr Sarna
9ee6d2bc15 cql3: enable explicit copying of query_options 2019-11-14 09:58:28 +01:00
Piotr Sarna
23582a2ce9 cql3: split execute_base_query implementation
In order to handle aggregation queries correctly, the function that
returns base query results is split into two, so it's possible to
access raw query results, before they're converted into end-user
CQL message.
2019-11-14 09:58:05 +01:00
Takuya ASADA
5ddf0ec1df dist/common/scripts/scylla_setup: don't proceed with empty NIC name
Currently, the NIC selection prompt in scylla_setup proceeds with setup
even when the user just presses Enter at the prompt.
The prompt should ask for the NIC name again until the user enters a
valid NIC name.

Fixes #4517
Message-Id: <20190617124925.11559-1-syuu@scylladb.com>

(cherry picked from commit 7320c966bc)
2019-11-13 17:27:21 +02:00
Avi Kivity
e6eb54af90 Update seastar submodule
* seastar 75488f6ef2...cfc082207c (2):
  > core: fix a race in execution stages
  > execution_stage: prevent unbounded growth

Fixes #4749.
Fixes #4856.
2019-11-13 13:14:27 +02:00
Piotr Sarna
f5a869966a view: fix view_info select statement for local indexes
Calculating the select statement for a given view_info structure
used to work fine, but once local indexes were introduced, a subtle
bug appeared: the legacy token column does not exist in local indexes,
and a valid clustering key column was omitted instead.
That results in potentially incorrect partition slices being used later
in read-before-write.
There's a long-term plan for removing select_statement from
view_info altogether, but nonetheless the bug needs to be fixed first.

cherry picked from commit 9e98b51aaa

Fixes #5241
Message-Id: <cb2e863e8e993e00ec7329505f737a9ce4b752ae.1572432826.git.sarna@scylladb.com>
2019-11-01 08:06:30 +02:00
Piotr Sarna
0c70cd626b index: add is_global_index() utility
The helper function is useful for determining whether a given schema
represents a global index.

cherry picked from commit 2ee8c6f595
Message-Id: <db5c9383e426fb2e55e5dbeebc7b8127afc91158.1572432826.git.sarna@scylladb.com>
2019-11-01 08:06:25 +02:00
Botond Dénes
0928aa4791 repair: repair_cf_range(): extract result of local checksum calculation only once
The loop collects the results of the checksum calculations and logs
any errors. The error logging includes `checksums[0]`, which corresponds
to the checksum calculation on the local node. This violates the
assumption of the code following the loop, which assumes that the future
of `checksums[0]` is intact after the loop terminates. However, this is
only true when the checksum calculation is successful; it is false when
it fails, as in that case the loop extracts the error and logs it. When
the code after the loop checks again whether said calculation failed, it
will get a false negative and will go ahead and attempt to extract the
value, triggering an assert failure.
Fix by making sure that even in the case of a failed checksum calculation,
the result of `checksums[0]` is extracted only once.

Fixes: #5238
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029151709.90986-1-bdenes@scylladb.com>
(cherry picked from commit e48f301e95)
2019-10-29 20:43:30 +02:00
Yaron Kaikov
f32ec885c4 release: prepare for 3.1.1 2019-10-24 21:55:50 +03:00
Tomasz Grabiec
762eec2bc6 Merge "Fix TTL serialization breakage" from Avi
Commit 93270dd changed gc_clock to be 64-bit, to fix the Y2038
problem. While 64-bit tombstone::deletion_time is serialized in a
compatible way, TTLs (gc_clock::duration) were not.

This patchset reverts TTL serialization to the 32-bit serialization
format, and also allows opting-in to the 64-bit format in case a
cluster was installed with the broken code. Only Scylla 3.1.0 is
vulnerable.

Fixes #4855

Tests: unit (dev)
(cherry picked from commit e621db591e)
2019-10-24 08:55:34 +03:00
Avi Kivity
3f4d9f210f Merge "Fix handling of schema alters and eviction in cache" from Tomasz
"
Fixes #5134, Eviction concurrent with preempted partition entry update after
  memtable flush may allow stale data to be populated into cache.

Fixes #5135, Cache reads may miss some writes if schema alter followed by a
  read happened concurrently with preempted partition entry update.

Fixes #5127, Cache populating read concurrent with schema alter may use the
  wrong schema version to interpret sstable data.

Fixes #5128, Reads of multi-row partitions concurrent with memtable flush may
  fail or cause a node crash after schema alter.
"

* tag 'fix-cache-issues-with-schema-alter-and-eviction-v2' of github.com:tgrabiec/scylla:
  tests: row_cache: Introduce test_alter_then_preempted_update_then_memtable_read
  tests: row_cache_stress_test: Verify all entries are evictable at the end
  tests: row_cache_stress_test: Exercise single-partition reads
  tests: row_cache_stress_test: Add periodic schema alters
  tests: memtable_snapshot_source: Allow changing the schema
  tests: simple_schema: Prepare for schema altering
  row_cache: Record upgraded schema in memtable entries during update
  memtable: Extract memtable_entry::upgrade_schema()
  row_cache, mvcc: Prevent locked snapshots from being evicted
  row_cache: Make evict() not use invalidate_unwrapped()
  mvcc: Introduce partition_snapshot::touch()
  row_cache, mvcc: Do not upgrade schema of entries which are being updated
  row_cache: Use the correct schema version to populate the partition entry
  delegating_reader: Optimize fill_buffer()
  row_cache, memtable: Use upgrade_schema()
  flat_mutation_reader: Introduce upgrade_schema()

(cherry picked from commit 8ed6f94a16)
2019-10-18 13:59:40 +02:00
Yaron Kaikov
9c3cdded9e release: prepare for 3.1.0 2019-10-12 08:45:49 +03:00
yaronkaikov
05272c53ed release: prepare for 3.1.0.rc9 2019-10-06 10:51:37 +03:00
Botond Dénes
393b2abdc9 querier_cache: correctly account entries evicted on insertion in the population
Currently, the population stat is not increased for entries that are
evicted immediately on insert, however the code that does the eviction
still decreases the population stat, leading to an imbalance and in some
cases the underflow of the population stat. To fix, unconditionally
increase the population stat upon inserting an entry, regardless of
whether it is immediately evicted or not.
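The accounting invariant can be sketched as follows (hypothetical names, not the real querier_cache code):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: the eviction path always decrements the population stat, so the
// insert path must increment it unconditionally, even for entries that are
// evicted immediately on insert; otherwise the stat underflows.
struct population_stat {
    int64_t value = 0;
    void on_insert() { ++value; }  // unconditional after the fix
    void on_evict()  { --value; }
};

inline int64_t insert_immediately_evicted() {
    population_stat s;
    s.on_insert();   // count the entry even though...
    s.on_evict();    // ...it is evicted right away
    return s.value;  // balanced at 0, no underflow
}
```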

Fixes: #5123

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191001153215.82997-1-bdenes@scylladb.com>
(cherry picked from commit 00b432b61d)
2019-10-05 12:36:21 +03:00
Avi Kivity
d9dc8f92cc Merge " hinted handoff: fix races during shutdown and draining" from Vlad
"
Fix races that may lead to use-after-free events and file system level exceptions
during shutdown and drain.

The root cause of the use-after-free events in question is that space_watchdog
blocks on end_point_hints_manager::file_update_mutex(), and we need to make sure
this mutex stays alive for as long as it's accessed, even if the corresponding
end_point_hints_manager instance is destroyed in the context of manager::drain_for().

File system exceptions may occur when space_watchdog attempts to scan a
directory while it's being deleted from the drain_for() context.
In case of such an exception, new hint generation is going to be blocked,
including for materialized views, until the next space_watchdog round (in 1s).

Issues that are fixed are #4685 and #4836.

Tested as follows:
 1) Patched the code in order to trigger the race with (a lot) higher
    probability and running slightly modified hinted handoff replace
    dtest with a debug binary for 100 times. Side effect of this
    testing was discovering of #4836.
 2) Using the same patch as above tested that there are no crashes and
    nodes survive stop/start sequences (they were not without this series)
    in the context of all hinted handoff dtests. Ran the whole set of
    tests with dev binary for 10 times.
"

Fixes #4685
Fixes #4836

* 'hinted_handoff_race_between_drain_for_and_space_watchdog_no_global_lock-v2' of https://github.com/vladzcloudius/scylla:
  hinted handoff: fix a race on a directory removal between space_watchdog and drain_for()
  hinted handoff: make taking file_update_mutex safe
  db::hints::manager::drain_for(): fix alignment
  db::hints::manager: serialize calls to drain_for()
  db::hints: cosmetics: indentation and missing method qualifier

(cherry picked from commit 3cb081eb84)
2019-10-05 09:50:05 +03:00
Gleb Natapov
c009f7b182 messaging_service: enable reuseaddr on messaging service rpc
Fixes #4943

Message-Id: <20190918152405.GV21540@scylladb.com>
(cherry picked from commit 73e3d0a283)
2019-10-03 14:42:38 +03:00
Avi Kivity
303a56f2bd Update seastar submodule
* seastar 7dfcf334c4...75488f6ef2 (2):
  > net: socket::{set,get}_reuseaddr() should not be virtual
  > Merge "fix some tcp connection bugs and add reuseaddr option to a client socket" from Gleb

Prerequisite for #4943.
2019-10-03 14:41:34 +03:00
Tomasz Grabiec
57512d3df9 db: read: Filter out sstables using their first and last keys
Affects single-partition reads only.

Refs #5113

When executing a query on the replica we do several things in order to
narrow down the sstable set we read from.

For tables which use LeveledCompactionStrategy, we store sstables in
an interval set and we select only sstables whose partition ranges
overlap with the queried range. Other compaction strategies don't
organize the sstables and will select all sstables at this stage. The
reasoning behind this is that for non-LCS compaction strategies the
sstables' ranges will typically overlap and using interval sets in
this case would not be effective and would result in quadratic (in
sstable count) memory consumption.

The assumption for overlap does not hold if the sstables come from
repair or streaming, which generates non-overlapping sstables.

At a later stage, for single-partition queries, we use the sstables'
bloom filter (kept in memory) to drop sstables which surely don't
contain given partition. Then we proceed to sstable indexes to narrow
down the data file range.

Tables which don't use LCS will do unnecessary I/O to read index pages
for single-partition reads if the partition is outside of the
sstable's range and the bloom filter is ineffective (Refs #5112).

This patch fixes the problem by consulting sstable's partition range
in addition to the bloom filter, so that the non-overlapping sstables
will be filtered out with certainty and not depend on bloom filter's
efficiency.

It's also faster to drop sstables based on the keys than the bloom
filter.

Tests:
  - unit (dev)
  - manual using cqlsh

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>
(cherry picked from commit b0e0f29b06)
2019-09-29 10:57:58 +03:00
Tomasz Grabiec
a894868298 sstables: Fix partition key count estimation for a range
The method sstable::estimated_keys_for_range() was severely
under-estimating the number of partitions in an sstable for a given
token range.

The first reason is that it underestimated the number of sstable index
pages covered by the range by one. In the extreme, if the requested range
falls into a single index page, we will assume 0 pages, and report 1
partition. The reason is that we were using
get_sample_indexes_for_range(), which returns entries with the keys
falling into the range, not entries for pages which may contain the
keys.

A single page can have a lot of partitions though. By default, there
is a 1:20000 ratio between summary entry size and the data file size
covered by it. If partitions are small, that can be many hundreds of
partitions.

Another reason is that we underestimate the number of partitions in an
index page. We multiply the number of pages by:

   (downsampling::BASE_SAMPLING_LEVEL * _components->summary.header.min_index_interval)
     / _components->summary.header.sampling_level

Using defaults, that means multiplying by 128. In the cassandra-stress
workload a single partition takes about 300 bytes in the data file and
summary entry is 22 bytes. That means a single page covers 22 * 20'000
= 440'000 bytes of the data file, which contains about 1'466
partitions. So we underestimate by an order of magnitude.

Underestimating the number of partitions will result in too small
bloom filters being generated for the sstables which are the output of
repair or streaming. This will make the bloom filters ineffective
which results in reads selecting more sstables than necessary.

The fix is to base the estimation on the number of index pages which
may contain keys for the range, and multiply that by the average key
count per index page.
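The arithmetic from the paragraphs above, spelled out (the values are the assumed defaults quoted in the message, not code from the fix):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the estimation arithmetic: a summary entry of ~22 bytes covers
// 20'000x its size in the data file; with ~300-byte partitions a single
// index page therefore holds ~1'466 partitions, and the estimate is the
// number of pages that may contain keys for the range times that average.
inline int64_t data_bytes_per_index_page(int64_t entry_bytes, int64_t ratio) {
    return entry_bytes * ratio;              // e.g. 22 * 20'000 = 440'000
}
inline int64_t keys_per_index_page(int64_t page_bytes, int64_t partition_bytes) {
    return page_bytes / partition_bytes;     // e.g. 440'000 / 300 = 1'466
}
inline int64_t estimated_keys_for_range(int64_t pages_covering_range,
                                        int64_t avg_keys_per_page) {
    return pages_covering_range * avg_keys_per_page;
}
```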

Fixes #5112.
Refs #4994.

The output of test_key_count_estimation:

Before:

count = 10000
est = 10112
est([-inf; +inf]) = 512
est([0; 0]) = 128
est([0; 63]) = 128
est([0; 255]) = 128
est([0; 511]) = 128
est([0; 1023]) = 128
est([0; 4095]) = 256
est([0; 9999]) = 512
est([5000; 5000]) = 1
est([5000; 5063]) = 1
est([5000; 5255]) = 1
est([5000; 5511]) = 1
est([5000; 6023]) = 128
est([5000; 9095]) = 256
est([5000; 9999]) = 256
est(non-overlapping to the left) = 1
est(non-overlapping to the right) = 1

After:

count = 10000
est = 10112
est([-inf; +inf]) = 10112
est([0; 0]) = 2528
est([0; 63]) = 2528
est([0; 255]) = 2528
est([0; 511]) = 2528
est([0; 1023]) = 2528
est([0; 4095]) = 5056
est([0; 9999]) = 10112
est([5000; 5000]) = 2528
est([5000; 5063]) = 2528
est([5000; 5255]) = 2528
est([5000; 5511]) = 2528
est([5000; 6023]) = 5056
est([5000; 9095]) = 7584
est([5000; 9999]) = 7584
est(non-overlapping to the left) = 0
est(non-overlapping to the right) = 0

Tests:
  - unit (dev)

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927141339.31315-1-tgrabiec@scylladb.com>
(cherry picked from commit b93cc21a94)
2019-09-28 22:12:04 +03:00
Raphael S. Carvalho
a5d385d702 sstables/compaction_manager: Don't perform upgrade on shared SSTables
compaction_manager::perform_sstable_upgrade() fails when it feeds
compaction mechanism with shared sstables. Shared sstables should
be ignored when performing upgrade and so wait for reshard to pick
them up in parallel. Whenever a shared sstable is brought up either
on restart or via refresh, reshard procedure kicks in.
Reshard picks the highest supported format so the upgrade for
shared sstable will naturally take place.

Fixes #5056.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190925042414.4330-1-raphaelsc@scylladb.com>
(cherry picked from commit 571fa94eb5)
2019-09-28 17:39:09 +03:00
Avi Kivity
6413063b1b Merge "mvcc: Fix incorrect schema version being used to copy the mutation when applying (#5099)" from Tomasz
"
Currently affects only counter tables.

Introduced in 27014a2.

mutation_partition(s, mp) is incorrect because it uses s to interpret
mp, while it should use mp_schema.

We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during table schema altering when we receive the
mutation from a node which hasn't processed the schema change yet.

This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.

Fixes #5095.
"

* 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla:
  mvcc: Fix incorrect schema version being used to copy the mutation when applying
  mutation_partition: Track and validate schema version in debug builds
  tests: Use the correct schema to access mutation_partition

(cherry picked from commit 83bc59a89f)
2019-09-28 17:38:04 +03:00
Tomasz Grabiec
0d31c6da62 Merge "storage_proxy: tolerate view_update_write_response_handler id not found on shutdown" from Benny
1. Add an assert in remove_response_handler to make crashes like the one in #5032 easier to understand.
2. Look up the view_update_write_response_handler id before calling timeout_cb and tolerate it not being found.
   Just log a warning if this happens.

Fixes #5032

(cherry picked from commit 06b9818e98)
2019-09-28 17:37:40 +03:00
Tomasz Grabiec
b62bb036ed Merge "toppartitions: don't transport schema_ptr across shards" from Avi
When the toppartitions operation gathers results, it copies partition
keys with their schema_ptr:s. When these schema_ptr:s are copied
or destroyed, they can cause leaks or premature frees of the schema
in its original shard, since reference count operations are not atomic.

Fix that by converting the schema_ptr to a global_schema_ptr during
transportation.

Fixes #5104 (direct bug)
Fixes #5018 (schema prematurely freed, toppartitions previously executed on that node)
Fixes #4973 (corrupted memory pool of the same size class as schema, toppartitions previously executed on that node)

Tests: new test added that fails with the existing code in debug mode,
manual toppartitions test

(cherry picked from commit 5b0e48f25b)
2019-09-28 17:35:19 +03:00
Glauber Costa
bdabd2e7a4 toppartitions: fix typo
toppartitons -> toppartitions

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190627160937.7842-1-glauber@scylladb.com>
(cherry picked from commit d916601ea4)

Ref #5104 (prerequisite for patch)
2019-09-28 17:34:24 +03:00
Benny Halevy
d7fc7bcf9f commitlog: descriptor: skip leading path from filename
std::regex_match of the leading path may run out of stack
with long paths in debug build.

Using rfind instead to look up the last '/' in the pathname
and skip it if found.
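The rfind approach can be sketched as follows (a minimal sketch; the function name is illustrative, not the actual descriptor code): find the last '/' and keep only the filename component, with no regex and no risk of running out of stack on long paths.

```cpp
#include <cassert>
#include <string>

// Illustrative helper (name assumed): strip the leading directory part
// of a commitlog pathname by locating the last '/' with rfind.
std::string strip_leading_path(const std::string& pathname) {
    auto pos = pathname.rfind('/');
    return pos == std::string::npos ? pathname : pathname.substr(pos + 1);
}
```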

Fixes #4464

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190505144133.4333-1-bhalevy@scylladb.com>
(cherry picked from commit d9136f96f3)
2019-09-23 11:29:26 +03:00
Hagit Segev
21aec9c7ef release: prepare for 3.1.0.rc8 2019-09-23 07:01:02 +03:00
Asias He
02ce19e851 storage_service: Replicate and advertise tokens early in the boot up process
When a node is restarted, there is a race between gossip starts (other
nodes will mark this node up again and send requests) and the tokens are
replicated to other shards. Here is an example:

- n1, n2
- n2 is down, n1 think n2 is down
- n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
  reads/writes to n2, but n2 hasn't replicated the token_metadata to all
  the shards.
- n2 complains:
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
  (sorted_tokens is empty in first_token_index!)

The code path looks like below:

0 storage_service::init_server
1    prepare_to_join()
2          add gossip application state of NET_VERSION, SCHEMA and so on.
3         _gossiper.start_gossiping().get()
4    join_token_ring()
5           _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6           replicate_to_all_cores().get()
7           storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS

The race talked above is at line 3 and line 6.

To fix, we can replicate the token_metadata early after it is filled
with the tokens read from system table before gossip starts. So that
when other nodes think this restarting node is up, the tokens are
already replicated to all the shards.
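The reordering can be sketched as a sequence check (step names here are made up for illustration, not the actual function names): token_metadata is replicated to all shards before gossip starts, so a peer that marks this node up never observes empty sorted_tokens on any shard.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the fixed boot ordering (illustrative step names):
// replication of token_metadata is moved before starting gossip.
std::vector<std::string> boot_steps_fixed() {
    return {
        "load_tokens_from_system_table",
        "replicate_token_metadata_to_all_shards", // moved earlier by the fix
        "start_gossiping",                        // peers may now send reads/writes
        "join_token_ring",
    };
}
```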

In addition, this patch also fixes the issue that other nodes might see
a node missing the TOKENS and STATUS application state in gossip if that
node failed in the middle of a restarting process, i.e., it is killed
after line 3 and before line 7. As a result we could not replace the
node.

Tests: update_cluster_layout_tests.py
Fixes: #4709
Fixes: #4723
(cherry picked from commit 3b39a59135)
2019-09-22 12:45:22 +03:00
Eliran Sinvani
37c4be5e74 Storage proxy: protect against infinite recursion in query_partition_key_range_concurrent
A recent fix to #3767 limited the amount of ranges that
can return from query_ranges_to_vnodes_generator. This with
the combination of a large amount of token ranges can lead to
an infinite recursion. The algorithm multiplies by factor of
2 (actualy a shift left by one)  the amount of requested
tokens in each recursion iteration. As long as the requested
number of ranges is greater than 0, the recursion is implicit,
and each call is scheduled separately since the call is inside
a continuation of a map reduce.
But if the amount of iterations is large enough (~32) the
counter for requested ranges zeros out and from that moment on
two things will happen:
1. The counter will remain 0 forever (0*2 == 0)
2. The map reduce future will be immediately available and this
will result in the continuation being invoked immediately.
The latter causes the recursive call to be a "regular" recursive call,
i.e. through the stack and not the task queue of the scheduler, and
the former causes this recursion to be infinite.
The combination creates a stack that keeps growing and eventually
overflows resulting in undefined behavior (due to memory overrun).

This patch prevents the problem from happening: it limits the growth of
the concurrency counter to twice the last number of tokens returned
by the query_ranges_to_vnodes_generator, and also makes sure it does not
get stuck at zero.
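The clamp can be sketched as follows (function and parameter names are assumed for illustration, not the actual Scylla code): the counter doubles each iteration but is capped at twice the number of ranges actually returned last time, and can never collapse to zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Sketch of the guard (names assumed): doubling is clamped to twice the
// last returned count, and the result is never allowed to be zero,
// avoiding the 0 * 2 == 0 trap described above.
std::size_t next_concurrency(std::size_t current, std::size_t last_returned) {
    std::size_t doubled = current * 2;
    std::size_t cap = last_returned * 2;
    return std::max<std::size_t>(1, std::min(doubled, cap));
}
```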

Testing: * Unit test in dev mode.
         * Modified add 50 dtest that reproduce the problem

Fixes #4944

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190922072838.14957-1-eliransin@scylladb.com>
(cherry picked from commit 280715ad45)
2019-09-22 11:59:19 +03:00
Avi Kivity
d81ac93728 Update seastar submodule
* seastar b314eb21b1...7dfcf334c4 (1):
  > iotune: fix exception handling in case test file creation fails

Fixes #5001.
2019-09-18 18:36:13 +03:00
Tomasz Grabiec
024d1563ad Revert "Simplify db::cql_type_parser::parse"
This reverts commit 7f64a6ec4b.

Fixes #5011

The reverted commit exposes #3760 for all schemas, not only those
which have UDTs.

The problem is that table schema deserialization now requires keyspace
to be present. If the replica hasn't received schema changes which
introduce the keyspace yet, the write will fail.

(cherry picked from commit 8517eecc28)
2019-09-12 20:17:39 +03:00
yaronkaikov
4a1a281e84 release: prepare for 3.1.0.rc7 2019-09-11 15:15:38 +03:00
Piotr Sarna
d61dd1a933 main: make sure view_builder doesn't propagate semaphore errors
Stopping services which occurs in a destructor of deferred_action
should not throw, or it will end the program with
terminate(). View builder breaks a semaphore during its shutdown,
which results in propagating a broken_semaphore exception,
which in turn results in throwing an exception during stop().get().
In order to fix that issue, semaphore exceptions are explicitly
ignored, since they're expected to appear during shutdown.

Fixes #4875
Fixes #4995.

(cherry picked from commit 23c891923e)
2019-09-10 16:34:46 +03:00
Gleb Natapov
447c1e3bcc messaging_service: configure different streaming domain for each rpc server
A streaming domain identifies a server across shards. Each server should
have different one.

Fixes: #4953

Message-Id: <20190908085327.GR21540@scylladb.com>
(cherry picked from commit 9e9f64d90e)
2019-09-09 20:36:11 +03:00
Botond Dénes
834b92b3d7 stream_session: STREAM_MUTATION_FRAGMENTS: print errors in receive and distribute phase
Currently when an error happens during the receive and distribute phase
it is swallowed and we just return a -1 status to the remote. We only
log errors that happen during responding with the status. This means
that when streaming fails, we only know that something went wrong, but
the node on which the failure happened doesn't log anything.

Fix by also logging errors happening in the receive and distribute
phase. Also mention the phase in which the error happened in both error
log messages.

Refs: #4901
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190903115735.49915-1-bdenes@scylladb.com>
(cherry picked from commit 783277fb02)
2019-09-09 14:34:33 +03:00
Hagit Segev
2ec036f50c release: prepare for 3.1.0.rc6 2019-09-08 10:32:22 +03:00
Avi Kivity
958fe2024f Update seastar submodule
* seastar c59d019d6b...b314eb21b1 (2):
  > reactor: fix false positives in the stall detector due to large task queue
  > reactor: remove unused _tasks_processed variable

Ref #4955, #4951, #4899, #4898.
2019-09-05 14:39:53 +03:00
Rafael Ávila de Espíndola
cd998b949a sstable: close file_writer if an exception is thrown
The previous code was not exception safe and would eventually cause a
file to be destroyed without being closed, causing an assert failure.

Unfortunately it doesn't seem to be possible to test this without
error injection, since using an invalid directory fails before this
code is executed.

Fixes #4948

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190904002314.79591-1-espindola@scylladb.com>
(cherry picked from commit 000514e7cc)
2019-09-05 10:14:36 +03:00
Avi Kivity
2e1e1392ea storage_proxy: protect _view_update_handlers_list iterators from invalidation
on_down() iterates over _view_update_handlers_list, but it yields during iteration,
and while it yields, elements in that list can be removed, resulting in a
use-after-free.

Prevent this by registering iterators that can be potentially invalidated, and
any time we remove an element from the list, check whether we're removing an element
that is being pointed to by a live iterator. If that is the case, advance the iterator
so that it points at a valid element (or at the end of the list).
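The protection scheme can be sketched with a heavily simplified container (types and names assumed, not the actual intrusive-list code): live iterators register themselves, and erasing an element first advances any registered iterator that points at it.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Simplified sketch: erase() advances any registered live iterator that
// points at the element being removed, so it stays valid.
struct safe_list {
    std::list<int> elems;
    std::vector<std::list<int>::iterator*> live; // registered iterators

    void erase(std::list<int>::iterator it) {
        for (auto* p : live) {
            if (*p == it) {
                ++*p; // skip past the element about to disappear
            }
        }
        elems.erase(it);
    }
};
```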

Fixes #4912.

Tests: unit (dev)
(cherry picked from commit 301246f6c0)
2019-09-05 09:42:00 +03:00
yaronkaikov
623ea5e3d9 release: prepare for 3.1.0.rc5 2019-09-02 14:42:47 +03:00
Avi Kivity
f92a7ca2bf tools: toolchain: fix dbuild in interactive mode regression
Before ede1d248af, running "tools/toolchain/dbuild -it -- bash" was
a nice way to play in the toolchain environment, for example to start
a debugger. But that commit caused containers to run in detached mode,
which is incompatible with interactive mode.

To restore the old behavior, detect that the user wants interactive mode,
and run the container in non-detached mode instead. Add the --rm flag
so the container is removed after execution (as it was before ede1d248af).

Fixes #4930.

Message-Id: <20190506175942.27361-1-avi@scylladb.com>

(cherry picked from commit db536776d9)
2019-08-29 18:33:44 +03:00
Tomasz Grabiec
d70c2db09c service: Announce the new schema version when features are enabled
Introduced in c96ee98.

We call update_schema_version() after features are enabled and we
recalculate the schema version. This method is not updating gossip
though. The node will still use its database::version() to decide on
syncing, so it will not sync and stay inconsistent in gossip until the
next schema change.

We should call update_schema_version_and_announce() instead so that
the gossip state is also updated.

There is no actual schema inconsistency, but the joining node will
think there is and will wait indefinitely. Making a random schema
change would unblock it.

Fixes #4647.

Message-Id: <1566825684-18000-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit ac5ff4994a)
2019-08-27 08:35:58 +03:00
Paweł Dziepak
e4a39ed319 mutation_partition: verify row::append_cell() precondition
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds assertion to make sure we fail early in such cases.
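The precondition can be sketched with a heavily simplified row (the real row stores cells, not bare column ids): each appended cell's id must be strictly greater than the last one, so violations fail early rather than silently corrupting the row.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified sketch: cells are kept ordered by column id, and
// append_cell asserts that ordering on every append.
struct row {
    std::vector<uint32_t> ids;
    void append_cell(uint32_t id) {
        assert(ids.empty() || id > ids.back()); // fail early on violation
        ids.push_back(id);
    }
};
```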

(cherry picked from commit 060e3f8ac2)
2019-08-23 15:05:59 +02:00
Hagit Segev
bb70b9ed56 release: prepare for 3.1.0.rc4 2019-08-22 21:12:42 +03:00
Avi Kivity
e06e795031 Merge "database: assign proper io priority for streaming view updates" from Piotr
"
Streamed view updates parasitized on writing io priority, which is
reserved for user writes - it's now properly bound to streaming
write priority.

Verified manually by checking appropriate io metrics: scylla_io_queue_total_bytes{class="streaming_write" ...} vs scylla_io_queue_total_bytes{class="query" ...}

Tests: unit(dev)
"

Fixes #4615.

* 'assign_proper_io_priority_to_streaming_view_updates' of https://github.com/psarna/scylla:
  db,view: wrap view update generation in stream scheduling group
  database: assign proper io priority for streaming view updates

(cherry picked from commit 2c7435418a)
2019-08-22 16:20:19 +03:00
Piotr Sarna
7d56e8e5bb storage_proxy: fix iterator liveness issue in on_down (#4876)
The loop over view update handlers used a guard in order to ensure
that the object is not prematurely destroyed (thus invalidating
the iterator), but the guard itself was not in the right scope.
Fixed by replacing a 'for' loop with a 'while' loop, which moves
the iterator incrementation inside the scope in which it's still
guarded and valid.

Fixes #4866

(cherry picked from commit 526f4c42aa)
2019-08-21 19:04:56 +03:00
Avi Kivity
417250607b relocatable: switch from run-time relocation to install-time relocation
Our current relocation works by invoking the dynamic linker with the
executable as an argument. This confuses gdb since the kernel records
the dynamic linker as the executable, not the real executable.

Switch to install-time relocation with patchelf: when installing the
executable and libraries, all paths are known, and we can update the
path to the dynamic loader and to the dynamic libraries.

Since patchelf itself is dynamically linked, we have to relocate it
dynamically (with the old method of invoking it via the dynamic linker).
This is okay since it's a one-time operation and since we don't expect
to debug core dumps of patchelf crashes.

We lose the ability to run scylla directly from the uninstalled
tarball, but since the nonroot installer is already moving in the
direction of requiring install.sh, that is not a great loss, and
certainly the ability to debug is more important.

dh_strip barfs on some binaries which were treated with patchelf,
so exclude them from dh_strip. This doesn't lose any functionality,
since these binaries didn't have debug information to begin with
(they are already-stripped Fedora executables).

Fixes #4673.

(cherry-picked from commit 698b72b501)

Backport notes:
 - 3.1 doesn't call install.sh from the debian packager, so add an adjust_bin
   and call it from the debian rules file directly
 - adjusted install.sh for 3.1 prefix (/usr) compared to master prefix (/opt/scylladb)
2019-08-20 17:08:49 +03:00
Pekka Enberg
d06bcef3b7 Merge "docker: relax permission checks" from Avi
"Commit e3f7fe4 added file owner validation to prevent Scylla from
 crashing when it tries to touch a file it doesn't own. However, under
 docker, we cannot expect to pass this check since user IDs are from
 different namespaces: the process runs in a container namespace, but the
 data files usually come from a mounted volume, and so their uids are
 from the host namespace.

 So we need to relax the check. We do this by reverting b1226fb, which
 causes Scylla to run as euid 0 in docker, and by special-casing euid 0
 in the ownership verification step.

 Fixes #4823."

* 'docker-euid-0' of git://github.com/avikivity/scylla:
  main: relax file ownership checks if running under euid 0
  Revert "dist/docker/redhat: change user of scylla services to 'scylla'"

(cherry picked from commit 595434a554)
2019-08-14 08:31:10 +03:00
Tomasz Grabiec
50c5cb6861 Merge "Multishard combining reader more robust reader recreation" from Botond
Make the reader recreation logic more robust, by moving away from
deciding which fragments have to be dropped based on a bunch of
special cases, instead replacing this with a general logic which just
drops all already seen fragments (based on their position).  Special
handling is added for the case when the last position is a range
tombstone with a non full prefix starting position.  Reproducer unit
tests are added for both cases.

Refs #4695
Fixes #4733

(cherry picked from commit 0cf4fab2ca)
2019-08-14 08:30:53 +03:00
Kamil Braun
70f5154109 Fix command line argument parsing in main.
Command line arguments are parsed twice in Scylla: once in main and once in Seastar's app_template::run.
The first parse is there to check if the "--version" flag is present --- in this case the version is printed
and the program exits. The second parsing is correct; however, most of the arguments were improperly treated
as positional arguments during the first parsing (e.g., "--network host" would treat "host" as a positional argument).
This happened because the arguments weren't known to the command line parser.
This commit fixes the issue by moving the parsing code to after the arguments are registered.
Resolves #4141.

Signed-off-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit f155a2d334)
2019-08-13 20:11:26 +03:00
Rafael Ávila de Espíndola
329c419c30 Always close commitlog files
We were using segment::_closed to decide whether _file was already
closed. Unfortunately they are not exactly the same thing. As far as
I understand it, segments can be closed and reused without actually
closing the file.

Found with a seastar patch that asserts on destroying an open
append_challenged_posix_file_impl.

Fixes #4745.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190721171332.7995-1-espindola@scylladb.com>
(cherry picked from commit 636e2470b1)
2019-08-13 19:59:13 +03:00
Avi Kivity
062d43c76e Merge "Unbreak the Unbreakable Linux" from Glauber
"
scylla_setup is currently broken for OEL. This happens because the
OS detection code checks for RHEL and Fedora. CentOS returns itself
as RHEL, but OEL does not.
"

Fixes #4842.

* 'unbreakable' of github.com:glommer/scylla:
  scylla_setup: be nicer about unrecognized OS
  scylla_util: recognize OEL as part of the RHEL family

(cherry picked from commit 1cf72b39a5)
2019-08-13 16:52:05 +03:00
Avi Kivity
cf4c238b28 Merge "Catch unclosed partition sstable write #4794" from Tomasz
"
Not emitting partition_end for a partition is incorrect. SStable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.

It's better to catch this problem at the time of writing, and not
generate bad sstables.

Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.

Enabled for both mc and la/ka.

Passing --abort-on-internal-error on the command line will switch to
aborting instead of throwing an exception.

The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
"

* 'catch-unclosed-partition-sstable-write' of https://github.com/tgrabiec/scylla:
  sstables: writer: Validate that partition is closed when the input mutation stream ends
  config, exceptions: Add helper for handling internal errors
  utils: config_file: Introduce named_value::observe()

(cherry picked from commit 95c0804731)
2019-08-08 13:13:42 +02:00
Amnon Heiman
20090c1992 init: do not allow replace-address for seeds
If a node is a seed node, it can not be started with
replace-address-first-boot or the replace-address flag.

The issue is that, as a seed node, it will generate new tokens instead of
replacing the existing ones the user expects it to replace when supplying
the flags.

This patch will throw a bad_configuration_error exception
in this case.
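The check can be sketched as follows (a minimal sketch; the exception type is simplified to std::runtime_error, whereas the real code throws bad_configuration_error): a seed node must not be started with either replace flag.

```cpp
#include <cassert>
#include <stdexcept>

// Sketch of the startup guard (names assumed): reject the combination
// of seed status and either replace-address flag.
void check_replace_flags(bool is_seed, bool replace_address, bool replace_address_first_boot) {
    if (is_seed && (replace_address || replace_address_first_boot)) {
        throw std::runtime_error("replace-address is not allowed on a seed node");
    }
}
```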

Fixes #3889

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 399d79fc6f)
2019-08-07 22:04:58 +03:00
Raphael S. Carvalho
8ffb567474 table: do not rely on undefined behavior in cleanup_sstables
It shouldn't rely on argument evaluation order, which is ub.

Fixes #4718.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0e732ed1cf)
2019-08-07 21:48:44 +03:00
Tomasz Grabiec
710ec83d12 Merge "Fix the system.size_estimates table" from Kamil
Fixes a segfault when querying for an empty keyspace.

Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.

Fixes #4689.

(cherry picked from commit 14700c2ac4)
2019-08-07 21:38:38 +03:00
Asias He
8d7c489436 streaming: Move stream_mutation_fragments_cmd to a new file (#4812)
Avoid including the lengthy stream_session.hh in messaging_service.

More importantly, fix the build because currently messaging_service.cc
and messaging_service.hh do not include stream_mutation_fragments_cmd.
I am not sure why it builds on my machine. Spotted this when backporting
the "streaming: Send error code from the sender to receiver" to 3.0
branch.

Refs: #4789
(cherry picked from commit 49a73aa2fc)
2019-08-07 19:11:51 +02:00
Asias He
6ec558e3a0 streaming: Send error code from the sender to receiver
In case of error on the sender side, the sender does not propagate the
error to the receiver. The sender will close the stream. As a result,
the receiver will get nullopt from the source in
get_next_mutation_fragment and pass mutation_fragment_opt with no value
to the generating_reader. In turn, the generating_reader generates end
of stream. However, the last element that the generating_reader has
generated can be any type of mutation_fragment. This makes the sstable
that consumes the generating_reader violate the mutation_fragment
stream rule.

To fix, we need to propagate the error. However, RPC streaming does not
support propagating the error in the framework. The user has to send an
error code explicitly.

Fixes: #4789
(cherry picked from commit bac987e32a)
(cherry picked from commit 288371ce75)
2019-08-07 19:11:33 +02:00
Tomasz Grabiec
b1e2842c8c sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end
Currently, if there is a fragment in _ready and _out_of_range was set
after the row end was consumed, push_ready_fragments() would return
without emitting partition_end.

This is problematic once we make consume_row_start() emit
partition_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.
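The invariant can be sketched with a heavily simplified drain function (fragment kinds shown as strings; the real code works on mutation_fragment objects): every queued fragment is flushed first, and only then, if the reader ran out of range, the partition is closed, so partition_end is never skipped.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Simplified sketch: drain everything that is ready before emitting the
// trailing partition_end when the reader is out of range.
std::vector<std::string> push_ready_fragments(std::deque<std::string>& ready, bool out_of_range) {
    std::vector<std::string> out;
    while (!ready.empty()) {          // flush all queued fragments first...
        out.push_back(ready.front());
        ready.pop_front();
    }
    if (out_of_range) {               // ...and only then end the partition
        out.push_back("partition_end");
    }
    return out;
}
```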

Fixes #4786

(cherry picked from commit 9b8ac5ecbc)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-08-01 13:06:08 +03:00
Avi Kivity
5a273737e3 Update seastar submodule
* seastar 82637dcab4...c59d019d6b (1):
  > reactor: fix deadlock of stall detector vs dlopen

Fixes #4759.
2019-07-31 18:31:40 +03:00
Avi Kivity
b0d2312623 toppartitions: fix race between listener removal and reads
Data listener reads are implemented as flat_mutation_readers, which
take a reference to the listener and then execute asynchronously.
The listener can be removed between the time when the reference is
taken and actual execution, resulting in a dangling pointer
dereference.

Fix by using a weak_ptr to avoid writing to a destroyed object. Note that writes
don't need protection because they execute atomically.

Fixes #4661.

Tests: unit (dev)
(cherry picked from commit e03c7003f1)
2019-07-28 13:53:40 +03:00
Avi Kivity
2f007d8e6b sstable: index_reader: close index_reader::reader more robustly
If we had an error while reading, then we would have failed to close
the reader, which in turn can cause memory corruption. Make the
closing more robust by using then_wrapped (that doesn't skip on
exception) and log the error for analysis.

Fixes #4761.

(cherry picked from commit b272db368f)
2019-07-27 18:19:57 +03:00
yaronkaikov
bebfd7b26c release: prepare for 3.1.0.rc3 2019-07-25 12:15:55 +03:00
Tomasz Grabiec
03b48b2caf database: Add missing partition slicing on streaming reader recreation
streaming_reader_lifecycle_policy::create_reader() was ignoring the
partition_slice passed to it and always creating the reader for the
full slice.

That's wrong because create_reader() is called when recreating a
reader after it's evicted. If the reader stopped in the middle of
partition we need to start from that point. Otherwise, fragments in
the mutation stream will appear duplicated or out of order, violating
assumptions of the consumers.

This was observed to result in repair writing incorrect sstables with
duplicated clustering rows, which results in
malformed_sstable_exception on read from those sstables.

Fixes #4659.

In v2:

  - Added an overload without partition_slice to avoid changing existing users which never slice

Tests:

  - unit (dev)
  - manual (3 node ccm + repair)

Backport: 3.1
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1563451506-8871-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7604980d63)
2019-07-22 15:08:09 +03:00
Avi Kivity
95362624bc Merge "Fix disable_sstable_write synchronization with on_compaction_completion" from Benny
"
disable_sstable_write needs to acquire _sstable_deletion_sem to properly synchronize
with background deletions done by on_compaction_completion to ensure no sstables will
be created or deleted during reshuffle_sstables after
storage_service::load_new_sstables disables sstable writes.

Fixes #4622

Test: unit(dev), nodetool_additional_test.py migration_test.py
"

* 'scylla-4622-fix-disable-sstable-write' of https://github.com/bhalevy/scylla:
  table: document _sstables_lock/_sstable_deletion_sem locking order
  table: disable_sstable_write: acquire _sstable_deletion_sem
  table: uninline enable_sstable_write
  table: reshuffle_sstables: add log message

(cherry picked from commit 43690ecbdf)
2019-07-22 13:47:25 +03:00
Asias He
7865c314a5 repair: Avoid deadlock in remove_repair_meta
Start n1, n2
Create ks with rf = 2
Run repair on n2
Stop n2 in the middle of repair
n1 will notice n2 is DOWN, gossip handler will remove repair instance
with n2 which calls remove_repair_meta().

Inside remove_repair_meta(), we have:

```
1        return parallel_for_each(*repair_metas, [repair_metas] (auto& rm) {
2            return rm->stop();
3        }).then([repair_metas, from] {
4            rlogger.debug("Removed all repair_meta for single node {}", from);
5        });
```

Since 3.1, we start 16 repair instances in parallel which will create 16
readers. The reader semaphore is 10.

At line 2, it calls

```
6    future<> stop() {
7       auto gate_future = _gate.close();
8       auto writer_future = _repair_writer.wait_for_writer_done();
9       return when_all_succeed(std::move(gate_future), std::move(writer_future));
10    }
```

The gate protects the reader to read data from disk:

```
11 with_gate(_gate, [] {
12   read_rows_from_disk
13        return _repair_reader.read_mutation_fragment() --> calls reader() to read data
14 })
```

So line 7 won't return until all the 16 readers return from the call of
reader().

The problem is, the reader won't release the reader semaphore until the
reader is destroyed!
So, even if 10 out of the 16 readers have finished reading, they won't
release the semaphore. As a result, the stop() hangs forever.

To fix in the short term, we can delete the reader, i.e., drop the
repair_meta object once it is stopped.

Refs: #4693
(cherry picked from commit 8774adb9d0)
2019-07-21 13:31:08 +03:00
Asias He
0e6b62244c streaming: Do not open rpc stream connection if ranges are not relevant to a shard
Given a list of ranges to stream, stream_transfer_task will create a
reader with the ranges and create an rpc stream connection on all the shards.

When the user provides ranges to repair with the -st/-et options, e.g.,
using scylla-manager, such ranges can belong to only one shard; repair
will pass such ranges to streaming.

As a result, only one shard will have data to send while the rpc stream
connections are created on all the shards, which can cause the kernel
to run out of ports on some systems.

To mitigate the problem, do not open the connection if the ranges do not
belong to the shard at all.

Refs: #4708
(cherry picked from commit 64a4c0ede2)
2019-07-21 10:23:49 +03:00
Kamil Braun
9d722a56b3 Fix timestamp_type_impl::timestamp_from_string.
Now it accepts the 'z' or 'Z' timezone, denoting UTC+00:00.
Fixes #4641.
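The accepted normalization can be sketched as follows (the helper name is illustrative, not the actual parser code): a trailing 'z' or 'Z' designator is rewritten to the "+0000" offset form before further parsing.

```cpp
#include <cassert>
#include <string>

// Illustrative helper (name assumed): map a trailing 'z'/'Z' timezone
// designator to the equivalent "+0000" offset.
std::string normalize_utc_suffix(std::string ts) {
    if (!ts.empty() && (ts.back() == 'z' || ts.back() == 'Z')) {
        ts.pop_back();
        ts += "+0000";
    }
    return ts;
}
```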

Signed-off-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit 4417e78125)
2019-07-17 21:54:47 +03:00
Eliran Sinvani
7009d5fb23 auth: Prevent race between role_manager and password_authenticator
When scylla is started for the first time with PasswordAuthenticator
enabled, it can happen that a record of the default superuser
is created in the table with the can_login and is_superuser
columns set to null. This happens because the module in charge of
creating the row is the role manager, while the module in charge of
setting the default password salted hash value is the password
authenticator. Those two modules are started together; in the case
where the password authenticator finishes its initialization first,
the row contains those null columns until the role manager completes
its initialization, and any login attempt in this period
will cause a memory access violation, since those columns are never
expected to be null. This patch removes the race by starting
the password authenticator and authorizer only after the role manager
has finished its initialization.

Tests:
  1. Unit tests (release)
  2. Auth and cqlsh auth related dtests.

Fixes #4226

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190714124839.8392-1-eliransin@scylladb.com>
(cherry picked from commit 997a146c7f)
2019-07-15 21:18:05 +03:00
Takuya ASADA
eb49fae020 reloc: provide libthread_db.so.1 to debug thread on gdb
In scylla-debuginfo package, we have /usr/lib/debug/opt/scylladb/libreloc/libthread_db-1.0.so-666.development-0.20190711.73a1978fb.el7.x86_64.debug
but we actually do not have libthread_db.so.1 in /opt/scylladb/libreloc
since it does not appear in the ldd result for the scylla binary.

To debug threads, we need to add the library to the relocatable package manually.

Fixes #4673

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190711111058.7454-1-syuu@scylladb.com>
(cherry picked from commit 842f75d066)
2019-07-15 14:50:56 +03:00
Asias He
92bf928170 repair: Allow repair when a replica is down
Since commit bb56653 (repair: Sync schema from follower nodes before
repair), the behaviour of handling down node during repair has been
changed.  That is, if a repair follower is down, it will fail to sync
schema with it and the repair of the range will be skipped. This means
a range can not be repaired unless all the nodes for the replicas are up.

To fix, we filter out the nodes that are down, mark the repair as
partial, and repair with the nodes that are still up.

Tests: repair_additional_test:RepairAdditionalTest.repair_with_down_nodes_2b_test
Fixes: #4616
Backports: 3.1

Message-Id: <621572af40335cf5ad222c149345281e669f7116.1562568434.git.asias@scylladb.com>
(cherry picked from commit 39ca044dab)
2019-07-11 11:44:49 +03:00
Rafael Ávila de Espíndola
deac0b0e94 mc writer: Fix exception safety when closing _index_writer
This fixes a possible cause of #4614.

From the backtrace in that issue, it looks like a file is being closed
twice. The first point in the backtrace where that seems likely is in
the MC writer.

My first idea was to add a writer::close and make it the responsibility
of the code using the writer to call it. That way we would move work
out of the destructor.

That is a bit hard since the writer is destroyed from
flat_mutation_reader::impl::~consumer_adapter and that would need to
get a close function too.

This patch instead just fixes an exception safety issue. If
_index_writer->close() throws, _index_writer is still valid and
~writer will try to close it again.

If the exception was thrown after _completed.set_value(), that would
explain the assert about _completed.set_value() being called twice.

With this patch the path outside of the destructor now moves the
writer to a local variable before trying to close it.

Fixes #4614
Message-Id: <20190710171747.27337-1-espindola@scylladb.com>

(cherry picked from commit 281f3a69f8)
2019-07-11 11:42:48 +03:00
kbr-
c294000113 Implement tuple_type_impl::to_string_impl. (#4645)
Resolves #4633.

Signed-off-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit 8995945052)
2019-07-08 11:08:10 +03:00
Avi Kivity
18bb2045aa Update seastar submodule
* seastar 4cdccae53b...82637dcab4 (1):
  > perftune.py: fix the i3 metal detection pattern

Ref #4057.
2019-07-02 13:49:21 +03:00
Avi Kivity
5e3276d08f Update seastar submodule to point to scylla-seastar.git
This allows us to add 3.1 specific patches to Seastar.
2019-07-02 13:47:50 +03:00
Piotr Sarna
acff367ea8 main: stop view builder conditionally
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.

Fixes #4589

(cherry picked from commit efa7951ea5)
2019-06-26 10:45:50 +03:00
Tomasz Grabiec
e39724a343 Merge "Sync schema before repair" from Asias
This series makes sure the new schema is propagated to the repair master
and follower nodes before repair.

Fixes #4575

* dev.git asias/repair_pull_schema_v2:
  migration_manager: Add sync_schema
  repair: Sync schema from follower nodes before repair

(cherry picked from commit 269e65a8db)
2019-06-26 09:35:42 +02:00
Asias He
31c4db83d8 repair: Avoid searching all the rows in to_repair_rows_on_wire
The repair_rows in row_list are sorted, so the current repair_row can
only share its partition key with the last repair_row inserted into
repair_rows_on_wire. There is therefore no need to search from the
beginning of repair_rows_on_wire, which would make the complexity
quadratic. To fix, look only at the last item in repair_rows_on_wire.

Fixes #4580
Message-Id: <08a8bfe90d1a6cf16b67c210151245879418c042.1561001271.git.asias@scylladb.com>

(cherry picked from commit b99c75429a)
2019-06-25 12:48:37 +02:00
Tomasz Grabiec
433cb93f7a Merge "Use same schema version for repair nodes" from Asias
This patch set fixes repair nodes using different schema versions, and
optimizes the hashing thanks to the fact that all nodes now use the
same schema version.

Fixes: #4549

* seastar-dev.git asias/repair_use_same_schema.v3:
  repair: Use the same schema version for repair master and followers
  repair: Hash column kind and id instead of column name and type name

(cherry picked from commit cd1ff1fe02)
2019-06-23 20:57:16 +03:00
Avi Kivity
f553819919 Merge "Fix infinite paging for indexed queries" from Piotr
"
Fixes #4569

This series fixes the infinite paging for indexed queries issue.
Before this fix, paging indexes tended to end up in an infinite loop
of returning pages with 0 results, but has_more_pages flag set to true,
which confused the drivers.

Tests: unit(dev)
Branches: 3.0, 3.1
"

* 'fix_infinite_paging_for_indexed_queries' of https://github.com/psarna/scylla:
  tests: add test case for finishing index paging
  cql3: fix infinite paging for indexed queries

(cherry picked from commit 9229afe64f)
2019-06-23 20:54:27 +03:00
Nadav Har'El
48c34e7635 storage_proxy: fix race and crash in case of MV and other node shutdown
Recently, in merge commit 2718c90448,
we added the ability to cancel pending view-update requests when we detect
that the target node went down. This is important for view updates because
these have a very long timeout (5 minutes), and we wanted to make this
timeout even longer.

However, the implementation caused a race: Between *creating* the update's
request handler (create_write_response_handler()) and actually starting
the request with this handler (mutate_begin()), there is a preemption point
and we may end up deleting the request handler before starting the request.
So mutate_begin() must gracefully handle the case of a missing request
handler, and not crash with a segmentation fault as it did before this patch.

Eventually the lifetime management of request handlers could be refactored
to avoid this delicate fix (which requires more comments to explain than
code), or even better, it would be more correct to cancel individual writes
when a node goes down, not drop the entire handler (see issue #4523).
However, for now, let's not do such invasive changes and just fix the
bug that we set out to fix.

Fixes #4386.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190620123949.22123-1-nyh@scylladb.com>
(cherry picked from commit 6e87bca65d)
2019-06-23 20:53:01 +03:00
Nadav Har'El
7f85b30941 Fix deciding whether a query uses indexing
The code that decides whether a query should use indexing was buggy: a
partition key index might have influenced the decision even if the whole
partition key was passed in the query (which effectively means that
using the index is not necessary).

Fixes #4539

Closes https://github.com/scylladb/scylla/pull/4544

Merged from branch 'fix_deciding_whether_a_query_uses_indexing' of git://github.com/psarna/scylla
  tests: add case for partition key index and filtering
  cql3: fix deciding if a query uses indexing

(cherry picked from commit 6aab1a61be)
2019-06-18 13:25:18 +03:00
Hagit Segev
7d14514b8a release: prepare for 3.1.0.rc2 2019-06-16 20:28:31 +03:00
Piotr Sarna
35f906f06f tests: add a test case for filtering clustering key
The test case makes sure that clustering key restriction
columns are fetched for filtering if they form a clustering key prefix,
but not a primary key prefix (partition key columns are missing).

Ref #4541
Message-Id: <3612dc1c6c22c59ac9184220a2e7f24e8d18407c.1560410018.git.sarna@scylladb.com>

(cherry picked from commit 2c2122e057)
2019-06-16 14:36:52 +03:00
Piotr Sarna
2c50a484f5 cql3: fix qualifying clustering key restrictions for filtering
Clustering key restrictions can sometimes avoid filtering if they form
a prefix, but that can happen only if the whole partition key is
restricted as well.

Ref #4541
Message-Id: <9656396ee831e29c2b8d3ad4ef90c4a16ab71f4b.1560410018.git.sarna@scylladb.com>

(cherry picked from commit c4b935780b)
2019-06-16 14:36:52 +03:00
Piotr Sarna
24ddb46707 cql3: fix fetching clustering key columns for filtering
When a column is not present in the select clause, but used for
filtering, it usually needs to be fetched from replicas.
Sometimes it can be avoided, e.g. if primary key columns form a valid
prefix - then, they will be optimized out before filtering itself.
However, a clustering key prefix can only qualify for this
optimization if the whole partition key is restricted - otherwise
the clustering columns still need to be present for filtering.

This commit also fixes tests in cql_query_test suite, because they now
expect more values - columns fetched for filtering will be present as
well (only internally, the clients receive only data they asked for).

Fixes #4541
Message-Id: <f08ebae5562d570ece2bb7ee6c84e647345dfe48.1560410018.git.sarna@scylladb.com>

(cherry picked from commit adeea0a022)
2019-06-16 14:36:52 +03:00
Dejan Mircevski
f2fc3f32af tests: Add cquery_nofail() utility
Most tests await the result of cql_test_env::execute_cql().  Most
would also benefit from reporting errors with top-level location
included.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit a9849ecba7)
2019-06-16 14:36:52 +03:00
Asias He
c9f488ddc2 repair: Avoid writing row with same partition key and clustering key more than once
Consider

   master: row(pk=1, ck=1, col=10)
follower1: row(pk=1, ck=1, col=20)
follower2: row(pk=1, ck=1, col=30)

When repair runs, the master fetches row(pk=1, ck=1, col=20) and
row(pk=1, ck=1, col=30) from follower1 and follower2.

Then the repair master sends row(pk=1, ck=1, col=10) and row(pk=1, ck=1,
col=30) to follower1; follower1 will write the row with the same
pk=1, ck=1 twice, which violates the uniqueness constraint.

To fix, we apply a row with the same pk and ck onto the previous row.
We only need this on the repair follower, because there the rows can
come from multiple nodes. On the repair master we have an sstable writer
per follower, so the rows fed into each sstable writer come from only a
single node.

Tests: repair_additional_test.py:RepairAdditionalTest.repair_same_row_diff_value_3nodes_test
Fixes: #4510
Message-Id: <cb4fbba1e10fb0018116ffe5649c0870cda34575.1560405722.git.asias@scylladb.com>
(cherry picked from commit 9079790f85)
2019-06-16 10:23:58 +03:00
Asias He
46498e77b8 repair: Allow repair_row to initialize partially
On the repair follower node, only the decorated_key_with_hash and the
mutation_fragment inside repair_row are used in apply_rows() to apply
the rows to disk. Allow repair_row to be initialized partially, and, to
be safe, throw if an uninitialized member is accessed.
Message-Id: <b4e5cc050c11b1bafcf997076a3e32f20d059045.1560405722.git.asias@scylladb.com>

(cherry picked from commit 912ce53fc5)
2019-06-16 10:23:50 +03:00
Piotr Jastrzebski
440f33709e sstables: distinguish empty and missing cellpath
Before this patch mc sstables writer was ignoring
empty cellpaths. This is a wrong behaviour because
it is possible to have empty key in a map. In such case,
our writer creats a wrong sstable that we can't read back.
This is becaus a complex cell expects cellpath for each
simple cell it has. When writer ignores empty cellpath
it writes nothing and instead it should write a length
of zero to the file so that we know there's an empty cellpath.

Fixes #4533

Tests: unit(release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <46242906c691a56a915ca5994b36baf87ee633b7.1560532790.git.piotr@scylladb.com>
(cherry picked from commit a41c9763a9)
2019-06-16 09:04:24 +03:00
Pekka Enberg
34696e1582 dist/docker: Switch to 3.1 release repository 2019-06-14 08:10:02 +03:00
Takuya ASADA
43bb290705 dist/docker/redhat: change user of scylla services to 'scylla'
On branch-3.1 / master, we are getting the following errors:

ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/data: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/hints: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/commitlog: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/view_hints: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)

It seems that owner verification of the data directory fails because
the scylla-server process runs as root while the data directory is
owned by scylla, so we should run the services as the scylla user.

Fixes #4536
Message-Id: <20190611113142.23599-1-syuu@scylladb.com>

(cherry picked from commit b1226fb15a)
2019-06-14 08:02:45 +03:00
Calle Wilund
53980816de api.hh: Fix bool parsing in req_param
Fixes #4525

req_param uses boost::lexical_cast to convert text to the target type.
However, lexical_cast does not handle textual booleans,
so param=true causes not only wrong values but exceptions.
Message-Id: <20190610140511.15478-1-calle@scylladb.com>
(cherry picked from commit 26702612f3)
2019-06-13 11:56:27 +03:00
Vlad Zolotarov
c1f4617530 fix_system_distributed_tables.py: declare the 'port' argument as 'int'
If a port value is passed as a string, cluster.connect() fails
with Python 3.4.

Let's fix this by explicitly declaring the 'port' argument as 'int'.

Fixes #4527

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190606133321.28225-1-vladz@scylladb.com>
(cherry picked from commit 20a610f6bc)
2019-06-13 11:45:54 +03:00
Raphael S. Carvalho
efde9416ed sstables: fix log of failure on large data entry deletion by fixing use-after-move
Fixes #4532.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190527200828.25339-1-raphaelsc@scylladb.com>
(cherry picked from commit 62aa0ea3fa)
2019-06-13 11:44:22 +03:00
Hagit Segev
224f9cee7e release: prepare for 3.1.0.rc1 2019-06-06 18:16:06 +03:00
Hagit Segev
cd1d13f805 release: prepare for 3.1.rc1 2019-06-06 15:32:54 +03:00
Pekka Enberg
899291bc9b relocate_python_scripts.py: Fix node-exporter install on Debian variants
The relocatable Python is built from Fedora packages. Unfortunately TLS
certificates are in a different location on Debian variants, which
causes "node_exporter_install" to fail as follows:

  Traceback (most recent call last):
    File "/usr/lib/scylla/libexec/node_exporter_install", line 58, in <module>
      data = curl('https://github.com/prometheus/node_exporter/releases/download/v{version}/node_exporter-{version}.linux-amd64.tar.gz'.format(version=VERSION), byte=True)
    File "/usr/lib/scylla/scylla_util.py", line 40, in curl
      with urllib.request.urlopen(req) as res:
    File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 222, in urlopen
      return opener.open(url, data, timeout)
    File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 525, in open
      response = self._open(req, data)
    File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 543, in _open
      '_open', req)
    File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 503, in _call_chain
      result = func(*args)
    File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 1360, in https_open
      context=self._context, check_hostname=self._check_hostname)
    File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 1319, in do_open
      raise URLError(err)
  urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
  Unable to retrieve version information
  node exporter setup failed.

Fix the problem by overriding the SSL_CERT_FILE environment variable to
point to the correct location of the TLS bundle.

Message-Id: <20190604175434.24534-1-penberg@scylladb.com>
(cherry picked from commit eb00095bca)
2019-06-05 22:20:06 +03:00
Paweł Dziepak
4130973f51 tests/perf_fast_forward: report average number of aio operations
perf_fast_forward is used to detect performance regressions. The two
main metrics used for this are fargments per second and the number of
the IO operations. The former is a median of a several runs, but the
latter is just the actual number of asynchronous IO operations performed
in the run that happened to be picked as a median frag/s-wise. There's
no always a direct correlation between frag/s and aio and the latter can
vary which makes the latter hard to compare.

In order to make this easier a new metric was introduced: "average aio"
which reports the average number of asynchronous IO operations performed
in a run. This should produce much more stable results and therefore
make the comparison more meaningful.
Message-Id: <20190430134401.19238-1-pdziepak@scylladb.com>

(cherry picked from commit 51e98e0e11)
2019-06-05 16:36:09 +03:00
Takuya ASADA
24e2c72888 dist/debian: support relocatable python3 on Debian variants
Unlike CentOS, Debian variants have a python3 package in the official
repository, so we don't have to use relocatable python3 on these
distributions. However, the official python3 version differs between
distributions, which may cause issues. Also, our scripts and packaging
implementation increasingly presuppose the existence of relocatable
python3, which causes issues on Debian variants.

Switching to relocatable python3 on Debian variants avoids these issues
and makes it easier to manage Scylla python3 environments across
multiple distributions.

Fixes #4495

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190531112707.20082-1-syuu@scylladb.com>
(cherry picked from commit 25112408a7)
2019-06-03 17:42:26 +03:00
Raphael S. Carvalho
69cc7d89c8 compaction: do not unconditionally delete a new sstable in interrupted compaction
After incremental compaction, new sstables may have already replaced old
sstables at any point, meaning that a new sstable may be in use by the
table and an old sstable already deleted while the compaction itself is
UNFINISHED. Therefore, we should *NEVER* unconditionally delete a new
sstable for an interrupted compaction, or data loss could happen.
To fix it, we'll only delete new sstables that didn't replace anything
in the table, meaning they are unused.

Found the problem while auditing the code.

Fixes #4479.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190506134723.16639-1-raphaelsc@scylladb.com>
(cherry picked from commit ef5681486f)
2019-06-03 16:00:20 +03:00
Avi Kivity
5f6c5d566a Revert "dist/debian: support relocatable python3 on Debian variants"
This reverts commit 1fbab82553. Breaks build_deb.sh:

18:39:56 +	seastar/scripts/perftune.py seastar/scripts/seastar-addr2line seastar/scripts/perftune.py
18:39:56 Traceback (most recent call last):
18:39:56   File "./relocate_python_scripts.py", line 116, in <module>
18:39:56     fixup_scripts(archive, args.scripts)
18:39:56   File "./relocate_python_scripts.py", line 104, in fixup_scripts
18:39:56     fixup_script(output, script)
18:39:56   File "./relocate_python_scripts.py", line 79, in fixup_script
18:39:56     orig_stat = os.stat(script)
18:39:56 FileNotFoundError: [Errno 2] No such file or directory: '/data/jenkins/workspace/scylla-master/unified-deb/scylla/build/debian/scylla-package/+'
18:39:56 make[1]: *** [debian/rules:19: override_dh_auto_install] Error 1
2019-05-29 14:00:29 +03:00
Takuya ASADA
f32aea3834 reloc/python3: add license files on relocatable python3 package
It's better to have license files on our python3 distribution.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190516094329.13273-1-syuu@scylladb.com>
(cherry picked from commit 4b08a3f906)
2019-05-29 13:59:38 +03:00
Takuya ASADA
933260cb53 dist/ami: output scylla version information to AMI tags and description
Users may want to know which versions of the packages are used for the
AMI; it's good to have them in the AMI tags and description.

To do this, we need to download the .rpm from the specified .repo and
extract the version information from it.

Fixes #4499

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190520123924.14060-2-syuu@scylladb.com>
(cherry picked from commit a55330a10b)
2019-05-29 13:59:38 +03:00
Takuya ASADA
f8ff0e1993 dist/ami: build scylla-python3 when specified --localrpm
Since we switched to relocatable python3, we need to build it for AMI too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190520123924.14060-1-syuu@scylladb.com>
(cherry picked from commit abe44c28c5)
2019-05-29 13:59:38 +03:00
Takuya ASADA
1fbab82553 dist/debian: support relocatable python3 on Debian variants
Unlike CentOS, Debian variants have a python3 package in the official
repository, so we don't have to use relocatable python3 on these
distributions. However, the official python3 version differs between
distributions, which may cause issues. Also, our scripts and packaging
implementation increasingly presuppose the existence of relocatable
python3, which causes issues on Debian variants.

Switching to relocatable python3 on Debian variants avoids these issues
and makes it easier to manage Scylla python3 environments across
multiple distributions.

Fixes #4495

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190526105138.677-1-syuu@scylladb.com>
(cherry picked from commit 4d119cbd6d)
2019-05-26 15:40:56 +03:00
Paweł Dziepak
c664615960 Merge "Fix empty counters handling in MC" from Piotr
"
Before this patchset, empty counters were incorrectly persisted in the
MC format: no value was written to disk for them. The correct way
is to still write a header that indicates the counter is empty.

We also need to make sure that reading wrongly persisted empty
counters works, because customers may have sstables with such
counters.

Fixes #4363
"

* 'haaawk/4363/v3' of github.com:scylladb/seastar-dev:
  sstables: add test for empty counters
  docs: add CorrectEmptyCounters to sstable-scylla-format
  sstables: Add a feature for empty counters in Scylla.db.
  sstables: Write header for empty counters
  sstables: Remove unused variables in make_counter_cell
  sstables: Handle empty counter value in read path

(cherry picked from commit 899ebe483a)
2019-05-23 22:15:00 +03:00
Benny Halevy
6a682dc5a2 cql3: select_statement: provide default initializer for parameters::_bypass_cache
Fixes #4503

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190521143300.22753-1-bhalevy@scylladb.com>
(cherry picked from commit fae4ca756c)
2019-05-23 08:28:01 +03:00
Gleb Natapov
c1271d08d3 cache_hitrate_calculator: make cache hitrate calculation preemptable
The calculation is done in a non-preemptable loop over all tables, so
if the number of tables is very large it may take a while, since we
also build a string for gossiper state. Make the loop preemptable, and
also make the string calculation more efficient by preallocating memory
for it.
Message-Id: <20190516132748.6469-3-gleb@scylladb.com>

(cherry picked from commit 31bf4cfb5e)
2019-05-17 12:38:34 +02:00
Gleb Natapov
0d5c2501b3 cache_hitrate_calculator: do not copy stats map for each cpu
invoke_on_all() copies the provided function for each shard it is
executed on, so by moving the stats map into the capture we copy it for
each shard too. Avoid this by putting it into the top-level object,
which is already captured by reference.
Message-Id: <20190516132748.6469-2-gleb@scylladb.com>

(cherry picked from commit 4517c56a57)
2019-05-17 12:38:30 +02:00
Asias He
0dd84898ee repair: Fix use after free in remove_repair_meta for repair_metas
We should capture repair_metas so that it will not be freed until the
parallel_for_each is finished.

Fixes: #4333
Tests: repair_additional_test.py:RepairAdditionalTest.repair_kill_1_test
Message-Id: <237b20a359122a639330f9f78c67568410aef014.1557922403.git.asias@scylladb.com>
(cherry picked from commit 51c4f8cc47)
2019-05-16 11:12:09 +03:00
Avi Kivity
d568270d7f Merge "gc_clock: Fix hashing to be backwards-compatible" from Tomasz
"
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.

Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.

This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.

Fixes #4460.
Refs #4485.
"

* tag 'fix-gc_clock-digest-v2.1' of github.com:tgrabiec/scylla:
  tests: Add test which verifies that schema digest stays the same
  tests: Add sstables for the schema digest test
  schema_tables, storage_service: Make schema digest insensitive to expired tombstones in empty partition
  db/schema_tables: Move feed_hash_for_schema_digest() to .cc file
  hashing: Introduce type-erased interface for the hasher
  hashing: Introduce C++ concept for the hasher
  hashers: Rename hasher to cryptopp_hasher
  gc_clock: Fix hashing to be backwards-compatible

(cherry picked from commit 82b91c1511)
2019-05-15 09:48:05 +03:00
Takuya ASADA
78c57f18c4 dist/ami: fix wrong path of SCYLLA-PRODUCT-FILE
Since the other build_*.sh scripts run inside an extracted relocatable
package, they have SCYLLA-PRODUCT-FILE at the top of the directory,
but build_ami.sh does not run in such conditions; we need to run
SCYLLA-VERSION-GEN first, then refer to build/SCYLLA-PRODUCT-FILE.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190509110621.27468-1-syuu@scylladb.com>
(cherry picked from commit 19a973cd05)
2019-05-13 16:45:25 +03:00
Glauber Costa
ce27949797 Support AWS i3en instances
AWS just released their new instances, the i3en instances. The instance
is already verified to work well with Scylla; the only adjustments we
need are to advertise that we support it and to pre-fill the disk
information according to the performance numbers obtained by running
the instance.

Fixes #4486
Branches: 3.1

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190508170831.6003-1-glauber@scylladb.com>
(cherry picked from commit a23531ebd5)
2019-05-13 16:45:25 +03:00
Hagit Segev
6b47e23d29 release: prepare for 3.1.0.rc0 2019-05-13 15:03:34 +03:00
Piotr Sarna
1cb6cc0ac4 Revert "view: cache is_index for view pointer"
This reverts commit dbe8491655.
Caching the value was not done in a correct manner, which resulted
in longevity tests failures.

Fixes #4478

Branches: 3.1

Message-Id: <762ca9db618ca2ed7702372fbafe8ecd193dcf4d.1557129652.git.sarna@scylladb.com>
(cherry picked from commit cf8d2a5141)
2019-05-08 11:14:11 +03:00
Benny Halevy
67435eff15 time_window_backlog_tracker: fix use after free
Fixes #4465

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190430094209.13958-1-bhalevy@scylladb.com>
(cherry picked from commit 3a2fa82d6e)
2019-05-06 09:38:08 +03:00
Gleb Natapov
086ce13fb9 batchlog_manager: fix array out of bound access
endpoint_filter() function assumes that each bucket of
std::unordered_multimap contains elements with the same key only, so
its size can be used to know how many elements with a particular key
are there.  But this is not the case, elements with multiple keys may
share a bucket. Fix it by counting keys in other way.

Fixes #3229

Message-Id: <20190501133127.GE21208@scylladb.com>
(cherry picked from commit 95c6d19f6c)
2019-05-03 11:59:09 +03:00
Glauber Costa
eb9a8f4442 scylla_setup: respect user's decision not to call housekeeping
The setup script asks the user whether or not housekeeping should
be called, and the first time the script is executed this decision
is respected.

However, if the script is invoked again, that decision is not respected.

This is because the check has the form:

 if (housekeeping_cfg_file_exists) {
    version_check = ask_user();
 }
 if (version_check) { do_version_check() } else { dont_do_it() }

When it should have the form:

 if (housekeeping_cfg_file_exists) {
    version_check = ask_user();
    if (version_check) { do_version_check() } else { dont_do_it() }
 }

(Thanks python)

This is problematic on systems that are not connected to the internet,
since housekeeping will fail to run and crash the setup script.

Fixes #4462

Branches: master, branch-3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190502034211.18435-1-glauber@scylladb.com>
(cherry picked from commit 47d04e49e8)
2019-05-03 09:57:31 +03:00
Glauber Costa
178fb5fe5f make scylla_util OS detection robust against empty lines
Newer versions of RHEL ship the os-release file with blank lines at the
end, which our script was not prepared to handle. As such, scylla_setup
would fail.

This patch makes our OS detection robust against that.

Fixes #4473

Branches: master, branch-3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190502152224.31307-1-glauber@scylladb.com>
(cherry picked from commit 99c00547ad)
2019-05-03 09:57:21 +03:00
355 changed files with 5055 additions and 1267 deletions

.gitmodules

@@ -1,6 +1,6 @@
 [submodule "seastar"]
 	path = seastar
-	url = ../seastar
+	url = ../scylla-seastar
 	ignore = dirty
 [submodule "swagger-ui"]
 	path = swagger-ui


@@ -1,7 +1,7 @@
 #!/bin/sh
 PRODUCT=scylla
-VERSION=666.development
+VERSION=3.1.3
 if test -f version
 then


@@ -22,6 +22,7 @@
 #pragma once
 #include <seastar/json/json_elements.hh>
 #include <type_traits>
+#include <boost/lexical_cast.hpp>
 #include <boost/algorithm/string/split.hpp>
 #include <boost/algorithm/string/classification.hpp>
@@ -231,7 +232,22 @@ public:
             return;
         }
         try {
-            value = T{boost::lexical_cast<Base>(param)};
+            // boost::lexical_cast does not use boolalpha. Converting a
+            // true/false throws exceptions. We don't want that.
+            if constexpr (std::is_same_v<Base, bool>) {
+                // Cannot use boolalpha because we (probably) want to
+                // accept 1 and 0 as well as true and false. And True. And fAlse.
+                std::transform(param.begin(), param.end(), param.begin(), ::tolower);
+                if (param == "true" || param == "1") {
+                    value = T(true);
+                } else if (param == "false" || param == "0") {
+                    value = T(false);
+                } else {
+                    throw boost::bad_lexical_cast{};
+                }
+            } else {
+                value = T{boost::lexical_cast<Base>(param)};
+            }
         } catch (boost::bad_lexical_cast&) {
             throw bad_param_exception(format("{} ({}): type error - should be {}", name, param, boost::units::detail::demangle(typeid(Base).name())));
         }


@@ -170,7 +170,9 @@ future<> service::start() {
     return once_among_shards([this] {
         return create_keyspace_if_missing();
     }).then([this] {
-        return when_all_succeed(_role_manager->start(), _authorizer->start(), _authenticator->start());
+        return _role_manager->start().then([this] {
+            return when_all_succeed(_authorizer->start(), _authenticator->start());
+        });
     }).then([this] {
         _permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
     }).then([this] {


@@ -61,6 +61,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
     // - _last_row points at a direct predecessor of the next row which is going to be read.
     //   Used for populating continuity.
     // - _population_range_starts_before_all_rows is set accordingly
+    // - _underlying is engaged and fast-forwarded
     reading_from_underlying,
     end_of_stream
@@ -99,7 +100,13 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
     // forward progress is not guaranteed in case iterators are getting constantly invalidated.
     bool _lower_bound_changed = false;
+    // Points to the underlying reader conforming to _schema,
+    // either to *_underlying_holder or _read_context->underlying().underlying().
+    flat_mutation_reader* _underlying = nullptr;
+    std::optional<flat_mutation_reader> _underlying_holder;
     future<> do_fill_buffer(db::timeout_clock::time_point);
+    future<> ensure_underlying(db::timeout_clock::time_point);
     void copy_from_cache_to_buffer();
     future<> process_static_row(db::timeout_clock::time_point);
     void move_to_end();
@@ -186,23 +193,22 @@ future<> cache_flat_mutation_reader::process_static_row(db::timeout_clock::time_
         return make_ready_future<>();
     } else {
         _read_context->cache().on_row_miss();
-        return _read_context->get_next_fragment(timeout).then([this] (mutation_fragment_opt&& sr) {
-            if (sr) {
-                assert(sr->is_static_row());
-                maybe_add_to_cache(sr->as_static_row());
-                push_mutation_fragment(std::move(*sr));
-            }
-            maybe_set_static_row_continuous();
+        return ensure_underlying(timeout).then([this, timeout] {
+            return (*_underlying)(timeout).then([this] (mutation_fragment_opt&& sr) {
+                if (sr) {
+                    assert(sr->is_static_row());
+                    maybe_add_to_cache(sr->as_static_row());
+                    push_mutation_fragment(std::move(*sr));
+                }
+                maybe_set_static_row_continuous();
+            });
         });
     }
 }

 inline
 void cache_flat_mutation_reader::touch_partition() {
-    if (_snp->at_latest_version()) {
-        rows_entry& last_dummy = *_snp->version()->partition().clustered_rows().rbegin();
-        _snp->tracker()->touch(last_dummy);
-    }
+    _snp->touch();
 }
inline
@@ -232,14 +238,36 @@ future<> cache_flat_mutation_reader::fill_buffer(db::timeout_clock::time_point t
});
}
inline
future<> cache_flat_mutation_reader::ensure_underlying(db::timeout_clock::time_point timeout) {
if (_underlying) {
return make_ready_future<>();
}
return _read_context->ensure_underlying(timeout).then([this, timeout] {
flat_mutation_reader& ctx_underlying = _read_context->underlying().underlying();
if (ctx_underlying.schema() != _schema) {
_underlying_holder = make_delegating_reader(ctx_underlying);
_underlying_holder->upgrade_schema(_schema);
_underlying = &*_underlying_holder;
} else {
_underlying = &ctx_underlying;
}
});
}
inline
future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_point timeout) {
if (_state == state::move_to_underlying) {
if (!_underlying) {
return ensure_underlying(timeout).then([this, timeout] {
return do_fill_buffer(timeout);
});
}
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
- return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
+ return _underlying->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
return read_from_underlying(timeout);
});
}
@@ -280,7 +308,7 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
inline
future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::time_point timeout) {
- return consume_mutation_fragments_until(_read_context->underlying().underlying(),
+ return consume_mutation_fragments_until(*_underlying,
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
_read_context->cache().on_row_miss();

View File

@@ -596,6 +596,7 @@ scylla_core = (['database.cc',
'db/consistency_level.cc',
'db/system_keyspace.cc',
'db/system_distributed_keyspace.cc',
'db/size_estimates_virtual_reader.cc',
'db/schema_tables.cc',
'db/cql_type_parser.cc',
'db/legacy_schema_migrator.cc',

View File

@@ -130,6 +130,18 @@ query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<ser
}
query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<service::pager::paging_state> paging_state, int32_t page_size)
: query_options(qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
qo->_skip_metadata,
std::move(query_options::specific_options{page_size, paging_state, qo->_options.serial_consistency, qo->_options.timestamp}),
qo->_cql_serialization_format) {
}
query_options::query_options(std::vector<cql3::raw_value> values)
: query_options(
db::consistency_level::ONE, infinite_timeout_config, std::move(values))

View File

@@ -102,7 +102,7 @@ private:
public:
query_options(query_options&&) = default;
- query_options(const query_options&) = delete;
+ explicit query_options(const query_options&) = default;
explicit query_options(db::consistency_level consistency,
const timeout_config& timeouts,
@@ -155,6 +155,7 @@ public:
explicit query_options(db::consistency_level, const timeout_config& timeouts,
std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(std::unique_ptr<query_options>, ::shared_ptr<service::pager::paging_state> paging_state);
explicit query_options(std::unique_ptr<query_options>, ::shared_ptr<service::pager::paging_state> paging_state, int32_t page_size);
const timeout_config& get_timeout_config() const { return _timeout_config; }

View File

@@ -222,11 +222,9 @@ statement_restrictions::statement_restrictions(database& db,
auto& cf = db.find_column_family(schema);
auto& sim = cf.get_index_manager();
const allow_local_index allow_local(!_partition_key_restrictions->has_unrestricted_components(*_schema) && _partition_key_restrictions->is_all_eq());
- bool has_queriable_clustering_column_index = _clustering_columns_restrictions->has_supporting_index(sim, allow_local);
- bool has_queriable_pk_index = _partition_key_restrictions->has_supporting_index(sim, allow_local);
- bool has_queriable_index = has_queriable_clustering_column_index
- || has_queriable_pk_index
- || _nonprimary_key_restrictions->has_supporting_index(sim, allow_local);
+ const bool has_queriable_clustering_column_index = _clustering_columns_restrictions->has_supporting_index(sim, allow_local);
+ const bool has_queriable_pk_index = _partition_key_restrictions->has_supporting_index(sim, allow_local);
+ const bool has_queriable_regular_index = _nonprimary_key_restrictions->has_supporting_index(sim, allow_local);
// At this point, the select statement is fully constructed, but we still have a few things to validate
process_partition_key_restrictions(has_queriable_pk_index, for_view, allow_filtering);
@@ -286,7 +284,7 @@ statement_restrictions::statement_restrictions(database& db,
}
if (!_nonprimary_key_restrictions->empty()) {
- if (has_queriable_index) {
+ if (has_queriable_regular_index) {
_uses_secondary_indexing = true;
} else if (!allow_filtering) {
throw exceptions::invalid_request_exception("Cannot execute this query as it might involve data filtering and "
@@ -392,8 +390,9 @@ std::vector<const column_definition*> statement_restrictions::get_column_defs_fo
}
}
}
- if (_clustering_columns_restrictions->needs_filtering(*_schema)) {
- column_id first_filtering_id = _schema->clustering_key_columns().begin()->id +
+ const bool pk_has_unrestricted_components = _partition_key_restrictions->has_unrestricted_components(*_schema);
+ if (pk_has_unrestricted_components || _clustering_columns_restrictions->needs_filtering(*_schema)) {
+ column_id first_filtering_id = pk_has_unrestricted_components ? 0 : _schema->clustering_key_columns().begin()->id +
_clustering_columns_restrictions->num_prefix_columns_that_need_not_be_filtered();
for (auto&& cdef : _clustering_columns_restrictions->get_column_defs()) {
if (cdef->id >= first_filtering_id && !column_uses_indexing(cdef)) {
@@ -507,10 +506,9 @@ bool statement_restrictions::need_filtering() const {
int number_of_filtering_restrictions = _nonprimary_key_restrictions->size();
// If the whole partition key is restricted, it does not imply filtering
if (_partition_key_restrictions->has_unrestricted_components(*_schema) || !_partition_key_restrictions->is_all_eq()) {
- number_of_filtering_restrictions += _partition_key_restrictions->size();
- if (_clustering_columns_restrictions->has_unrestricted_components(*_schema)) {
- number_of_filtering_restrictions += _clustering_columns_restrictions->size() - _clustering_columns_restrictions->prefix_size();
- }
+ number_of_filtering_restrictions += _partition_key_restrictions->size() + _clustering_columns_restrictions->size();
+ } else if (_clustering_columns_restrictions->has_unrestricted_components(*_schema)) {
+ number_of_filtering_restrictions += _clustering_columns_restrictions->size() - _clustering_columns_restrictions->prefix_size();
}
return number_of_restricted_columns_for_indexing > 1
|| (number_of_restricted_columns_for_indexing == 0 && _partition_key_restrictions->empty() && !_clustering_columns_restrictions->empty())

View File

@@ -407,7 +407,7 @@ public:
}
bool ck_restrictions_need_filtering() const {
- return _clustering_columns_restrictions->needs_filtering(*_schema);
+ return _partition_key_restrictions->has_unrestricted_components(*_schema) || _clustering_columns_restrictions->needs_filtering(*_schema);
}
/**

View File

@@ -83,6 +83,9 @@ void metadata::maybe_set_paging_state(::shared_ptr<const service::pager::paging_
assert(paging_state);
if (paging_state->get_remaining() > 0) {
set_paging_state(std::move(paging_state));
} else {
_flags.remove<flag::HAS_MORE_PAGES>();
_paging_state = nullptr;
}
}
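[editor's note] The hunk above only advertises HAS_MORE_PAGES when the paging state still has rows remaining; otherwise both the flag and the stale state are dropped. A minimal, hedged sketch of that pattern, with hypothetical simplified stand-ins for the real `metadata`/`paging_state` types:

```cpp
#include <cassert>
#include <memory>

// Hypothetical, simplified stand-ins; the real classes carry much more state.
struct paging_state {
    int remaining;
};

struct metadata {
    bool has_more_pages = true;
    std::shared_ptr<const paging_state> state;

    // Only keep the paging state when rows remain; otherwise clear both
    // the flag and the state so clients stop requesting further pages.
    void maybe_set_paging_state(std::shared_ptr<const paging_state> ps) {
        if (ps->remaining > 0) {
            state = std::move(ps);
        } else {
            has_more_pages = false;
            state = nullptr;
        }
    }
};
```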

View File

@@ -142,7 +142,7 @@ shared_ptr<selector::factory>
selectable::with_field_selection::new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) {
auto&& factory = _selected->new_selector_factory(db, s, defs);
auto&& type = factory->new_instance()->get_type();
- auto&& ut = dynamic_pointer_cast<const user_type_impl>(std::move(type));
+ auto&& ut = dynamic_pointer_cast<const user_type_impl>(type->underlying_type());
if (!ut) {
throw exceptions::invalid_request_exception(
format("Invalid field selection: {} of type {} is not a user type",

View File

@@ -166,7 +166,8 @@ alter_type_statement::add_or_alter::add_or_alter(const ut_name& name, bool is_ad
user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_update) const
{
if (get_idx_of_field(to_update, _field_name)) {
- throw exceptions::invalid_request_exception(format("Cannot add new field {} to type {}: a field of the same name already exists", _field_name->name(), _name.to_string()));
+ throw exceptions::invalid_request_exception(format("Cannot add new field {} to type {}: a field of the same name already exists",
+ _field_name->to_string(), _name.to_string()));
}
std::vector<bytes> new_names(to_update->field_names());
@@ -174,7 +175,7 @@ user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_
std::vector<data_type> new_types(to_update->field_types());
auto&& add_type = _field_type->prepare(db, keyspace()).get_type();
if (add_type->references_user_type(to_update->_keyspace, to_update->_name)) {
- throw exceptions::invalid_request_exception(format("Cannot add new field {} of type {} to type {} as this would create a circular reference", _field_name->name(), _field_type->to_string(), _name.to_string()));
+ throw exceptions::invalid_request_exception(format("Cannot add new field {} of type {} to type {} as this would create a circular reference", _field_name->to_string(), _field_type->to_string(), _name.to_string()));
}
new_types.push_back(std::move(add_type));
return user_type_impl::get_instance(to_update->_keyspace, to_update->_name, std::move(new_names), std::move(new_types));
@@ -184,13 +185,14 @@ user_type alter_type_statement::add_or_alter::do_alter(database& db, user_type t
{
std::optional<uint32_t> idx = get_idx_of_field(to_update, _field_name);
if (!idx) {
- throw exceptions::invalid_request_exception(format("Unknown field {} in type {}", _field_name->name(), _name.to_string()));
+ throw exceptions::invalid_request_exception(format("Unknown field {} in type {}", _field_name->to_string(), _name.to_string()));
}
auto previous = to_update->field_types()[*idx];
auto new_type = _field_type->prepare(db, keyspace()).get_type();
if (!new_type->is_compatible_with(*previous)) {
- throw exceptions::invalid_request_exception(format("Type {} in incompatible with previous type {} of field {} in user type {}", _field_type->to_string(), previous->as_cql3_type().to_string(), _field_name->name(), _name.to_string()));
+ throw exceptions::invalid_request_exception(format("Type {} in incompatible with previous type {} of field {} in user type {}",
+ _field_type->to_string(), previous->as_cql3_type().to_string(), _field_name->to_string(), _name.to_string()));
}
std::vector<data_type> new_types(to_update->field_types());

View File

@@ -76,7 +76,7 @@ public:
const bool _is_distinct;
const bool _allow_filtering;
const bool _is_json;
- bool _bypass_cache;
+ bool _bypass_cache = false;
public:
parameters();
parameters(orderings_type orderings,

View File

@@ -440,8 +440,8 @@ indexed_table_select_statement::prepare_command_for_base_query(const query_optio
return cmd;
}
- future<shared_ptr<cql_transport::messages::result_message>>
- indexed_table_select_statement::execute_base_query(
+ future<foreign_ptr<lw_shared_ptr<query::result>>, lw_shared_ptr<query::read_command>>
+ indexed_table_select_statement::do_execute_base_query(
service::storage_proxy& proxy,
dht::partition_range_vector&& partition_ranges,
service::query_state& state,
@@ -492,22 +492,27 @@ indexed_table_select_statement::execute_base_query(
}).then([&merger]() {
return merger.get();
});
- }).then([this, &proxy, &state, &options, now, cmd, paging_state = std::move(paging_state)] (foreign_ptr<lw_shared_ptr<query::result>> result) mutable {
- return this->process_base_query_results(std::move(result), cmd, proxy, state, options, now, std::move(paging_state));
+ }).then([cmd] (foreign_ptr<lw_shared_ptr<query::result>> result) mutable {
+ return make_ready_future<foreign_ptr<lw_shared_ptr<query::result>>, lw_shared_ptr<query::read_command>>(std::move(result), std::move(cmd));
});
}
// Function for fetching the selected columns from a list of clustering rows.
// It is currently used only in our Secondary Index implementation - ordinary
// CQL SELECT statements do not have the syntax to request a list of rows.
// FIXME: The current implementation is very inefficient - it requests each
// row separately (and, incrementally, in parallel). Even multiple rows from a single
// partition are requested separately. This last case can be easily improved,
// but to implement the general case (multiple rows from multiple partitions)
// efficiently, we will need more support from other layers.
// Keys are ordered in token order (see #3423)
future<shared_ptr<cql_transport::messages::result_message>>
indexed_table_select_statement::execute_base_query(
service::storage_proxy& proxy,
dht::partition_range_vector&& partition_ranges,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state) {
return do_execute_base_query(proxy, std::move(partition_ranges), state, options, now, paging_state).then(
[this, &proxy, &state, &options, now, paging_state = std::move(paging_state)] (foreign_ptr<lw_shared_ptr<query::result>> result, lw_shared_ptr<query::read_command> cmd) {
return process_base_query_results(std::move(result), std::move(cmd), proxy, state, options, now, std::move(paging_state));
});
}
future<foreign_ptr<lw_shared_ptr<query::result>>, lw_shared_ptr<query::read_command>>
indexed_table_select_statement::do_execute_base_query(
service::storage_proxy& proxy,
std::vector<primary_key>&& primary_keys,
service::query_state& state,
@@ -562,9 +567,23 @@ indexed_table_select_statement::execute_base_query(
});
}).then([&merger] () {
return merger.get();
- }).then([this, &proxy, &state, &options, now, cmd, paging_state = std::move(paging_state)] (foreign_ptr<lw_shared_ptr<query::result>> result) mutable {
- return this->process_base_query_results(std::move(result), cmd, proxy, state, options, now, std::move(paging_state));
- });
+ }).then([cmd] (foreign_ptr<lw_shared_ptr<query::result>> result) mutable {
+ return make_ready_future<foreign_ptr<lw_shared_ptr<query::result>>, lw_shared_ptr<query::read_command>>(std::move(result), std::move(cmd));
+ });
}
future<shared_ptr<cql_transport::messages::result_message>>
indexed_table_select_statement::execute_base_query(
service::storage_proxy& proxy,
std::vector<primary_key>&& primary_keys,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state) {
return do_execute_base_query(proxy, std::move(primary_keys), state, options, now, paging_state).then(
[this, &proxy, &state, &options, now, paging_state = std::move(paging_state)] (foreign_ptr<lw_shared_ptr<query::result>> result, lw_shared_ptr<query::read_command> cmd) {
return process_base_query_results(std::move(result), std::move(cmd), proxy, state, options, now, std::move(paging_state));
});
}
@@ -868,6 +887,60 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
}
}
// Aggregated and paged filtering needs to aggregate the results from all pages
// in order to avoid returning partial per-page results (issue #4540).
// It's a little bit more complicated than regular aggregation, because each paging state
// needs to be translated between the base table and the underlying view.
// The routine below keeps fetching pages from the underlying view, which are then
// used to fetch base rows, which go straight to the result set builder.
// A local, internal copy of query_options is kept in order to keep updating
// the paging state between requesting data from replicas.
const bool aggregate = _selection->is_aggregate();
if (aggregate) {
const bool restrictions_need_filtering = _restrictions->need_filtering();
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format()), std::make_unique<cql3::query_options>(cql3::query_options(options)),
[this, &options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] (cql3::selection::result_set_builder& builder, std::unique_ptr<cql3::query_options>& internal_options) {
// page size is set to the internal count page size, regardless of the user-provided value
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), DEFAULT_COUNT_PAGE_SIZE));
return repeat([this, &builder, &options, &internal_options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] () {
auto consume_results = [this, &builder, &options, &internal_options, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
if (restrictions_need_filtering) {
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection,
cql3::selection::result_set_builder::restrictions_filter(_restrictions, options, cmd->row_limit, _schema, cmd->slice.partition_row_limit())));
} else {
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection));
}
};
if (whole_partitions || partition_slices) {
return find_index_partition_ranges(proxy, state, *internal_options).then(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (dht::partition_range_vector partition_ranges, ::shared_ptr<const service::pager::paging_state> paging_state) {
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? ::make_shared<service::pager::paging_state>(*paging_state) : nullptr));
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, std::move(paging_state)).then(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
});
});
} else {
return find_index_clustering_rows(proxy, state, *internal_options).then(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (std::vector<primary_key> primary_keys, ::shared_ptr<const service::pager::paging_state> paging_state) {
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? ::make_shared<service::pager::paging_state>(*paging_state) : nullptr));
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, std::move(paging_state)).then(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
});
});
}
}).then([this, &builder, restrictions_need_filtering] () {
auto rs = builder.build();
update_stats_rows_read(rs->size());
_stats.filtered_rows_matched_total += restrictions_need_filtering ? rs->size() : 0;
auto msg = ::make_shared<cql_transport::messages::result_message::rows>(result(std::move(rs)));
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(std::move(msg));
});
});
}
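[editor's note] The aggregate branch above keeps fetching pages into a single result-set builder and only builds the final result once the paging state reports nothing remaining, so a paged aggregate never returns partial per-page answers (issue #4540). A synchronous, hedged sketch of that loop, with hypothetical simplified types in place of the real builder and paging machinery:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical page: some rows plus how many rows remain server-side.
struct page {
    std::vector<int> rows;
    int remaining;
};

// Accumulate rows from *all* pages before producing the result,
// mirroring the repeat()/stop_iteration structure above.
std::vector<int> aggregate_all_pages(const std::function<page()>& fetch_next) {
    std::vector<int> builder;
    while (true) {
        page p = fetch_next();
        builder.insert(builder.end(), p.rows.begin(), p.rows.end());
        if (p.remaining <= 0) {
            return builder; // no more pages: build the final result
        }
    }
}
```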
if (whole_partitions || partition_slices) {
// In this case, can use our normal query machinery, which retrieves
// entire partitions or the same slice for many partitions.

View File

@@ -68,8 +68,8 @@ class select_statement : public cql_statement {
public:
using parameters = raw::select_statement::parameters;
using ordering_comparator_type = raw::select_statement::ordering_comparator_type;
- protected:
static constexpr int DEFAULT_COUNT_PAGE_SIZE = 10000;
+ protected:
static thread_local const ::shared_ptr<parameters> _default_parameters;
schema_ptr _schema;
uint32_t _bound_terms;
@@ -229,6 +229,14 @@ private:
lw_shared_ptr<query::read_command>
prepare_command_for_base_query(const query_options& options, service::query_state& state, gc_clock::time_point now, bool use_paging);
future<foreign_ptr<lw_shared_ptr<query::result>>, lw_shared_ptr<query::read_command>>
do_execute_base_query(
service::storage_proxy& proxy,
dht::partition_range_vector&& partition_ranges,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state);
future<shared_ptr<cql_transport::messages::result_message>>
execute_base_query(
service::storage_proxy& proxy,
@@ -238,6 +246,23 @@ private:
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state);
// Function for fetching the selected columns from a list of clustering rows.
// It is currently used only in our Secondary Index implementation - ordinary
// CQL SELECT statements do not have the syntax to request a list of rows.
// FIXME: The current implementation is very inefficient - it requests each
// row separately (and, incrementally, in parallel). Even multiple rows from a single
// partition are requested separately. This last case can be easily improved,
// but to implement the general case (multiple rows from multiple partitions)
// efficiently, we will need more support from other layers.
// Keys are ordered in token order (see #3423)
future<foreign_ptr<lw_shared_ptr<query::result>>, lw_shared_ptr<query::read_command>>
do_execute_base_query(
service::storage_proxy& proxy,
std::vector<primary_key>&& primary_keys,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state);
future<shared_ptr<cql_transport::messages::result_message>>
execute_base_query(
service::storage_proxy& proxy,

View File

@@ -32,7 +32,7 @@ tuples::component_spec_of(shared_ptr<column_specification> column, size_t compon
column->ks_name,
column->cf_name,
::make_shared<column_identifier>(format("{}[{:d}]", column->name, component), true),
- static_pointer_cast<const tuple_type_impl>(column->type)->type(component));
+ static_pointer_cast<const tuple_type_impl>(column->type->underlying_type())->type(component));
}
shared_ptr<term>

View File

@@ -70,7 +70,7 @@ public:
private:
void validate_assignable_to(database& db, const sstring& keyspace, shared_ptr<column_specification> receiver) {
- auto tt = dynamic_pointer_cast<const tuple_type_impl>(receiver->type);
+ auto tt = dynamic_pointer_cast<const tuple_type_impl>(receiver->type->underlying_type());
if (!tt) {
throw exceptions::invalid_request_exception(format("Invalid tuple type literal for {} of type {}", receiver->name, receiver->type->as_cql3_type()));
}

View File

@@ -260,6 +260,10 @@ void backlog_controller::adjust() {
float backlog_controller::backlog_of_shares(float shares) const {
size_t idx = 1;
// No control points means the controller is disabled.
if (_control_points.size() == 0) {
return 1.0f;
}
while ((idx < _control_points.size() - 1) && (_control_points[idx].output < shares)) {
idx++;
}
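[editor's note] The guard added above matters because the vector's `size()` is unsigned: with zero control points, `size() - 1` wraps around to a huge value and the loop would index out of bounds. A hedged sketch of just that hazard, with a hypothetical `control_point` type:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical control point; the real backlog_controller interpolates
// between (shares -> backlog) pairs.
struct control_point { float output; float backlog; };

// Index search guarded against the empty case. Without the early
// return, points.size() - 1 underflows (size_t) to SIZE_MAX and the
// while loop walks off the end of the vector.
std::size_t find_segment(const std::vector<control_point>& points, float shares) {
    if (points.empty()) {
        return 0; // caller treats this as "controller disabled"
    }
    std::size_t idx = 1;
    while (idx < points.size() - 1 && points[idx].output < shares) {
        ++idx;
    }
    return idx;
}
```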
@@ -1929,7 +1933,7 @@ flat_mutation_reader make_multishard_streaming_reader(distributed<database>& db,
virtual flat_mutation_reader create_reader(
schema_ptr schema,
const dht::partition_range& range,
- const query::partition_slice&,
+ const query::partition_slice& slice,
const io_priority_class& pc,
tracing::trace_state_ptr,
mutation_reader::forwarding fwd_mr) override {
@@ -1940,7 +1944,7 @@ flat_mutation_reader make_multishard_streaming_reader(distributed<database>& db,
_contexts[shard].read_operation = make_foreign(std::make_unique<utils::phased_barrier::operation>(cf.read_in_progress()));
_contexts[shard].semaphore = &cf.streaming_read_concurrency_semaphore();
- return cf.make_streaming_reader(std::move(schema), *_contexts[shard].range, fwd_mr);
+ return cf.make_streaming_reader(std::move(schema), *_contexts[shard].range, slice, fwd_mr);
}
virtual void destroy_reader(shard_id shard, future<stopped_reader> reader_fut) noexcept override {
reader_fut.then([this, zis = shared_from_this(), shard] (stopped_reader&& reader) mutable {
@@ -1963,7 +1967,8 @@ flat_mutation_reader make_multishard_streaming_reader(distributed<database>& db,
return make_multishard_combining_reader(make_shared<streaming_reader_lifecycle_policy>(db), partitioner, std::move(s), pr, ps, pc,
std::move(trace_state), fwd_mr);
});
- return make_flat_multi_range_reader(std::move(schema), std::move(ms), std::move(range_generator), schema->full_slice(),
+ auto&& full_slice = schema->full_slice();
+ return make_flat_multi_range_reader(std::move(schema), std::move(ms), std::move(range_generator), std::move(full_slice),
service::get_local_streaming_read_priority(), {}, mutation_reader::forwarding::no);
}

View File

@@ -458,6 +458,7 @@ private:
// This semaphore ensures that an operation like snapshot won't have its selected
// sstables deleted by compaction in parallel, a race condition which could
// easily result in failure.
// Locking order: must be acquired either independently or after _sstables_lock
seastar::semaphore _sstable_deletion_sem = {1};
// There are situations in which we need to stop writing sstables. Flushers will take
// the read lock, and the ones that wish to stop that process will take the write lock.
@@ -679,8 +680,13 @@ public:
// Single range overload.
flat_mutation_reader make_streaming_reader(schema_ptr schema, const dht::partition_range& range,
const query::partition_slice& slice,
mutation_reader::forwarding fwd_mr = mutation_reader::forwarding::no) const;
flat_mutation_reader make_streaming_reader(schema_ptr schema, const dht::partition_range& range) {
return make_streaming_reader(schema, range, schema->full_slice());
}
sstables::shared_sstable make_streaming_sstable_for_write(std::optional<sstring> subdir = {});
sstables::shared_sstable make_streaming_staging_sstable() {
return make_streaming_sstable_for_write("staging");
@@ -759,13 +765,7 @@ public:
// SSTable writes are now allowed again, and generation is updated to new_generation if != -1
// returns the amount of microseconds elapsed since we disabled writes.
- std::chrono::steady_clock::duration enable_sstable_write(int64_t new_generation) {
- if (new_generation != -1) {
- update_sstables_known_generation(new_generation);
- }
- _sstables_lock.write_unlock();
- return std::chrono::steady_clock::now() - _sstable_writes_disabled_at;
- }
+ std::chrono::steady_clock::duration enable_sstable_write(int64_t new_generation);
// Make sure the generation numbers are sequential, starting from "start".
// Generations before "start" are left untouched.
@@ -935,7 +935,7 @@ public:
}
private:
- future<row_locker::lock_holder> do_push_view_replica_updates(const schema_ptr& s, mutation&& m, db::timeout_clock::time_point timeout, mutation_source&& source) const;
+ future<row_locker::lock_holder> do_push_view_replica_updates(const schema_ptr& s, mutation&& m, db::timeout_clock::time_point timeout, mutation_source&& source, const io_priority_class& io_priority) const;
std::vector<view_ptr> affected_views(const schema_ptr& base, const mutation& update) const;
future<> generate_and_propagate_view_updates(const schema_ptr& base,
std::vector<view_ptr>&& views,

View File

@@ -396,10 +396,8 @@ std::unordered_set<gms::inet_address> db::batchlog_manager::endpoint_filter(cons
// grab a random member of up to two racks
for (auto& rack : racks) {
- auto rack_members = validated.bucket(rack);
- auto n = validated.bucket_size(rack_members);
+ auto cpy = boost::copy_range<std::vector<gms::inet_address>>(validated.equal_range(rack) | boost::adaptors::map_values);
- std::uniform_int_distribution<size_t> rdist(0, n - 1);
+ std::uniform_int_distribution<size_t> rdist(0, cpy.size() - 1);
result.emplace(cpy[rdist(_e1)]);
}
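[editor's note] The fix above materializes the rack's members into a vector before sampling, so the distribution's bound is guaranteed to match the range actually indexed. A hedged sketch of that pattern with simplified types (plain strings instead of `gms::inet_address`, no Boost ranges):

```cpp
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// Pick one value uniformly at random among all entries with the given
// key. Copying equal_range() into a vector first gives random access
// and a size that matches the sampled range exactly.
std::string pick_random_member(const std::unordered_multimap<std::string, std::string>& m,
                               const std::string& key, std::mt19937& rng) {
    std::vector<std::string> members;
    auto [first, last] = m.equal_range(key);
    for (auto it = first; it != last; ++it) {
        members.push_back(it->second);
    }
    if (members.empty()) {
        return {}; // guard: uniform_int_distribution(0, -1) would be UB
    }
    std::uniform_int_distribution<std::size_t> dist(0, members.size() - 1);
    return members[dist(rng)];
}
```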

View File

@@ -148,9 +148,18 @@ db::commitlog::descriptor::descriptor(const sstring& filename, const std::string
: descriptor([&filename, &fname_prefix]() {
std::smatch m;
// match both legacy and new version of commitlogs Ex: CommitLog-12345.log and CommitLog-4-12345.log.
- std::regex rx("(?:.*/)?(?:Recycled-)?" + fname_prefix + "((\\d+)(" + SEPARATOR + "\\d+)?)" + FILENAME_EXTENSION);
+ std::regex rx("(?:Recycled-)?" + fname_prefix + "((\\d+)(" + SEPARATOR + "\\d+)?)" + FILENAME_EXTENSION);
std::string sfilename = filename;
- if (!std::regex_match(sfilename, m, rx)) {
+ auto cbegin = sfilename.cbegin();
+ // skip the leading path
+ // Note: we're using rfind rather than the regex above
+ // since it may run out of stack in debug builds.
+ // See https://github.com/scylladb/scylla/issues/4464
+ auto pos = std::string(filename).rfind('/');
+ if (pos != std::string::npos) {
+ cbegin += pos + 1;
+ }
+ if (!std::regex_match(cbegin, sfilename.cend(), m, rx)) {
throw std::domain_error("Cannot parse the version of the file: " + filename);
}
if (m[3].length() == 0) {
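[editor's note] The workaround in this hunk strips the directory prefix with `rfind('/')` instead of matching it with a leading `(?:.*/)?` group, which can exhaust the stack in some `std::regex` implementations on long paths (scylladb/scylla#4464). A hedged, self-contained sketch with the prefix/separator/extension hard-coded (the real code builds them from `fname_prefix`, `SEPARATOR`, and `FILENAME_EXTENSION`):

```cpp
#include <regex>
#include <string>

// Strip the directory part before regex matching, rather than letting
// the regex engine backtrack over the whole path.
bool parse_commitlog_name(const std::string& path, std::smatch& m) {
    static const std::regex rx("(?:Recycled-)?CommitLog-((\\d+)(-\\d+)?)\\.log");
    auto begin = path.cbegin();
    auto pos = path.rfind('/');
    if (pos != std::string::npos) {
        begin += pos + 1; // skip the leading path
    }
    return std::regex_match(begin, path.cend(), m, rx);
}
```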
@@ -420,7 +429,11 @@ class db::commitlog::segment : public enable_shared_from_this<segment>, public c
uint64_t _file_pos = 0;
uint64_t _flush_pos = 0;
bool _closed = false;
// Not the same as _closed since files can be reused
bool _closed_file = false;
bool _terminated = false;
using buffer_type = segment_manager::buffer_type;
@@ -486,7 +499,7 @@ public:
clogger.debug("Created new {} segment {}", active ? "active" : "reserve", *this);
}
~segment() {
- if (!_closed) {
+ if (!_closed_file) {
_segment_manager->add_file_to_close(std::move(_file));
}
if (is_clean()) {
@@ -560,7 +573,7 @@ public:
// and we should have waited out all pending.
return me->_pending_ops.close().finally([me] {
return me->_file.truncate(me->_flush_pos).then([me] {
- return me->_file.close();
+ return me->_file.close().finally([me] { me->_closed_file = true; });
});
});
});
@@ -1223,6 +1236,34 @@ void db::commitlog::segment_manager::flush_segments(bool force) {
}
}
/// \brief Helper for ensuring a file is closed if an exception is thrown.
///
/// The file provided by the file_fut future is passed to func.
/// * If func throws an exception E, the file is closed and we return
/// a failed future with E.
/// * If func returns a value V, the file is not closed and we return
/// a future with V.
/// Note that when an exception is not thrown, it is the
/// responsibility of func to make sure the file will be closed. It
/// can close the file itself, return it, or store it somewhere.
///
/// \tparam Func The type of function this wraps
/// \param file_fut A future that produces a file
/// \param func A function that uses a file
/// \return A future that passes the file produced by file_fut to func
/// and closes it if func fails
template <typename Func>
static auto close_on_failure(future<file> file_fut, Func func) {
return file_fut.then([func = std::move(func)](file f) {
return futurize_apply(func, f).handle_exception([f] (std::exception_ptr e) mutable {
return f.close().then_wrapped([f, e = std::move(e)] (future<> x) {
using futurator = futurize<std::result_of_t<Func(file)>>;
return futurator::make_exception_future(e);
});
});
});
}
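[editor's note] The Seastar helper above is asynchronous, but the underlying idea is the classic "close on failure, otherwise hand ownership to the callback" contract its doc comment describes. A synchronous, hedged sketch with a hypothetical resource type standing in for `seastar::file`:

```cpp
#include <stdexcept>

// Hypothetical resource standing in for seastar::file.
struct resource {
    bool* closed;
    void close() { *closed = true; }
};

// If func throws, close the resource and rethrow; if func succeeds,
// func itself is responsible for eventually closing it -- the same
// contract as close_on_failure() above, minus the futures.
template <typename Func>
auto with_close_on_failure(resource r, Func func) {
    try {
        return func(r);
    } catch (...) {
        r.close();
        throw;
    }
}
```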
future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager::allocate_segment_ex(const descriptor& d, sstring filename, open_flags flags, bool active) {
file_open_options opt;
opt.extent_allocation_size_hint = max_size;
@@ -1249,7 +1290,7 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
return fut;
});
- return fut.then([this, d, active, filename](file f) {
+ return close_on_failure(std::move(fut), [this, d, active, filename] (file f) {
f = make_checked_file(commit_error_handler, f);
// xfs doesn't like files extended beyond eof, so enlarge the file
return f.truncate(max_size).then([this, d, active, f, filename] () mutable {


@@ -756,6 +756,8 @@ public:
val(enable_dangerous_direct_import_of_cassandra_counters, bool, false, Used, "Only turn this option on if you want to import tables from Cassandra containing counters, and you are SURE that no counters in that table were created in a version earlier than Cassandra 2.1." \
" It is not enough to have ever since upgraded to newer versions of Cassandra. If you EVER used a version earlier than 2.1 in the cluster where these SSTables come from, DO NOT TURN ON THIS OPTION! You will corrupt your data. You have been warned.") \
val(enable_shard_aware_drivers, bool, true, Used, "Enable native transport drivers to use connection-per-shard for better performance") \
val(abort_on_internal_error, bool, false, Used, "Abort the server instead of throwing exception when internal invariants are violated.") \
val(enable_3_1_0_compatibility_mode, bool, false, Used, "Set to true if the cluster was initially installed from 3.1.0. If it was upgraded from an earlier version, or installed from a later version, leave this set to false. This adjusts the communication protocol to work around a bug in Scylla 3.1.0") \
/* done! */
#define _make_value_member(name, type, deflt, status, desc, ...) \


@@ -57,9 +57,30 @@ static ::shared_ptr<cql3::cql3_type::raw> parse_raw(const sstring& str) {
}
data_type db::cql_type_parser::parse(const sstring& keyspace, const sstring& str, lw_shared_ptr<user_types_metadata> user_types) {
static const thread_local std::unordered_map<sstring, cql3::cql3_type> native_types = []{
std::unordered_map<sstring, cql3::cql3_type> res;
for (auto& nt : cql3::cql3_type::values()) {
res.emplace(nt.to_string(), nt);
}
return res;
}();
auto i = native_types.find(str);
if (i != native_types.end()) {
return i->second.get_type();
}
if (!user_types && service::get_storage_proxy().local_is_initialized()) {
user_types = service::get_storage_proxy().local().get_db().local().find_keyspace(keyspace).metadata()->user_types();
}
// special-case top-level UDTs
if (user_types) {
auto& map = user_types->get_all_types();
auto i = map.find(utf8_type->decompose(str));
if (i != map.end()) {
return i->second;
}
}
auto raw = parse_raw(str);
auto cql = raw->prepare_internal(keyspace, user_types);


@@ -57,7 +57,7 @@ void data_listeners::on_write(const schema_ptr& s, const frozen_mutation& m) {
}
}
toppartitons_item_key::operator sstring() const {
toppartitions_item_key::operator sstring() const {
std::ostringstream oss;
oss << key.key().with_schema(*schema);
return oss.str();
@@ -84,8 +84,11 @@ flat_mutation_reader toppartitions_data_listener::on_read(const schema_ptr& s, c
return std::move(rd);
}
dblog.trace("toppartitions_data_listener::on_read: {}.{}", s->ks_name(), s->cf_name());
return make_filtering_reader(std::move(rd), [this, &range, &slice, s = std::move(s)] (const dht::decorated_key& dk) {
_top_k_read.append(toppartitons_item_key{s, dk});
return make_filtering_reader(std::move(rd), [zis = this->weak_from_this(), &range, &slice, s = std::move(s)] (const dht::decorated_key& dk) {
// The data query may be executing after the toppartitions_data_listener object has been removed, so check the weak pointer before using it
if (zis) {
zis->_top_k_read.append(toppartitions_item_key{s, dk});
}
return true;
});
}
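The fix above captures a weak reference (`weak_from_this`) instead of a raw `this`, so the filtering callback tolerates the listener being destroyed while the query is still running. The same idiom with standard C++ smart pointers, using a hypothetical `Listener` type in place of `toppartitions_data_listener`:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical listener that records observed keys.
struct Listener {
    std::vector<int> seen;
};

// Build a callback that only touches the listener if it is still alive,
// mirroring the `if (zis) { ... }` guard in the diff.
std::function<bool(int)> make_filter(const std::shared_ptr<Listener>& l) {
    return [weak = std::weak_ptr<Listener>(l)](int key) {
        if (auto alive = weak.lock()) {  // listener may already be gone
            alive->seen.push_back(key);
        }
        return true;                     // never filter rows out
    };
}
```

As in the diff, the callback keeps returning `true` after the listener dies: the query still sees all rows, it just stops feeding the top-k accumulator.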
@@ -95,7 +98,27 @@ void toppartitions_data_listener::on_write(const schema_ptr& s, const frozen_mut
return;
}
dblog.trace("toppartitions_data_listener::on_write: {}.{}", _ks, _cf);
_top_k_write.append(toppartitons_item_key{s, m.decorated_key(*s)});
_top_k_write.append(toppartitions_item_key{s, m.decorated_key(*s)});
}
toppartitions_data_listener::global_top_k::results
toppartitions_data_listener::globalize(top_k::results&& r) {
toppartitions_data_listener::global_top_k::results n;
n.reserve(r.size());
for (auto&& e : r) {
n.emplace_back(global_top_k::results::value_type{toppartitions_global_item_key(std::move(e.item)), e.count, e.error});
}
return n;
}
toppartitions_data_listener::top_k::results
toppartitions_data_listener::localize(const global_top_k::results& r) {
toppartitions_data_listener::top_k::results n;
n.reserve(r.size());
for (auto&& e : r) {
n.emplace_back(top_k::results::value_type{toppartitions_item_key(e.item), e.count, e.error});
}
return n;
}
toppartitions_query::toppartitions_query(distributed<database>& xdb, sstring ks, sstring cf,
@@ -108,20 +131,20 @@ future<> toppartitions_query::scatter() {
return _query.start(std::ref(_xdb), _ks, _cf);
}
using top_t = toppartitions_data_listener::top_k::results;
using top_t = toppartitions_data_listener::global_top_k::results;
future<toppartitions_query::results> toppartitions_query::gather(unsigned res_size) {
dblog.debug("toppartitions_query::gather");
auto map = [res_size, this] (toppartitions_data_listener& listener) {
dblog.trace("toppartitions_query::map_reduce with listener {}", &listener);
top_t rd = listener._top_k_read.top(res_size);
top_t wr = listener._top_k_write.top(res_size);
return std::tuple<top_t, top_t>{std::move(rd), std::move(wr)};
top_t rd = toppartitions_data_listener::globalize(listener._top_k_read.top(res_size));
top_t wr = toppartitions_data_listener::globalize(listener._top_k_write.top(res_size));
return make_foreign(std::make_unique<std::tuple<top_t, top_t>>(std::move(rd), std::move(wr)));
};
auto reduce = [this] (results res, std::tuple<top_t, top_t> rd_wr) {
res.read.append(std::get<0>(rd_wr));
res.write.append(std::get<1>(rd_wr));
auto reduce = [this] (results res, foreign_ptr<std::unique_ptr<std::tuple<top_t, top_t>>> rd_wr) {
res.read.append(toppartitions_data_listener::localize(std::get<0>(*rd_wr)));
res.write.append(toppartitions_data_listener::localize(std::get<1>(*rd_wr)));
return std::move(res);
};
return _query.map_reduce0(map, results{res_size}, reduce)


@@ -24,12 +24,14 @@
#include <seastar/core/distributed.hh>
#include <seastar/core/future.hh>
#include <seastar/core/distributed.hh>
#include <seastar/core/weak_ptr.hh>
#include "schema.hh"
#include "flat_mutation_reader.hh"
#include "mutation_reader.hh"
#include "frozen_mutation.hh"
#include "utils/top_k.hh"
#include "schema_registry.hh"
#include <vector>
#include <set>
@@ -75,29 +77,54 @@ public:
};
struct toppartitons_item_key {
struct toppartitions_item_key {
schema_ptr schema;
dht::decorated_key key;
toppartitons_item_key(const schema_ptr& schema, const dht::decorated_key& key) : schema(schema), key(key) {}
toppartitons_item_key(const toppartitons_item_key& key) noexcept : schema(key.schema), key(key.key) {}
toppartitions_item_key(const schema_ptr& schema, const dht::decorated_key& key) : schema(schema), key(key) {}
toppartitions_item_key(const toppartitions_item_key& key) noexcept : schema(key.schema), key(key.key) {}
struct hash {
size_t operator()(const toppartitons_item_key& k) const {
size_t operator()(const toppartitions_item_key& k) const {
return std::hash<dht::token>()(k.key.token());
}
};
struct comp {
bool operator()(const toppartitons_item_key& k1, const toppartitons_item_key& k2) const {
return k1.schema == k2.schema && k1.key.equal(*k2.schema, k2.key);
bool operator()(const toppartitions_item_key& k1, const toppartitions_item_key& k2) const {
return k1.schema->id() == k2.schema->id() && k1.key.equal(*k2.schema, k2.key);
}
};
explicit operator sstring() const;
};
class toppartitions_data_listener : public data_listener {
// Like toppartitions_item_key, but uses global_schema_ptr, so can be safely transported across shards
struct toppartitions_global_item_key {
global_schema_ptr schema;
dht::decorated_key key;
toppartitions_global_item_key(toppartitions_item_key&& tik) : schema(std::move(tik.schema)), key(std::move(tik.key)) {}
operator toppartitions_item_key() const {
return toppartitions_item_key(schema, key);
}
struct hash {
size_t operator()(const toppartitions_global_item_key& k) const {
return std::hash<dht::token>()(k.key.token());
}
};
struct comp {
bool operator()(const toppartitions_global_item_key& k1, const toppartitions_global_item_key& k2) const {
return k1.schema.get()->id() == k2.schema.get()->id() && k1.key.equal(*k2.schema.get(), k2.key);
}
};
explicit operator sstring() const;
};
class toppartitions_data_listener : public data_listener, public weakly_referencable<toppartitions_data_listener> {
friend class toppartitions_query;
database& _db;
@@ -105,7 +132,11 @@ class toppartitions_data_listener : public data_listener {
sstring _cf;
public:
using top_k = utils::space_saving_top_k<toppartitons_item_key, toppartitons_item_key::hash, toppartitons_item_key::comp>;
using top_k = utils::space_saving_top_k<toppartitions_item_key, toppartitions_item_key::hash, toppartitions_item_key::comp>;
using global_top_k = utils::space_saving_top_k<toppartitions_global_item_key, toppartitions_global_item_key::hash, toppartitions_global_item_key::comp>;
public:
static global_top_k::results globalize(top_k::results&& r);
static top_k::results localize(const global_top_k::results& r);
private:
top_k _top_k_read;
top_k _top_k_write;


@@ -118,8 +118,8 @@ future<> manager::stop() {
return _draining_eps_gate.close().finally([this] {
return parallel_for_each(_ep_managers, [] (auto& pair) {
return pair.second.stop();
}).finally([this] {
return pair.second.stop();
}).finally([this] {
_ep_managers.clear();
manager_logger.info("Stopped");
}).discard_result();
@@ -240,6 +240,8 @@ future<> manager::end_point_hints_manager::stop(drain should_drain) noexcept {
manager::end_point_hints_manager::end_point_hints_manager(const key_type& key, manager& shard_manager)
: _key(key)
, _shard_manager(shard_manager)
, _file_update_mutex_ptr(make_lw_shared<seastar::shared_mutex>())
, _file_update_mutex(*_file_update_mutex_ptr)
, _state(state_set::of<state::stopped>())
, _hints_dir(_shard_manager.hints_dir() / format("{}", _key).c_str())
, _sender(*this, _shard_manager.local_storage_proxy(), _shard_manager.local_db(), _shard_manager.local_gossiper())
@@ -248,6 +250,8 @@ manager::end_point_hints_manager::end_point_hints_manager(const key_type& key, m
manager::end_point_hints_manager::end_point_hints_manager(end_point_hints_manager&& other)
: _key(other._key)
, _shard_manager(other._shard_manager)
, _file_update_mutex_ptr(std::move(other._file_update_mutex_ptr))
, _file_update_mutex(*_file_update_mutex_ptr)
, _state(other._state)
, _hints_dir(std::move(other._hints_dir))
, _sender(other._sender, *this)
@@ -520,28 +524,35 @@ void manager::drain_for(gms::inet_address endpoint) {
manager_logger.trace("on_leave_cluster: {} is removed/decommissioned", endpoint);
with_gate(_draining_eps_gate, [this, endpoint] {
return futurize_apply([this, endpoint] () {
if (utils::fb_utilities::is_me(endpoint)) {
return parallel_for_each(_ep_managers, [] (auto& pair) {
return pair.second.stop(drain::yes).finally([&pair] {
return remove_file(pair.second.hints_dir().c_str());
return with_semaphore(drain_lock(), 1, [this, endpoint] {
return futurize_apply([this, endpoint] () {
if (utils::fb_utilities::is_me(endpoint)) {
return parallel_for_each(_ep_managers, [] (auto& pair) {
return pair.second.stop(drain::yes).finally([&pair] {
return with_file_update_mutex(pair.second, [&pair] {
return remove_file(pair.second.hints_dir().c_str());
});
});
}).finally([this] {
_ep_managers.clear();
});
}).finally([this] {
_ep_managers.clear();
});
} else {
ep_managers_map_type::iterator ep_manager_it = find_ep_manager(endpoint);
if (ep_manager_it != ep_managers_end()) {
return ep_manager_it->second.stop(drain::yes).finally([this, endpoint, hints_dir = ep_manager_it->second.hints_dir()] {
_ep_managers.erase(endpoint);
return remove_file(hints_dir.c_str());
});
}
} else {
ep_managers_map_type::iterator ep_manager_it = find_ep_manager(endpoint);
if (ep_manager_it != ep_managers_end()) {
return ep_manager_it->second.stop(drain::yes).finally([this, endpoint, &ep_man = ep_manager_it->second] {
return with_file_update_mutex(ep_man, [&ep_man] {
return remove_file(ep_man.hints_dir().c_str());
}).finally([this, endpoint] {
_ep_managers.erase(endpoint);
});
});
}
return make_ready_future<>();
}
}).handle_exception([endpoint] (auto eptr) {
manager_logger.error("Exception when draining {}: {}", endpoint, eptr);
return make_ready_future<>();
}
}).handle_exception([endpoint] (auto eptr) {
manager_logger.error("Exception when draining {}: {}", endpoint, eptr);
});
});
});
}


@@ -276,7 +276,8 @@ public:
manager& _shard_manager;
hints_store_ptr _hints_store_anchor;
seastar::gate _store_gate;
seastar::shared_mutex _file_update_mutex;
lw_shared_ptr<seastar::shared_mutex> _file_update_mutex_ptr;
seastar::shared_mutex& _file_update_mutex;
enum class state {
can_hint, // hinting is currently allowed (used by the space_watchdog)
@@ -378,8 +379,20 @@ public:
return _state.contains(state::stopped);
}
seastar::shared_mutex& file_update_mutex() {
return _file_update_mutex;
/// \brief Safely runs a given functor under the file_update_mutex of \ref ep_man
///
/// Runs a given functor under the file_update_mutex of the given end_point_hints_manager instance.
/// This function is safe even if \ref ep_man gets destroyed before the future this function returns resolves
/// (as long as the \ref func call itself is safe).
///
/// \tparam Func Functor type.
/// \param ep_man end_point_hints_manager instance which file_update_mutex we want to lock.
/// \param func Functor to run under the lock.
/// \return Whatever \ref func returns.
template <typename Func>
friend inline auto with_file_update_mutex(end_point_hints_manager& ep_man, Func&& func) {
lw_shared_ptr<seastar::shared_mutex> lock_ptr = ep_man._file_update_mutex_ptr;
return with_lock(*lock_ptr, std::forward<Func>(func)).finally([lock_ptr] {});
}
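`with_file_update_mutex` above copies the `lw_shared_ptr` into the continuation (the `.finally([lock_ptr] {})`), so the mutex stays alive even if the owning `end_point_hints_manager` is destroyed before the future resolves. A plain-C++ sketch of the same lifetime trick, with `std::shared_ptr` in place of `lw_shared_ptr` and a synchronous callable in place of a future chain:

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Run func under *lock_ptr. Taking the shared_ptr by value keeps the
// mutex alive for the whole critical section, even if the original
// owner drops its reference while func runs.
template <typename Func>
auto with_shared_mutex(std::shared_ptr<std::mutex> lock_ptr, Func func) {
    std::lock_guard<std::mutex> guard(*lock_ptr);
    return func();
}
```

Had the helper taken a bare `std::mutex&` instead, destroying the owner mid-section would leave the guard unlocking a dead mutex, which is exactly the hazard the diff removes.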
const fs::path& hints_dir() const noexcept {
@@ -387,6 +400,10 @@ public:
}
private:
seastar::shared_mutex& file_update_mutex() noexcept {
return _file_update_mutex;
}
/// \brief Creates a new hints store object.
///
/// - Creates a hints store directory if doesn't exist: <shard_hints_dir>/<ep_key>
@@ -453,6 +470,7 @@ private:
stats _stats;
seastar::metrics::metric_groups _metrics;
std::unordered_set<ep_key_type> _eps_with_pending_hints;
seastar::semaphore _drain_lock = {1};
public:
manager(sstring hints_directory, std::vector<sstring> hinted_dcs, int64_t max_hint_window_ms, resource_manager& res_manager, distributed<database>& db);
@@ -531,6 +549,10 @@ public:
return _hints_dir_device_id;
}
seastar::semaphore& drain_lock() noexcept {
return _drain_lock;
}
void allow_hints();
void forbid_hints();
void forbid_hints_for_eps_with_pending_hints();


@@ -89,16 +89,27 @@ future<> space_watchdog::stop() noexcept {
return std::move(_started);
}
// Called under the end_point_hints_manager::file_update_mutex() of the corresponding end_point_hints_manager instance.
future<> space_watchdog::scan_one_ep_dir(fs::path path, manager& shard_manager, ep_key_type ep_key) {
return lister::scan_dir(path, { directory_entry_type::regular }, [this, ep_key, &shard_manager] (fs::path dir, directory_entry de) {
// Put the current end point ID to state.eps_with_pending_hints when we see the second hints file in its directory
if (_files_count == 1) {
shard_manager.add_ep_with_pending_hints(ep_key);
}
++_files_count;
return do_with(std::move(path), [this, ep_key, &shard_manager] (fs::path& path) {
// It may happen that we get here and the directory has already been deleted in the context of manager::drain_for().
// In this case simply bail out.
return engine().file_exists(path.native()).then([this, ep_key, &shard_manager, &path] (bool exists) {
if (!exists) {
return make_ready_future<>();
} else {
return lister::scan_dir(path, { directory_entry_type::regular }, [this, ep_key, &shard_manager] (fs::path dir, directory_entry de) {
// Put the current end point ID to state.eps_with_pending_hints when we see the second hints file in its directory
if (_files_count == 1) {
shard_manager.add_ep_with_pending_hints(ep_key);
}
++_files_count;
return io_check(file_size, (dir / de.name.c_str()).c_str()).then([this] (uint64_t fsize) {
_total_size += fsize;
return io_check(file_size, (dir / de.name.c_str()).c_str()).then([this] (uint64_t fsize) {
_total_size += fsize;
});
});
}
});
});
}
@@ -136,7 +147,7 @@ void space_watchdog::on_timer() {
// continue to enumeration - there is no one to change them.
auto it = shard_manager.find_ep_manager(de.name);
if (it != shard_manager.ep_managers_end()) {
return with_lock(it->second.file_update_mutex(), [this, &shard_manager, dir = std::move(dir), ep_name = std::move(de.name)]() mutable {
return with_file_update_mutex(it->second, [this, &shard_manager, dir = std::move(dir), ep_name = std::move(de.name)] () mutable {
return scan_one_ep_dir(dir / ep_name, shard_manager, ep_key_type(ep_name));
});
} else {


@@ -26,11 +26,17 @@
namespace db {
enum class schema_feature {
VIEW_VIRTUAL_COLUMNS
VIEW_VIRTUAL_COLUMNS,
// When set, the schema digest is calculated in a way such that it doesn't change after all
// tombstones in an empty partition expire.
// See https://github.com/scylladb/scylla/issues/4485
DIGEST_INSENSITIVE_TO_EXPIRY,
};
using schema_features = enum_set<super_enum<schema_feature,
schema_feature::VIEW_VIRTUAL_COLUMNS
schema_feature::VIEW_VIRTUAL_COLUMNS,
schema_feature::DIGEST_INSENSITIVE_TO_EXPIRY
>>;
}


@@ -587,9 +587,9 @@ future<utils::UUID> calculate_schema_digest(distributed<service::storage_proxy>&
return mutations;
});
};
auto reduce = [] (auto& hash, auto&& mutations) {
auto reduce = [features] (auto& hash, auto&& mutations) {
for (const mutation& m : mutations) {
feed_hash_for_schema_digest(hash, m);
feed_hash_for_schema_digest(hash, m, features);
}
};
return do_with(md5_hasher(), all_table_names(features), [features, map, reduce] (auto& hash, auto& tables) {
@@ -778,6 +778,13 @@ mutation compact_for_schema_digest(const mutation& m) {
return m_compacted;
}
void feed_hash_for_schema_digest(hasher& h, const mutation& m, schema_features features) {
auto compacted = compact_for_schema_digest(m);
if (!features.contains<schema_feature::DIGEST_INSENSITIVE_TO_EXPIRY>() || !compacted.partition().empty()) {
feed_hash(h, compact_for_schema_digest(m));
}
}
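With `DIGEST_INSENSITIVE_TO_EXPIRY` enabled, `feed_hash_for_schema_digest` above skips mutations whose compacted partition is empty, so a digest computed before the tombstones of a dropped object expire matches the one computed after. A toy model of that property, using strings as "partitions" (an empty string plays a fully-expired partition) and a simple hash combiner as the hasher:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy digest: combine hashes of all partitions, optionally skipping
// partitions that compacted down to nothing (all tombstones expired).
size_t digest(const std::vector<std::string>& parts, bool skip_empty) {
    size_t h = 0;
    for (const auto& p : parts) {
        if (skip_empty && p.empty()) {
            continue;  // an expired-away partition must not affect the digest
        }
        h ^= std::hash<std::string>{}(p) + 0x9e3779b9 + (h << 6) + (h >> 2);
    }
    return h;
}
```

Without the skip, two nodes whose tombstones expired at different times would disagree on the schema digest and trigger needless schema pulls, which is the symptom described in issue #4485.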
// Applies deletion of the "version" column to a system_schema.scylla_tables mutation.
static void delete_schema_version(mutation& m) {
if (m.column_family_id() != scylla_tables()->id()) {
@@ -1085,10 +1092,31 @@ static std::vector<V> get_list(const query::result_set_row& row, const sstring&
// Create types for a given keyspace. This takes care of topologically sorting user defined types.
template <typename T> static std::vector<user_type> create_types(keyspace_metadata& ks, T&& range) {
cql_type_parser::raw_builder builder(ks);
std::unordered_set<bytes> names;
for (const query::result_set_row& row : range) {
builder.add(row.get_nonnull<sstring>("type_name"),
get_list<sstring>(row, "field_names"),
get_list<sstring>(row, "field_types"));
auto name = row.get_nonnull<sstring>("type_name");
names.insert(to_bytes(name));
builder.add(std::move(name), get_list<sstring>(row, "field_names"), get_list<sstring>(row, "field_types"));
}
// Add user types that use any of the above types. From the
// database point of view they haven't changed since the content
// of system.types is the same for them. The runtime objects in
// the other hand now point to out of date types, so we need to
// recreate them.
for (const auto& p : ks.user_types()->get_all_types()) {
const user_type& t = p.second;
if (names.count(t->_name) != 0) {
continue;
}
for (const auto& name : names) {
if (t->references_user_type(t->_keyspace, name)) {
std::vector<sstring> field_types;
for (const data_type& f : t->field_types()) {
field_types.push_back(f->as_cql3_type().to_string());
}
builder.add(t->get_name_as_string(), t->string_field_names(), std::move(field_types));
}
}
}
return builder.build();
}
@@ -2727,8 +2755,9 @@ namespace legacy {
table_schema_version schema_mutations::digest() const {
md5_hasher h;
db::schema_tables::feed_hash_for_schema_digest(h, _columnfamilies);
db::schema_tables::feed_hash_for_schema_digest(h, _columns);
const db::schema_features no_features;
db::schema_tables::feed_hash_for_schema_digest(h, _columnfamilies, no_features);
db::schema_tables::feed_hash_for_schema_digest(h, _columns, no_features);
return utils::UUID_gen::get_name_UUID(h.finalize());
}


@@ -215,10 +215,7 @@ index_metadata_kind deserialize_index_kind(sstring kind);
mutation compact_for_schema_digest(const mutation& m);
template<typename Hasher>
void feed_hash_for_schema_digest(Hasher& h, const mutation& m) {
feed_hash(h, compact_for_schema_digest(m));
}
void feed_hash_for_schema_digest(hasher&, const mutation&, schema_features);
} // namespace schema_tables
} // namespace db


@@ -0,0 +1,328 @@
/*
* Copyright (C) 2019 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <boost/range/adaptor/indirected.hpp>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/algorithm/find_if.hpp>
#include "clustering_bounds_comparator.hh"
#include "database_fwd.hh"
#include "db/system_keyspace.hh"
#include "dht/i_partitioner.hh"
#include "partition_range_compat.hh"
#include "range.hh"
#include "service/storage_service.hh"
#include "mutation_fragment.hh"
#include "sstables/sstables.hh"
#include "db/timeout_clock.hh"
#include "database.hh"
#include "db/size_estimates_virtual_reader.hh"
namespace db {
namespace size_estimates {
struct virtual_row {
const bytes& cf_name;
const token_range& tokens;
clustering_key_prefix as_key() const {
return clustering_key_prefix::from_exploded(std::vector<bytes_view>{cf_name, tokens.start, tokens.end});
}
};
struct virtual_row_comparator {
schema_ptr _schema;
virtual_row_comparator(schema_ptr schema) : _schema(schema) { }
bool operator()(const clustering_key_prefix& key1, const clustering_key_prefix& key2) {
return clustering_key_prefix::prefix_equality_less_compare(*_schema)(key1, key2);
}
bool operator()(const virtual_row& row, const clustering_key_prefix& key) {
return operator()(row.as_key(), key);
}
bool operator()(const clustering_key_prefix& key, const virtual_row& row) {
return operator()(key, row.as_key());
}
};
// Iterating over the cartesian product of cf_names and token_ranges.
class virtual_row_iterator : public std::iterator<std::input_iterator_tag, const virtual_row> {
std::reference_wrapper<const std::vector<bytes>> _cf_names;
std::reference_wrapper<const std::vector<token_range>> _ranges;
size_t _cf_names_idx = 0;
size_t _ranges_idx = 0;
public:
struct end_iterator_tag {};
virtual_row_iterator(const std::vector<bytes>& cf_names, const std::vector<token_range>& ranges)
: _cf_names(std::ref(cf_names))
, _ranges(std::ref(ranges))
{ }
virtual_row_iterator(const std::vector<bytes>& cf_names, const std::vector<token_range>& ranges, end_iterator_tag)
: _cf_names(std::ref(cf_names))
, _ranges(std::ref(ranges))
, _cf_names_idx(cf_names.size())
, _ranges_idx(ranges.size())
{
if (cf_names.empty() || ranges.empty()) {
// The product of an empty range with any range is an empty range.
// In this case we want the end iterator to be equal to the begin iterator,
// which has _ranges_idx = _cf_names_idx = 0.
_ranges_idx = _cf_names_idx = 0;
}
}
virtual_row_iterator& operator++() {
if (++_ranges_idx == _ranges.get().size() && ++_cf_names_idx < _cf_names.get().size()) {
_ranges_idx = 0;
}
return *this;
}
virtual_row_iterator operator++(int) {
virtual_row_iterator i(*this);
++(*this);
return i;
}
const value_type operator*() const {
return { _cf_names.get()[_cf_names_idx], _ranges.get()[_ranges_idx] };
}
bool operator==(const virtual_row_iterator& i) const {
return _cf_names_idx == i._cf_names_idx
&& _ranges_idx == i._ranges_idx;
}
bool operator!=(const virtual_row_iterator& i) const {
return !(*this == i);
}
};
/**
* Returns the keyspaces, ordered by name, as selected by the partition_range.
*/
static std::vector<sstring> get_keyspaces(const schema& s, const database& db, dht::partition_range range) {
struct keyspace_less_comparator {
const schema& _s;
keyspace_less_comparator(const schema& s) : _s(s) { }
dht::ring_position as_ring_position(const sstring& ks) {
auto pkey = partition_key::from_single_value(_s, utf8_type->decompose(ks));
return dht::global_partitioner().decorate_key(_s, std::move(pkey));
}
bool operator()(const sstring& ks1, const sstring& ks2) {
return as_ring_position(ks1).less_compare(_s, as_ring_position(ks2));
}
bool operator()(const sstring& ks, const dht::ring_position& rp) {
return as_ring_position(ks).less_compare(_s, rp);
}
bool operator()(const dht::ring_position& rp, const sstring& ks) {
return rp.less_compare(_s, as_ring_position(ks));
}
};
auto keyspaces = db.get_non_system_keyspaces();
auto cmp = keyspace_less_comparator(s);
boost::sort(keyspaces, cmp);
return boost::copy_range<std::vector<sstring>>(
range.slice(keyspaces, std::move(cmp)) | boost::adaptors::filtered([&s] (const auto& ks) {
// If this is a range query, results are divided between shards by the partition key (keyspace_name).
return shard_of(dht::global_partitioner().get_token(s,
partition_key::from_single_value(s, utf8_type->decompose(ks))))
== engine().cpu_id();
})
);
}
/**
* Makes a wrapping range of ring_position from a nonwrapping range of token, used to select sstables.
*/
static dht::partition_range as_ring_position_range(dht::token_range& r) {
std::optional<range<dht::ring_position>::bound> start_bound, end_bound;
if (r.start()) {
start_bound = {{ dht::ring_position(r.start()->value(), dht::ring_position::token_bound::start), r.start()->is_inclusive() }};
}
if (r.end()) {
end_bound = {{ dht::ring_position(r.end()->value(), dht::ring_position::token_bound::end), r.end()->is_inclusive() }};
}
return dht::partition_range(std::move(start_bound), std::move(end_bound), r.is_singular());
}
/**
* Add a new range_estimates for the specified range, considering the sstables associated with `cf`.
*/
static system_keyspace::range_estimates estimate(const column_family& cf, const token_range& r) {
int64_t count{0};
utils::estimated_histogram hist{0};
auto from_bytes = [] (auto& b) {
return dht::global_partitioner().from_sstring(utf8_type->to_string(b));
};
dht::token_range_vector ranges;
::compat::unwrap_into(
wrapping_range<dht::token>({{ from_bytes(r.start), false }}, {{ from_bytes(r.end) }}),
dht::token_comparator(),
[&] (auto&& rng) { ranges.push_back(std::move(rng)); });
for (auto&& r : ranges) {
auto rp_range = as_ring_position_range(r);
for (auto&& sstable : cf.select_sstables(rp_range)) {
count += sstable->estimated_keys_for_range(r);
hist.merge(sstable->get_stats_metadata().estimated_partition_size);
}
}
return {cf.schema(), r.start, r.end, count, count > 0 ? hist.mean() : 0};
}
future<std::vector<token_range>> get_local_ranges() {
auto& ss = service::get_local_storage_service();
return ss.get_local_tokens().then([&ss] (auto&& tokens) {
auto ranges = ss.get_token_metadata().get_primary_ranges_for(std::move(tokens));
std::vector<token_range> local_ranges;
auto to_bytes = [](const std::optional<dht::token_range::bound>& b) {
assert(b);
return utf8_type->decompose(dht::global_partitioner().to_sstring(b->value()));
};
// We merge the ranges to be compatible with how Cassandra shows its size estimates table.
// All queries will be on that table, where all entries are text and there's no notion of
// token ranges from the CQL point of view.
auto left_inf = boost::find_if(ranges, [] (auto&& r) {
return !r.start() || r.start()->value() == dht::minimum_token();
});
auto right_inf = boost::find_if(ranges, [] (auto&& r) {
return !r.end() || r.end()->value() == dht::maximum_token();
});
if (left_inf != right_inf && left_inf != ranges.end() && right_inf != ranges.end()) {
local_ranges.push_back(token_range{to_bytes(right_inf->start()), to_bytes(left_inf->end())});
ranges.erase(left_inf);
ranges.erase(right_inf);
}
for (auto&& r : ranges) {
local_ranges.push_back(token_range{to_bytes(r.start()), to_bytes(r.end())});
}
boost::sort(local_ranges, [] (auto&& tr1, auto&& tr2) {
return utf8_type->less(tr1.start, tr2.start);
});
return local_ranges;
});
}
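`get_local_ranges` above stitches the range ending at the ring's maximum token to the range starting at the minimum token, so the wrap-around point shows up as a single entry, matching Cassandra's presentation. A toy sketch of that merge over integer ranges on an assumed ring spanning `MIN_TOK..MAX_TOK`:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

constexpr int MIN_TOK = 0, MAX_TOK = 100;  // assumed toy ring bounds

// Merge the range starting at MIN_TOK with the range ending at MAX_TOK
// into one wrap-around entry {start-of-last, end-of-first}, as
// get_local_ranges does for the token ring.
std::vector<std::pair<int, int>> merge_wrap(std::vector<std::pair<int, int>> rs) {
    auto left = std::find_if(rs.begin(), rs.end(),
                             [](const std::pair<int, int>& r) { return r.first == MIN_TOK; });
    auto right = std::find_if(rs.begin(), rs.end(),
                              [](const std::pair<int, int>& r) { return r.second == MAX_TOK; });
    if (left != right && left != rs.end() && right != rs.end()) {
        std::pair<int, int> merged{right->first, left->second};
        // Erase the later element first so the earlier iterator stays valid.
        if (left < right) { rs.erase(right); rs.erase(left); }
        else { rs.erase(left); rs.erase(right); }
        rs.push_back(merged);
    }
    return rs;
}
```

For example, primary ranges `(0,25] (25,60] (60,100]` collapse to `(25,60]` plus the wrapped entry `(60,25]`.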
size_estimates_mutation_reader::size_estimates_mutation_reader(schema_ptr schema, const dht::partition_range& prange, const query::partition_slice& slice, streamed_mutation::forwarding fwd)
: impl(schema)
, _schema(std::move(schema))
, _prange(&prange)
, _slice(slice)
, _fwd(fwd)
{ }
future<> size_estimates_mutation_reader::get_next_partition() {
auto& db = service::get_local_storage_proxy().get_db().local();
if (!_keyspaces) {
_keyspaces = get_keyspaces(*_schema, db, *_prange);
_current_partition = _keyspaces->begin();
}
if (_current_partition == _keyspaces->end()) {
_end_of_stream = true;
return make_ready_future<>();
}
return get_local_ranges().then([&db, this] (auto&& ranges) {
auto estimates = this->estimates_for_current_keyspace(db, std::move(ranges));
auto mutations = db::system_keyspace::make_size_estimates_mutation(*_current_partition, std::move(estimates));
++_current_partition;
std::vector<mutation> ms;
ms.emplace_back(std::move(mutations));
_partition_reader = flat_mutation_reader_from_mutations(std::move(ms), _fwd);
});
}
future<> size_estimates_mutation_reader::fill_buffer(db::timeout_clock::time_point timeout) {
return do_until([this, timeout] { return is_end_of_stream() || is_buffer_full(); }, [this, timeout] {
if (!_partition_reader) {
return get_next_partition();
}
return _partition_reader->consume_pausable([this] (mutation_fragment mf) {
push_mutation_fragment(std::move(mf));
return stop_iteration(is_buffer_full());
}, timeout).then([this] {
if (_partition_reader->is_end_of_stream() && _partition_reader->is_buffer_empty()) {
_partition_reader = std::nullopt;
}
});
});
}
void size_estimates_mutation_reader::next_partition() {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_partition_reader = std::nullopt;
}
}
future<> size_estimates_mutation_reader::fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) {
clear_buffer();
_prange = &pr;
_keyspaces = std::nullopt;
_partition_reader = std::nullopt;
_end_of_stream = false;
return make_ready_future<>();
}
future<> size_estimates_mutation_reader::fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) {
forward_buffer_to(pr.start());
_end_of_stream = false;
if (_partition_reader) {
return _partition_reader->fast_forward_to(std::move(pr), timeout);
}
return make_ready_future<>();
}
size_t size_estimates_mutation_reader::buffer_size() const {
if (_partition_reader) {
return flat_mutation_reader::impl::buffer_size() + _partition_reader->buffer_size();
}
return flat_mutation_reader::impl::buffer_size();
}
std::vector<db::system_keyspace::range_estimates>
size_estimates_mutation_reader::estimates_for_current_keyspace(const database& db, std::vector<token_range> local_ranges) const {
// For each specified range, estimate (crudely) mean partition size and partitions count.
auto pkey = partition_key::from_single_value(*_schema, utf8_type->decompose(*_current_partition));
auto cfs = db.find_keyspace(*_current_partition).metadata()->cf_meta_data();
auto cf_names = boost::copy_range<std::vector<bytes>>(cfs | boost::adaptors::transformed([] (auto&& cf) {
return utf8_type->decompose(cf.first);
}));
boost::sort(cf_names, [] (auto&& n1, auto&& n2) {
return utf8_type->less(n1, n2);
});
std::vector<db::system_keyspace::range_estimates> estimates;
for (auto& range : _slice.row_ranges(*_schema, pkey)) {
auto rows = boost::make_iterator_range(
virtual_row_iterator(cf_names, local_ranges),
virtual_row_iterator(cf_names, local_ranges, virtual_row_iterator::end_iterator_tag()));
auto rows_to_estimate = range.slice(rows, virtual_row_comparator(_schema));
for (auto&& r : rows_to_estimate) {
auto& cf = db.find_column_family(*_current_partition, utf8_type->to_string(r.cf_name));
estimates.push_back(estimate(cf, r.tokens));
if (estimates.size() >= _slice.partition_row_limit()) {
return estimates;
}
}
}
return estimates;
}
} // namespace size_estimates
} // namespace db
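The fill_buffer() loop above alternates between opening a reader for the next keyspace partition and draining fragments from it until the output buffer fills or the stream ends; when the sub-reader is exhausted it is dropped so the next iteration opens the next partition. A minimal sketch of that pull loop in Python (hypothetical reader model, no Seastar futures):

```python
def fill_buffer(reader):
    """Pull fragments until the buffer is full or the stream ends.

    `reader` mimics size_estimates_mutation_reader: it owns an optional
    per-partition sub-reader (an iterator here) and a bounded buffer.
    """
    while not (reader.end_of_stream or reader.buffer_full()):
        if reader.partition_reader is None:
            reader.get_next_partition()  # may set end_of_stream instead
            continue
        for frag in reader.partition_reader:
            reader.buffer.append(frag)
            if reader.buffer_full():
                break
        else:
            # Sub-reader exhausted: drop it so the next loop iteration
            # opens the next partition.
            reader.partition_reader = None
```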

@@ -21,33 +21,18 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <boost/range/adaptor/indirected.hpp>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/algorithm/find_if.hpp>
#include "clustering_bounds_comparator.hh"
#include "database_fwd.hh"
#include "db/system_keyspace.hh"
#include "dht/i_partitioner.hh"
#include "mutation_reader.hh"
#include "partition_range_compat.hh"
#include "range.hh"
#include "service/storage_service.hh"
#include "mutation_fragment.hh"
#include "sstables/sstables.hh"
#include "db/timeout_clock.hh"
#include "database.hh"
namespace db {
namespace size_estimates {
struct token_range {
bytes start;
bytes end;
};
class size_estimates_mutation_reader final : public flat_mutation_reader::impl {
struct token_range {
bytes start;
bytes end;
};
schema_ptr _schema;
const dht::partition_range* _prange;
const query::partition_slice& _slice;
@@ -57,267 +42,18 @@ class size_estimates_mutation_reader final : public flat_mutation_reader::impl {
streamed_mutation::forwarding _fwd;
flat_mutation_reader_opt _partition_reader;
public:
size_estimates_mutation_reader(schema_ptr schema, const dht::partition_range& prange, const query::partition_slice& slice, streamed_mutation::forwarding fwd)
: impl(schema)
, _schema(std::move(schema))
, _prange(&prange)
, _slice(slice)
, _fwd(fwd)
{ }
size_estimates_mutation_reader(schema_ptr, const dht::partition_range&, const query::partition_slice&, streamed_mutation::forwarding);
virtual future<> fill_buffer(db::timeout_clock::time_point) override;
virtual void next_partition() override;
virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point) override;
virtual future<> fast_forward_to(position_range, db::timeout_clock::time_point) override;
virtual size_t buffer_size() const override;
private:
future<> get_next_partition() {
// For each specified range, estimate (crudely) mean partition size and partitions count.
auto& db = service::get_local_storage_proxy().get_db().local();
if (!_keyspaces) {
_keyspaces = get_keyspaces(*_schema, db, *_prange);
_current_partition = _keyspaces->begin();
}
if (_current_partition == _keyspaces->end()) {
_end_of_stream = true;
return make_ready_future<>();
}
return get_local_ranges().then([&db, this] (auto&& ranges) {
auto estimates = this->estimates_for_current_keyspace(db, std::move(ranges));
auto mutations = db::system_keyspace::make_size_estimates_mutation(*_current_partition, std::move(estimates));
++_current_partition;
std::vector<mutation> ms;
ms.emplace_back(std::move(mutations));
_partition_reader = flat_mutation_reader_from_mutations(std::move(ms), _fwd);
});
}
public:
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
return do_until([this, timeout] { return is_end_of_stream() || is_buffer_full(); }, [this, timeout] {
if (!_partition_reader) {
return get_next_partition();
}
return _partition_reader->consume_pausable([this] (mutation_fragment mf) {
push_mutation_fragment(std::move(mf));
return stop_iteration(is_buffer_full());
}, timeout).then([this] {
if (_partition_reader->is_end_of_stream() && _partition_reader->is_buffer_empty()) {
_partition_reader = std::nullopt;
}
});
});
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_partition_reader = std::nullopt;
}
}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override {
clear_buffer();
_prange = &pr;
_keyspaces = std::nullopt;
_partition_reader = std::nullopt;
_end_of_stream = false;
return make_ready_future<>();
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
forward_buffer_to(pr.start());
_end_of_stream = false;
if (_partition_reader) {
return _partition_reader->fast_forward_to(std::move(pr), timeout);
}
return make_ready_future<>();
}
virtual size_t buffer_size() const override {
if (_partition_reader) {
return flat_mutation_reader::impl::buffer_size() + _partition_reader->buffer_size();
}
return flat_mutation_reader::impl::buffer_size();
}
/**
* Returns the primary ranges for the local node.
* Used for testing as well.
*/
static future<std::vector<token_range>> get_local_ranges() {
auto& ss = service::get_local_storage_service();
return ss.get_local_tokens().then([&ss] (auto&& tokens) {
auto ranges = ss.get_token_metadata().get_primary_ranges_for(std::move(tokens));
std::vector<token_range> local_ranges;
auto to_bytes = [](const std::optional<dht::token_range::bound>& b) {
assert(b);
return utf8_type->decompose(dht::global_partitioner().to_sstring(b->value()));
};
// We merge the ranges to be compatible with how Cassandra shows its size estimates table.
// All queries will be on that table, where all entries are text and there's no notion of
// token ranges from the CQL point of view.
auto left_inf = boost::find_if(ranges, [] (auto&& r) {
return !r.start() || r.start()->value() == dht::minimum_token();
});
auto right_inf = boost::find_if(ranges, [] (auto&& r) {
return !r.end() || r.end()->value() == dht::maximum_token();
});
if (left_inf != right_inf && left_inf != ranges.end() && right_inf != ranges.end()) {
local_ranges.push_back(token_range{to_bytes(right_inf->start()), to_bytes(left_inf->end())});
ranges.erase(left_inf);
ranges.erase(right_inf);
}
for (auto&& r : ranges) {
local_ranges.push_back(token_range{to_bytes(r.start()), to_bytes(r.end())});
}
boost::sort(local_ranges, [] (auto&& tr1, auto&& tr2) {
return utf8_type->less(tr1.start, tr2.start);
});
return local_ranges;
});
}
private:
struct virtual_row {
const bytes& cf_name;
const token_range& tokens;
clustering_key_prefix as_key() const {
return clustering_key_prefix::from_exploded(std::vector<bytes_view>{cf_name, tokens.start, tokens.end});
}
};
struct virtual_row_comparator {
schema_ptr _schema;
virtual_row_comparator(schema_ptr schema) : _schema(schema) { }
bool operator()(const clustering_key_prefix& key1, const clustering_key_prefix& key2) {
return clustering_key_prefix::prefix_equality_less_compare(*_schema)(key1, key2);
}
bool operator()(const virtual_row& row, const clustering_key_prefix& key) {
return operator()(row.as_key(), key);
}
bool operator()(const clustering_key_prefix& key, const virtual_row& row) {
return operator()(key, row.as_key());
}
};
class virtual_row_iterator : public std::iterator<std::input_iterator_tag, const virtual_row> {
std::reference_wrapper<const std::vector<bytes>> _cf_names;
std::reference_wrapper<const std::vector<token_range>> _ranges;
size_t _cf_names_idx = 0;
size_t _ranges_idx = 0;
public:
struct end_iterator_tag {};
virtual_row_iterator(const std::vector<bytes>& cf_names, const std::vector<token_range>& ranges)
: _cf_names(std::ref(cf_names))
, _ranges(std::ref(ranges))
{ }
virtual_row_iterator(const std::vector<bytes>& cf_names, const std::vector<token_range>& ranges, end_iterator_tag)
: _cf_names(std::ref(cf_names))
, _ranges(std::ref(ranges))
, _cf_names_idx(cf_names.size())
, _ranges_idx(ranges.size())
{ }
virtual_row_iterator& operator++() {
if (++_ranges_idx == _ranges.get().size() && ++_cf_names_idx < _cf_names.get().size()) {
_ranges_idx = 0;
}
return *this;
}
virtual_row_iterator operator++(int) {
virtual_row_iterator i(*this);
++(*this);
return i;
}
const value_type operator*() const {
return { _cf_names.get()[_cf_names_idx], _ranges.get()[_ranges_idx] };
}
bool operator==(const virtual_row_iterator& i) const {
return _cf_names_idx == i._cf_names_idx
&& _ranges_idx == i._ranges_idx;
}
bool operator!=(const virtual_row_iterator& i) const {
return !(*this == i);
}
};
future<> get_next_partition();
std::vector<db::system_keyspace::range_estimates>
estimates_for_current_keyspace(const database& db, std::vector<token_range> local_ranges) const {
auto pkey = partition_key::from_single_value(*_schema, utf8_type->decompose(*_current_partition));
auto cfs = db.find_keyspace(*_current_partition).metadata()->cf_meta_data();
auto cf_names = boost::copy_range<std::vector<bytes>>(cfs | boost::adaptors::transformed([] (auto&& cf) {
return utf8_type->decompose(cf.first);
}));
boost::sort(cf_names, [] (auto&& n1, auto&& n2) {
return utf8_type->less(n1, n2);
});
std::vector<db::system_keyspace::range_estimates> estimates;
for (auto& range : _slice.row_ranges(*_schema, pkey)) {
auto rows = boost::make_iterator_range(
virtual_row_iterator(cf_names, local_ranges),
virtual_row_iterator(cf_names, local_ranges, virtual_row_iterator::end_iterator_tag()));
auto rows_to_estimate = range.slice(rows, virtual_row_comparator(_schema));
for (auto&& r : rows_to_estimate) {
auto& cf = db.find_column_family(*_current_partition, utf8_type->to_string(r.cf_name));
estimates.push_back(estimate(cf, r.tokens));
if (estimates.size() >= _slice.partition_row_limit()) {
return estimates;
}
}
}
return estimates;
}
/**
* Returns the keyspaces, ordered by name, as selected by the partition_range.
*/
static ks_range get_keyspaces(const schema& s, const database& db, dht::partition_range range) {
struct keyspace_less_comparator {
const schema& _s;
keyspace_less_comparator(const schema& s) : _s(s) { }
dht::ring_position as_ring_position(const sstring& ks) {
auto pkey = partition_key::from_single_value(_s, utf8_type->decompose(ks));
return dht::global_partitioner().decorate_key(_s, std::move(pkey));
}
bool operator()(const sstring& ks1, const sstring& ks2) {
return as_ring_position(ks1).less_compare(_s, as_ring_position(ks2));
}
bool operator()(const sstring& ks, const dht::ring_position& rp) {
return as_ring_position(ks).less_compare(_s, rp);
}
bool operator()(const dht::ring_position& rp, const sstring& ks) {
return rp.less_compare(_s, as_ring_position(ks));
}
};
auto keyspaces = db.get_non_system_keyspaces();
auto cmp = keyspace_less_comparator(s);
boost::sort(keyspaces, cmp);
return boost::copy_range<ks_range>(range.slice(keyspaces, std::move(cmp)));
}
/**
* Makes a wrapping range of ring_position from a nonwrapping range of token, used to select sstables.
*/
static dht::partition_range as_ring_position_range(dht::token_range& r) {
std::optional<range<dht::ring_position>::bound> start_bound, end_bound;
if (r.start()) {
start_bound = {{ dht::ring_position(r.start()->value(), dht::ring_position::token_bound::start), r.start()->is_inclusive() }};
}
if (r.end()) {
end_bound = {{ dht::ring_position(r.end()->value(), dht::ring_position::token_bound::end), r.end()->is_inclusive() }};
}
return dht::partition_range(std::move(start_bound), std::move(end_bound), r.is_singular());
}
/**
* Add a new range_estimates for the specified range, considering the sstables associated with `cf`.
*/
static system_keyspace::range_estimates estimate(const column_family& cf, const token_range& r) {
int64_t count{0};
utils::estimated_histogram hist{0};
auto from_bytes = [] (auto& b) {
return dht::global_partitioner().from_sstring(utf8_type->to_string(b));
};
dht::token_range_vector ranges;
::compat::unwrap_into(
wrapping_range<dht::token>({{ from_bytes(r.start), false }}, {{ from_bytes(r.end) }}),
dht::token_comparator(),
[&] (auto&& rng) { ranges.push_back(std::move(rng)); });
for (auto&& r : ranges) {
auto rp_range = as_ring_position_range(r);
for (auto&& sstable : cf.select_sstables(rp_range)) {
count += sstable->estimated_keys_for_range(r);
hist.merge(sstable->get_stats_metadata().estimated_partition_size);
}
}
return {cf.schema(), r.start, r.end, count, count > 0 ? hist.mean() : 0};
}
estimates_for_current_keyspace(const database&, std::vector<token_range> local_ranges) const;
};
struct virtual_reader {
@@ -332,6 +68,12 @@ struct virtual_reader {
}
};
/**
* Returns the primary ranges for the local node.
* Used for testing as well.
*/
future<std::vector<token_range>> get_local_ranges();
} // namespace size_estimates
} // namespace db
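get_local_ranges() above merges the range starting at the minimum token with the range ending at the maximum token into one wrap-around entry, so that the text-only table matches how Cassandra presents its size estimates. The merge step in isolation (Python sketch, hypothetical (start, end) tuples with None for an open bound):

```python
def merge_wraparound(ranges):
    """Merge the ranges touching ring start/end into one wrap-around range.

    Each range is a (start, end) token pair; None marks an open bound,
    mirroring the left_inf/right_inf handling in get_local_ranges().
    """
    left = next((r for r in ranges if r[0] is None), None)   # starts at ring minimum
    right = next((r for r in ranges if r[1] is None), None)  # ends at ring maximum
    out = [r for r in ranges if r is not left and r is not right]
    if left is not None and right is not None and left is not right:
        # (right.start, left.end] wraps around the ring.
        out.append((right[0], left[1]))
    elif left is not None:
        out.append(left)
    elif right is not None:
        out.append(right)
    return sorted(out, key=lambda r: r[0])
```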

@@ -44,6 +44,11 @@ namespace db::view {
// columns. When reading the results from the scylla_views_builds_in_progress
// table, we adjust the clustering key (we shed the cpu_id column) and map
// back the regular columns.
// Since mutation fragment consumers expect clustering_row fragments
// not to be duplicated for given primary key, previous clustering key
// is stored between mutation fragments. If the clustering key becomes
// the same as the previous one (as a result of trimming cpu_id),
// the duplicated fragment is ignored.
class build_progress_virtual_reader {
database& _db;
@@ -55,6 +60,7 @@ class build_progress_virtual_reader {
const query::partition_slice& _legacy_slice;
query::partition_slice _slice;
flat_mutation_reader _underlying;
std::optional<clustering_key> _previous_clustering_key;
build_progress_reader(
schema_ptr legacy_schema,
@@ -79,7 +85,8 @@ class build_progress_virtual_reader {
pc,
std::move(trace_state),
fwd,
fwd_mr)) {
fwd_mr))
, _previous_clustering_key() {
}
const schema& underlying_schema() const {
@@ -127,8 +134,13 @@ class build_progress_virtual_reader {
legacy_in_progress_row.append_cell(_legacy_generation_number_col, std::move(c));
}
});
auto ck = adjust_ckey(scylla_in_progress_row.key());
if (_previous_clustering_key && ck.equal(*_schema, *_previous_clustering_key)) {
continue;
}
_previous_clustering_key = ck;
mf = clustering_row(
adjust_ckey(scylla_in_progress_row.key()),
std::move(ck),
std::move(scylla_in_progress_row.tomb()),
std::move(scylla_in_progress_row.marker()),
std::move(legacy_in_progress_row));
@@ -140,6 +152,8 @@ class build_progress_virtual_reader {
adjust_ckey(scylla_in_progress_rt.end),
scylla_in_progress_rt.end_kind,
scylla_in_progress_rt.tomb);
} else if (mf.is_end_of_partition()) {
_previous_clustering_key.reset();
}
push_mutation_fragment(std::move(mf));
}
@@ -192,4 +206,4 @@ public:
}
};
}
}
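The reader change above suppresses the duplicate clustering rows that appear once the cpu_id key component is trimmed: it remembers the previous clustering key and drops any row whose trimmed key repeats it, resetting the state at each partition end. The same filter reduced to a Python sketch (hypothetical fragment model):

```python
def dedup_rows(fragments):
    """Drop clustering rows whose (trimmed) key repeats the previous one.

    `fragments` is a list of ('row', key) and ('end_of_partition', None)
    pairs; the previous-key state resets at each partition end, just
    like _previous_clustering_key in build_progress_reader.
    """
    prev = None
    out = []
    for kind, key in fragments:
        if kind == 'row':
            if prev is not None and key == prev:
                continue  # duplicate after trimming cpu_id: ignored
            prev = key
        elif kind == 'end_of_partition':
            prev = None
        out.append((kind, key))
    return out
```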

@@ -83,7 +83,7 @@ view_info::view_info(const schema& schema, const raw_view_info& raw_view_info)
cql3::statements::select_statement& view_info::select_statement() const {
if (!_select_statement) {
shared_ptr<cql3::statements::raw::select_statement> raw;
if (is_index()) {
if (service::get_local_storage_service().db().local().find_column_family(base_id()).get_index_manager().is_global_index(_schema)) {
// Token column is the first clustering column
auto token_column_it = boost::range::find_if(_schema.all_columns(), std::mem_fn(&column_definition::is_clustering_key));
auto real_columns = _schema.all_columns() | boost::adaptors::filtered([this, token_column_it](const column_definition& cdef) {
@@ -143,10 +143,9 @@ void view_info::initialize_base_dependent_fields(const schema& base) {
}
bool view_info::is_index() const {
if (!_is_index) {
_is_index = service::get_local_storage_service().db().local().find_column_family(base_id()).get_index_manager().is_index(_schema);
}
return *_is_index;
//TODO(sarna): result of this call can be cached instead of calling index_manager::is_index every time
column_family& base_cf = service::get_local_storage_service().db().local().find_column_family(base_id());
return base_cf.get_index_manager().is_index(view_ptr(_schema.shared_from_this()));
}
namespace db {
@@ -450,7 +449,7 @@ void create_virtual_column(schema_builder& builder, const bytes& name, const dat
// A map has keys and values. We don't need these values,
// and can use empty values instead.
auto mtype = dynamic_pointer_cast<const map_type_impl>(type);
builder.with_column(name, map_type_impl::get_instance(mtype->get_values_type(), empty_type, true), column_kind::regular_column, column_view_virtual::yes);
builder.with_column(name, map_type_impl::get_instance(mtype->get_keys_type(), empty_type, true), column_kind::regular_column, column_view_virtual::yes);
} else if (ctype->is_set()) {
// A set's cell has nothing beyond the keys, so the
// virtual version of a set is, unfortunately, a complete
@@ -1158,6 +1157,10 @@ future<> view_builder::stop() {
return _sem.wait().then([this] {
_sem.broken();
return _build_step.join();
}).handle_exception_type([] (const broken_semaphore&) {
// ignored
}).handle_exception_type([] (const semaphore_timed_out&) {
// ignored
});
});
}
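view_builder::stop() above breaks the semaphore to wake the build step and then ignores the broken_semaphore / semaphore_timed_out exceptions that the teardown itself provokes, while still letting anything unexpected propagate. The shutdown pattern, as a Python sketch (hypothetical worker type):

```python
def stop(worker, expected_exceptions):
    """Tear down a worker, swallowing only the exceptions the teardown
    is expected to provoke (cf. broken_semaphore / semaphore_timed_out
    in view_builder::stop()); anything else propagates.
    """
    worker.abort()  # analogous to breaking the semaphore
    try:
        worker.join()
    except expected_exceptions:
        pass  # expected during shutdown: ignored
```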

@@ -24,7 +24,9 @@
namespace db::view {
future<> view_update_generator::start() {
_started = seastar::async([this]() mutable {
thread_attributes attr;
attr.sched_group = _db.get_streaming_scheduling_group();
_started = seastar::async(std::move(attr), [this]() mutable {
while (!_as.abort_requested()) {
if (_sstables_with_tables.empty()) {
_pending_sstables.wait().get();

dist/ami/build_ami.sh vendored
@@ -1,6 +1,7 @@
#!/bin/bash -e
PRODUCT=$(cat SCYLLA-PRODUCT-FILE)
./SCYLLA-VERSION-GEN
PRODUCT=$(cat build/SCYLLA-PRODUCT-FILE)
if [ ! -e dist/ami/build_ami.sh ]; then
echo "run build_ami.sh in top of scylla dir"
@@ -16,6 +17,7 @@ print_usage() {
exit 1
}
LOCALRPM=0
REPO_FOR_INSTALL=
while [ $# -gt 0 ]; do
case "$1" in
"--localrpm")
@@ -23,10 +25,12 @@ while [ $# -gt 0 ]; do
shift 1
;;
"--repo")
REPO_FOR_INSTALL=$2
INSTALL_ARGS="$INSTALL_ARGS --repo $2"
shift 2
;;
"--repo-for-install")
REPO_FOR_INSTALL=$2
INSTALL_ARGS="$INSTALL_ARGS --repo-for-install $2"
shift 2
;;
@@ -123,6 +127,43 @@ if [ $LOCALRPM -eq 1 ]; then
cd ../..
cp build/$PRODUCT-ami/build/RPMS/noarch/$PRODUCT-ami-`cat build/$PRODUCT-ami/build/SCYLLA-VERSION-FILE`-`cat build/$PRODUCT-ami/build/SCYLLA-RELEASE-FILE`.*.noarch.rpm dist/ami/files/$PRODUCT-ami.noarch.rpm
fi
if [ ! -f dist/ami/files/$PRODUCT-python3.x86_64.rpm ]; then
reloc/python3/build_reloc.sh
reloc/python3/build_rpm.sh
cp build/redhat/RPMS/x86_64/$PRODUCT-python3*.x86_64.rpm dist/ami/files/$PRODUCT-python3.x86_64.rpm
fi
SCYLLA_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} dist/ami/files/$PRODUCT.x86_64.rpm || true)
SCYLLA_AMI_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} dist/ami/files/$PRODUCT-ami.noarch.rpm || true)
SCYLLA_JMX_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} dist/ami/files/$PRODUCT-jmx.noarch.rpm || true)
SCYLLA_TOOLS_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} dist/ami/files/$PRODUCT-tools.noarch.rpm || true)
SCYLLA_PYTHON3_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} dist/ami/files/$PRODUCT-python3.x86_64.rpm || true)
else
if [ -z "$REPO_FOR_INSTALL" ]; then
print_usage
exit 1
fi
if [ ! -f /usr/bin/yumdownloader ]; then
if is_redhat_variant; then
sudo yum install /usr/bin/yumdownloader
else
sudo apt-get install yum-utils
fi
fi
if [ ! -f /usr/bin/curl ]; then
pkg_install curl
fi
TMPREPO=$(mktemp -u -p /etc/yum.repos.d/ --suffix .repo)
sudo curl -o $TMPREPO $REPO_FOR_INSTALL
rm -rf build/ami_packages
mkdir -p build/ami_packages
yumdownloader --downloaddir build/ami_packages/ $PRODUCT $PRODUCT-kernel-conf $PRODUCT-conf $PRODUCT-server $PRODUCT-debuginfo $PRODUCT-ami $PRODUCT-jmx $PRODUCT-tools-core $PRODUCT-tools $PRODUCT-python3
sudo rm -f $TMPREPO
SCYLLA_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} build/ami_packages/$PRODUCT-[0-9]*.rpm || true)
SCYLLA_AMI_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} build/ami_packages/$PRODUCT-ami-*.rpm || true)
SCYLLA_JMX_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} build/ami_packages/$PRODUCT-jmx-*.rpm || true)
SCYLLA_TOOLS_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} build/ami_packages/$PRODUCT-tools-[0-9]*.rpm || true)
SCYLLA_PYTHON3_VERSION=$(rpm -q --qf %{VERSION}-%{RELEASE} build/ami_packages/$PRODUCT-python3-*.rpm || true)
fi
cd dist/ami
@@ -147,4 +188,4 @@ if [ ! -d packer ]; then
cd -
fi
env PACKER_LOG=1 PACKER_LOG_PATH=../../build/ami.log packer/packer build -var-file=variables.json -var install_args="$INSTALL_ARGS" -var region="$REGION" -var source_ami="$AMI" -var ssh_username="$SSH_USERNAME" scylla.json
env PACKER_LOG=1 PACKER_LOG_PATH=../../build/ami.log packer/packer build -var-file=variables.json -var install_args="$INSTALL_ARGS" -var region="$REGION" -var source_ami="$AMI" -var ssh_username="$SSH_USERNAME" -var scylla_version="$SCYLLA_VERSION" -var scylla_ami_version="$SCYLLA_AMI_VERSION" -var scylla_jmx_version="$SCYLLA_JMX_VERSION" -var scylla_tools_version="$SCYLLA_TOOLS_VERSION" -var scylla_python3_version="$SCYLLA_PYTHON3_VERSION" scylla.json

dist/ami/scylla.json vendored
@@ -56,7 +56,15 @@
"ssh_username": "{{user `ssh_username`}}",
"subnet_id": "{{user `subnet_id`}}",
"type": "amazon-ebs",
"user_data_file": "user_data.txt"
"user_data_file": "user_data.txt",
"ami_description": "scylla-{{user `scylla_version`}} scylla-ami-{{user `scylla_ami_version`}} scylla-jmx-{{user `scylla_jmx_version`}} scylla-tools-{{user `scylla_tools_version`}} scylla-python3-{{user `scylla_python3_version`}}",
"tags": {
"ScyllaVersion": "{{user `scylla_version`}}",
"ScyllaAMIVersion": "{{user `scylla_ami_version`}}",
"ScyllaJMXVersion": "{{user `scylla_jmx_version`}}",
"ScyllaToolsVersion": "{{user `scylla_tools_version`}}",
"ScyllaPython3Version": "{{user `scylla_python3_version`}}"
}
}
],
"provisioners": [

@@ -60,6 +60,17 @@ if __name__ == "__main__":
disk_properties["read_bandwidth"] = 2015342735 * nr_disks
disk_properties["write_iops"] = 181500 * nr_disks
disk_properties["write_bandwidth"] = 808775652 * nr_disks
elif idata.instance_class() == "i3en":
if idata.instance() in ("i3en.large", "i3en.xlarge", "i3en.2xlarge"):
disk_properties["read_iops"] = 46489
disk_properties["read_bandwidth"] = 353437280
disk_properties["write_iops"] = 36680
disk_properties["write_bandwidth"] = 164766656
else:
disk_properties["read_iops"] = 278478 * nr_disks
disk_properties["read_bandwidth"] = 3029172992 * nr_disks
disk_properties["write_iops"] = 221909 * nr_disks
disk_properties["write_bandwidth"] = 1020482432 * nr_disks
elif idata.instance_class() == "i2":
disk_properties["read_iops"] = 64000 * nr_disks
disk_properties["read_bandwidth"] = 507338935 * nr_disks

@@ -95,6 +95,9 @@ def do_verify_package(pkg):
res = run('rpm -q {}'.format(pkg), silent=True, exception=False)
elif is_gentoo_variant():
res = 0 if len(glob.glob('/var/db/pkg/*/{}-*'.format(pkg))) else 1
else:
print("OS variant not recognized")
res = 1
if res != 0:
print('{} package is not installed.'.format(pkg))
sys.exit(1)
@@ -252,22 +255,22 @@ if __name__ == '__main__':
if not os.path.exists('/etc/scylla.d/housekeeping.cfg'):
version_check = interactive_ask_service('Do you want to enable Scylla to check if there is a newer version of Scylla available?', 'Yes - start the Scylla-housekeeping service to check for a newer version. This check runs periodically. No - skips this step.', version_check)
args.no_version_check = not version_check
if version_check:
with open('/etc/scylla.d/housekeeping.cfg', 'w') as f:
f.write('[housekeeping]\ncheck-version: True\n')
if is_systemd():
systemd_unit('scylla-housekeeping-daily.timer').unmask()
systemd_unit('scylla-housekeeping-restart.timer').unmask()
else:
with open('/etc/scylla.d/housekeeping.cfg', 'w') as f:
f.write('[housekeeping]\ncheck-version: False\n')
if is_systemd():
hk_daily = systemd_unit('scylla-housekeeping-daily.timer')
hk_daily.mask()
hk_daily.stop()
hk_restart = systemd_unit('scylla-housekeeping-restart.timer')
hk_restart.mask()
hk_restart.stop()
if version_check:
with open('/etc/scylla.d/housekeeping.cfg', 'w') as f:
f.write('[housekeeping]\ncheck-version: True\n')
if is_systemd():
systemd_unit('scylla-housekeeping-daily.timer').unmask()
systemd_unit('scylla-housekeeping-restart.timer').unmask()
else:
with open('/etc/scylla.d/housekeeping.cfg', 'w') as f:
f.write('[housekeeping]\ncheck-version: False\n')
if is_systemd():
hk_daily = systemd_unit('scylla-housekeeping-daily.timer')
hk_daily.mask()
hk_daily.stop()
hk_restart = systemd_unit('scylla-housekeeping-restart.timer')
hk_restart.mask()
hk_restart.stop()
cur_version=out('scylla --version', exception=False)
if len(cur_version) > 0:

@@ -119,7 +119,7 @@ class aws_instance:
return self._type.split(".")[0]
def is_supported_instance_class(self):
if self.instance_class() in ['i2', 'i3']:
if self.instance_class() in ['i2', 'i3', 'i3en']:
return True
return False
@@ -128,7 +128,7 @@ class aws_instance:
instance_size = self.instance_size()
if instance_class in ['c3', 'c4', 'd2', 'i2', 'r3']:
return 'ixgbevf'
if instance_class in ['c5', 'c5d', 'f1', 'g3', 'h1', 'i3', 'm5', 'm5d', 'p2', 'p3', 'r4', 'x1']:
if instance_class in ['c5', 'c5d', 'f1', 'g3', 'h1', 'i3', 'i3en', 'm5', 'm5d', 'p2', 'p3', 'r4', 'x1']:
return 'ena'
if instance_class == 'm4':
if instance_size == '16xlarge':
@@ -304,7 +304,7 @@ def parse_os_release_line(line):
val = shlex.split(data)[0]
return (id, val.split(' ') if id == 'ID' or id == 'ID_LIKE' else val)
os_release = dict([parse_os_release_line(x) for x in open('/etc/os-release').read().splitlines()])
os_release = dict([parse_os_release_line(x) for x in open('/etc/os-release').read().splitlines() if re.match(r'\w+=', x) ])
def is_debian_variant():
d = os_release['ID_LIKE'] if 'ID_LIKE' in os_release else os_release['ID']
@@ -313,7 +313,7 @@ def is_debian_variant():
def is_redhat_variant():
d = os_release['ID_LIKE'] if 'ID_LIKE' in os_release else os_release['ID']
return ('rhel' in d) or ('fedora' in d)
return ('rhel' in d) or ('fedora' in d) or ('ol' in d)
def is_gentoo_variant():
return ('gentoo' in os_release['ID'])
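The scylla_setup change above hardens /etc/os-release parsing by skipping any line that does not look like a KEY=value assignment (the `re.match(r'\w+=', x)` filter), so blank lines and comments no longer break the parser. A standalone sketch of the same parsing logic:

```python
import re
import shlex

def parse_os_release(text):
    """Parse /etc/os-release content into a dict.

    ID and ID_LIKE are split into word lists, as scylla_setup does;
    lines without a KEY= prefix (comments, blanks) are skipped.
    """
    release = {}
    for line in text.splitlines():
        if not re.match(r'\w+=', line):
            continue
        key, _, data = line.partition('=')
        val = shlex.split(data)[0] if data else ''
        release[key] = val.split(' ') if key in ('ID', 'ID_LIKE') else val
    return release
```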
@@ -476,6 +476,8 @@ def create_perftune_conf(nic='eth0'):
def is_valid_nic(nic):
if len(nic) == 0:
return False
return os.path.exists('/sys/class/net/{}'.format(nic))
# Remove this when we do not support SET_NIC configuration value anymore

@@ -16,7 +16,7 @@ Conflicts: {{product}}-server (<< 1.1)
Package: {{product}}-server
Architecture: amd64
Depends: ${shlibs:Depends}, ${misc:Depends}, adduser, hwloc-nox, {{product}}-conf, python-yaml, python-urwid, python-requests, curl, util-linux, python3-yaml, python3, uuid-runtime, pciutils, python3-pyudev, gzip, realpath | coreutils, num-utils, file
Depends: ${shlibs:Depends}, ${misc:Depends}, adduser, hwloc-nox, {{product}}-conf, {{product}}-python3, curl, util-linux, uuid-runtime, pciutils, gzip, realpath | coreutils, num-utils, file
Description: Scylla database server binaries
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.

dist/debian/debian/adjust_bin vendored Executable file
@@ -0,0 +1,30 @@
#!/bin/bash -ex
root="$1"
bin="$2"
prefix="/opt/scylladb"
[ "$bin" = patchelf ] && exit 0
patchelf() {
# patchelf comes from the build system, so it needs the build system's ld.so and
# shared libraries. We can't use patchelf on patchelf itself, so invoke it via
# ld.so.
LD_LIBRARY_PATH="$root/$prefix/bin/libreloc" "$root/$prefix"/libreloc/ld.so "$root/$prefix"/libexec/patchelf "$@"
}
# We could add --set-rpath too, but then debugedit (called by rpmbuild) barfs
# on the result. So use LD_LIBRARY_PATH in the thunk, below.
patchelf \
--set-interpreter "$prefix/libreloc/ld.so" \
"$root/$prefix/libexec/$bin"
mkdir -p "$root/$prefix/bin"
cat > "$root/$prefix/bin/$bin" <<EOF
#!/bin/bash -e
export GNUTLS_SYSTEM_PRIORITY_FILE="\${GNUTLS_SYSTEM_PRIORITY_FILE-$prefix/libreloc/gnutls.config}"
export LD_LIBRARY_PATH="$prefix/libreloc"
exec -a "\$0" "$prefix/libexec/$bin" "\$@"
EOF
chmod +x "$root/$prefix/bin/$bin"

dist/debian/python3/build_deb.sh vendored Executable file
@@ -0,0 +1,140 @@
#!/bin/bash -e
PRODUCT=$(cat SCYLLA-PRODUCT-FILE)
. /etc/os-release
print_usage() {
echo "build_deb.sh --reloc-pkg build/release/scylla-python3-package.tar.gz"
echo " --reloc-pkg specify relocatable package path"
exit 1
}
TARGET=stable
RELOC_PKG=
while [ $# -gt 0 ]; do
case "$1" in
"--reloc-pkg")
RELOC_PKG=$2
shift 2
;;
*)
print_usage
;;
esac
done
is_redhat_variant() {
[ -f /etc/redhat-release ]
}
is_debian_variant() {
[ -f /etc/debian_version ]
}
pkg_install() {
if is_redhat_variant; then
sudo yum install -y $1
elif is_debian_variant; then
sudo apt-get install -y $1
else
echo "Requires to install following command: $1"
exit 1
fi
}
if [ ! -e SCYLLA-RELOCATABLE-FILE ]; then
echo "do not directly execute build_deb.sh, use reloc/build_deb.sh instead."
exit 1
fi
if [ "$(arch)" != "x86_64" ]; then
echo "Unsupported architecture: $(arch)"
exit 1
fi
if [ -z "$RELOC_PKG" ]; then
print_usage
exit 1
fi
if [ ! -f "$RELOC_PKG" ]; then
echo "$RELOC_PKG is not found."
exit 1
fi
if [ -e debian ]; then
rm -rf debian
fi
if is_debian_variant; then
sudo apt-get -y update
fi
# this hack is needed since some environments install the 'git-core' package; it's
# a subset of the git command and doesn't work for our git-archive-all script.
if is_redhat_variant && [ ! -f /usr/libexec/git-core/git-submodule ]; then
sudo yum install -y git
fi
if [ ! -f /usr/bin/git ]; then
pkg_install git
fi
if [ ! -f /usr/bin/python ]; then
pkg_install python
fi
if [ ! -f /usr/bin/debuild ]; then
pkg_install devscripts
fi
if [ ! -f /usr/bin/dh_testdir ]; then
pkg_install debhelper
fi
if [ ! -f /usr/bin/fakeroot ]; then
pkg_install fakeroot
fi
if [ ! -f /usr/bin/pystache ]; then
if is_redhat_variant; then
sudo yum install -y /usr/bin/pystache
elif is_debian_variant; then
sudo apt-get install -y python-pystache
fi
fi
if [ ! -f /usr/bin/file ]; then
pkg_install file
fi
if is_debian_variant && [ ! -f /usr/share/doc/python-pkg-resources/copyright ]; then
sudo apt-get install -y python-pkg-resources
fi
if [ "$ID" = "ubuntu" ] && [ ! -f /usr/share/keyrings/debian-archive-keyring.gpg ]; then
sudo apt-get install -y debian-archive-keyring
fi
if [ "$ID" = "debian" ] && [ ! -f /usr/share/keyrings/ubuntu-archive-keyring.gpg ]; then
sudo apt-get install -y ubuntu-archive-keyring
fi
if [ -z "$TARGET" ]; then
if is_debian_variant; then
if [ ! -f /usr/bin/lsb_release ]; then
pkg_install lsb-release
fi
TARGET=`lsb_release -c|awk '{print $2}'`
else
echo "Please specify target"
exit 1
fi
fi
RELOC_PKG_FULLPATH=$(readlink -f $RELOC_PKG)
RELOC_PKG_BASENAME=$(basename $RELOC_PKG)
SCYLLA_VERSION=$(cat SCYLLA-VERSION-FILE)
SCYLLA_RELEASE=$(cat SCYLLA-RELEASE-FILE)
ln -fv $RELOC_PKG_FULLPATH ../$PRODUCT-python3_$SCYLLA_VERSION-$SCYLLA_RELEASE.orig.tar.gz
cp -al dist/debian/python3/debian debian
if [ "$PRODUCT" != "scylla" ]; then
for i in debian/scylla-*;do
mv $i ${i/scylla-/$PRODUCT-}
done
fi
REVISION="1"
MUSTACHE_DIST="\"debian\": true, \"product\": \"$PRODUCT\", \"$PRODUCT\": true"
pystache dist/debian/python3/changelog.mustache "{ $MUSTACHE_DIST, \"version\": \"$SCYLLA_VERSION\", \"release\": \"$SCYLLA_RELEASE\", \"revision\": \"$REVISION\", \"codename\": \"$TARGET\" }" > debian/changelog
pystache dist/debian/python3/rules.mustache "{ $MUSTACHE_DIST }" > debian/rules
pystache dist/debian/python3/control.mustache "{ $MUSTACHE_DIST }" > debian/control
chmod a+rx debian/rules
debuild -rfakeroot -us -uc

@@ -0,0 +1,5 @@
{{product}}-python3 ({{version}}-{{release}}-{{revision}}) {{codename}}; urgency=medium
* Initial release.
-- Takuya ASADA <syuu@scylladb.com> Mon, 24 Aug 2015 09:22:55 +0000

dist/debian/python3/control.mustache vendored Normal file
@@ -0,0 +1,16 @@
Source: {{product}}-python3
Maintainer: Takuya ASADA <syuu@scylladb.com>
Homepage: http://scylladb.com
Section: python
Priority: optional
X-Python3-Version: >= 3.4
Standards-Version: 3.9.5

Package: {{product}}-python3
Architecture: amd64
Description: A standalone python3 interpreter that can be moved around different Linux machines
 This is a self-contained python interpreter that can be moved around
 different Linux machines as long as they run a new enough kernel (where
 new enough is defined by whichever Python module uses any kernel
 functionality). All shared libraries needed for the interpreter to
 operate are shipped with it.

1
dist/debian/python3/debian/compat vendored Normal file

@@ -0,0 +1 @@
9

995
dist/debian/python3/debian/copyright vendored Normal file

@@ -0,0 +1,995 @@
This package was put together by Klee Dienes <klee@debian.org> from
sources from ftp.python.org:/pub/python, based on the Debianization by
the previous maintainers Bernd S. Brentrup <bsb@uni-muenster.de> and
Bruce Perens. Current maintainer is Matthias Klose <doko@debian.org>.
It was downloaded from http://python.org/
Copyright:
Upstream Author: Guido van Rossum <guido@cwi.nl> and others.
License:
The following text includes the Python license and licenses and
acknowledgements for incorporated software. The licenses can be read
in the HTML and texinfo versions of the documentation as well, after
installing the pythonx.y-doc package. Licenses for files not licensed
under the Python Licenses are found at the end of this file.
Python License
==============
A. HISTORY OF THE SOFTWARE
==========================
Python was created in the early 1990s by Guido van Rossum at Stichting
Mathematisch Centrum (CWI, see http://www.cwi.nl) in the Netherlands
as a successor of a language called ABC. Guido remains Python's
principal author, although it includes many contributions from others.
In 1995, Guido continued his work on Python at the Corporation for
National Research Initiatives (CNRI, see http://www.cnri.reston.va.us)
in Reston, Virginia where he released several versions of the
software.
In May 2000, Guido and the Python core development team moved to
BeOpen.com to form the BeOpen PythonLabs team. In October of the same
year, the PythonLabs team moved to Digital Creations (now Zope
Corporation, see http://www.zope.com). In 2001, the Python Software
Foundation (PSF, see http://www.python.org/psf/) was formed, a
non-profit organization created specifically to own Python-related
Intellectual Property. Zope Corporation is a sponsoring member of
the PSF.
All Python releases are Open Source (see http://www.opensource.org for
the Open Source Definition). Historically, most, but not all, Python
releases have also been GPL-compatible; the table below summarizes
the various releases.
    Release         Derived from    Year        Owner       GPL-compatible? (1)

    0.9.0 thru 1.2                  1991-1995   CWI         yes
    1.3 thru 1.5.2  1.2             1995-1999   CNRI        yes
    1.6             1.5.2           2000        CNRI        no
    2.0             1.6             2000        BeOpen.com  no
    1.6.1           1.6             2001        CNRI        yes (2)
    2.1             2.0+1.6.1       2001        PSF         no
    2.0.1           2.0+1.6.1       2001        PSF         yes
    2.1.1           2.1+2.0.1       2001        PSF         yes
    2.2             2.1.1           2001        PSF         yes
    2.1.2           2.1.1           2002        PSF         yes
    2.1.3           2.1.2           2002        PSF         yes
    2.2 and above   2.1.1           2001-now    PSF         yes
Footnotes:
(1) GPL-compatible doesn't mean that we're distributing Python under
the GPL. All Python licenses, unlike the GPL, let you distribute
a modified version without making your changes open source. The
GPL-compatible licenses make it possible to combine Python with
other software that is released under the GPL; the others don't.
(2) According to Richard Stallman, 1.6.1 is not GPL-compatible,
because its license has a choice of law clause. According to
CNRI, however, Stallman's lawyer has told CNRI's lawyer that 1.6.1
is "not incompatible" with the GPL.
Thanks to the many outside volunteers who have worked under Guido's
direction to make these releases possible.
B. TERMS AND CONDITIONS FOR ACCESSING OR OTHERWISE USING PYTHON
===============================================================
PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
--------------------------------------------
1. This LICENSE AGREEMENT is between the Python Software Foundation
("PSF"), and the Individual or Organization ("Licensee") accessing and
otherwise using this software ("Python") in source or binary form and
its associated documentation.
2. Subject to the terms and conditions of this License Agreement, PSF
hereby grants Licensee a nonexclusive, royalty-free, world-wide
license to reproduce, analyze, test, perform and/or display publicly,
prepare derivative works, distribute, and otherwise use Python alone
or in any derivative version, provided, however, that PSF's License
Agreement and PSF's notice of copyright, i.e., "Copyright (c) 2001,
2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
2013, 2014 Python Software Foundation; All Rights Reserved" are
retained in Python alone or in any derivative version prepared by
Licensee.
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python.
4. PSF is making Python available to Licensee on an "AS IS"
basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. Nothing in this License Agreement shall be deemed to create any
relationship of agency, partnership, or joint venture between PSF and
Licensee. This License Agreement does not grant permission to use PSF
trademarks or trade name in a trademark sense to endorse or promote
products or services of Licensee, or any third party.
8. By copying, installing or otherwise using Python, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
BEOPEN.COM LICENSE AGREEMENT FOR PYTHON 2.0
-------------------------------------------
BEOPEN PYTHON OPEN SOURCE LICENSE AGREEMENT VERSION 1
1. This LICENSE AGREEMENT is between BeOpen.com ("BeOpen"), having an
office at 160 Saratoga Avenue, Santa Clara, CA 95051, and the
Individual or Organization ("Licensee") accessing and otherwise using
this software in source or binary form and its associated
documentation ("the Software").
2. Subject to the terms and conditions of this BeOpen Python License
Agreement, BeOpen hereby grants Licensee a non-exclusive,
royalty-free, world-wide license to reproduce, analyze, test, perform
and/or display publicly, prepare derivative works, distribute, and
otherwise use the Software alone or in any derivative version,
provided, however, that the BeOpen Python License is retained in the
Software, alone or in any derivative version prepared by Licensee.
3. BeOpen is making the Software available to Licensee on an "AS IS"
basis. BEOPEN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, BEOPEN MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
4. BEOPEN SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE
SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS
AS A RESULT OF USING, MODIFYING OR DISTRIBUTING THE SOFTWARE, OR ANY
DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
5. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
6. This License Agreement shall be governed by and interpreted in all
respects by the law of the State of California, excluding conflict of
law provisions. Nothing in this License Agreement shall be deemed to
create any relationship of agency, partnership, or joint venture
between BeOpen and Licensee. This License Agreement does not grant
permission to use BeOpen trademarks or trade names in a trademark
sense to endorse or promote products or services of Licensee, or any
third party. As an exception, the "BeOpen Python" logos available at
http://www.pythonlabs.com/logos.html may be used according to the
permissions granted on that web page.
7. By copying, installing or otherwise using the software, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
CNRI LICENSE AGREEMENT FOR PYTHON 1.6.1
---------------------------------------
1. This LICENSE AGREEMENT is between the Corporation for National
Research Initiatives, having an office at 1895 Preston White Drive,
Reston, VA 20191 ("CNRI"), and the Individual or Organization
("Licensee") accessing and otherwise using Python 1.6.1 software in
source or binary form and its associated documentation.
2. Subject to the terms and conditions of this License Agreement, CNRI
hereby grants Licensee a nonexclusive, royalty-free, world-wide
license to reproduce, analyze, test, perform and/or display publicly,
prepare derivative works, distribute, and otherwise use Python 1.6.1
alone or in any derivative version, provided, however, that CNRI's
License Agreement and CNRI's notice of copyright, i.e., "Copyright (c)
1995-2001 Corporation for National Research Initiatives; All Rights
Reserved" are retained in Python 1.6.1 alone or in any derivative
version prepared by Licensee. Alternately, in lieu of CNRI's License
Agreement, Licensee may substitute the following text (omitting the
quotes): "Python 1.6.1 is made available subject to the terms and
conditions in CNRI's License Agreement. This Agreement together with
Python 1.6.1 may be located on the Internet using the following
unique, persistent identifier (known as a handle): 1895.22/1013. This
Agreement may also be obtained from a proxy server on the Internet
using the following URL: http://hdl.handle.net/1895.22/1013".
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python 1.6.1 or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python 1.6.1.
4. CNRI is making Python 1.6.1 available to Licensee on an "AS IS"
basis. CNRI MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, CNRI MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON 1.6.1 WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. CNRI SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
1.6.1 FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON 1.6.1,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. This License Agreement shall be governed by the federal
intellectual property law of the United States, including without
limitation the federal copyright law, and, to the extent such
U.S. federal law does not apply, by the law of the Commonwealth of
Virginia, excluding Virginia's conflict of law provisions.
Notwithstanding the foregoing, with regard to derivative works based
on Python 1.6.1 that incorporate non-separable material that was
previously distributed under the GNU General Public License (GPL), the
law of the Commonwealth of Virginia shall govern this License
Agreement only as to issues arising under or with respect to
Paragraphs 4, 5, and 7 of this License Agreement. Nothing in this
License Agreement shall be deemed to create any relationship of
agency, partnership, or joint venture between CNRI and Licensee. This
License Agreement does not grant permission to use CNRI trademarks or
trade name in a trademark sense to endorse or promote products or
services of Licensee, or any third party.
8. By clicking on the "ACCEPT" button where indicated, or by copying,
installing or otherwise using Python 1.6.1, Licensee agrees to be
bound by the terms and conditions of this License Agreement.
ACCEPT
CWI LICENSE AGREEMENT FOR PYTHON 0.9.0 THROUGH 1.2
--------------------------------------------------
Copyright (c) 1991 - 1995, Stichting Mathematisch Centrum Amsterdam,
The Netherlands. All rights reserved.
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Stichting Mathematisch
Centrum or CWI not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Licenses and Acknowledgements for Incorporated Software
=======================================================
Mersenne Twister
----------------
The `_random' module includes code based on a download from
`http://www.math.keio.ac.jp/~matumoto/MT2002/emt19937ar.html'. The
following are the verbatim comments from the original code:
A C-program for MT19937, with initialization improved 2002/1/26.
Coded by Takuji Nishimura and Makoto Matsumoto.
Before using, initialize the state by using init_genrand(seed)
or init_by_array(init_key, key_length).
Copyright (C) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura,
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. The names of its contributors may not be used to endorse or promote
products derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Any feedback is very welcome.
http://www.math.keio.ac.jp/matumoto/emt.html
email: matumoto@math.keio.ac.jp
Sockets
-------
The `socket' module uses the functions, `getaddrinfo', and
`getnameinfo', which are coded in separate source files from the WIDE
Project, `http://www.wide.ad.jp/about/index.html'.
Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the name of the project nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
GAI_ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
FOR GAI_ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON GAI_ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN GAI_ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.
Floating point exception control
--------------------------------
The source for the `fpectl' module includes the following notice:
---------------------------------------------------------------------
/ Copyright (c) 1996. \
| The Regents of the University of California. |
| All rights reserved. |
| |
| Permission to use, copy, modify, and distribute this software for |
| any purpose without fee is hereby granted, provided that this en- |
| tire notice is included in all copies of any software which is or |
| includes a copy or modification of this software and in all |
| copies of the supporting documentation for such software. |
| |
| This work was produced at the University of California, Lawrence |
| Livermore National Laboratory under contract no. W-7405-ENG-48 |
| between the U.S. Department of Energy and The Regents of the |
| University of California for the operation of UC LLNL. |
| |
| DISCLAIMER |
| |
| This software was prepared as an account of work sponsored by an |
| agency of the United States Government. Neither the United States |
| Government nor the University of California nor any of their em- |
| ployees, makes any warranty, express or implied, or assumes any |
| liability or responsibility for the accuracy, completeness, or |
| usefulness of any information, apparatus, product, or process |
| disclosed, or represents that its use would not infringe |
| privately-owned rights. Reference herein to any specific commer- |
| cial products, process, or service by trade name, trademark, |
| manufacturer, or otherwise, does not necessarily constitute or |
| imply its endorsement, recommendation, or favoring by the United |
| States Government or the University of California. The views and |
| opinions of authors expressed herein do not necessarily state or |
| reflect those of the United States Government or the University |
| of California, and shall not be used for advertising or product |
\ endorsement purposes. /
---------------------------------------------------------------------
Cookie management
-----------------
The `Cookie' module contains the following notice:
Copyright 2000 by Timothy O'Malley <timo@alum.mit.edu>
All Rights Reserved
Permission to use, copy, modify, and distribute this software
and its documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appear in all
copies and that both that copyright notice and this permission
notice appear in supporting documentation, and that the name of
Timothy O'Malley not be used in advertising or publicity
pertaining to distribution of the software without specific, written
prior permission.
Timothy O'Malley DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS
SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS, IN NO EVENT SHALL Timothy O'Malley BE LIABLE FOR
ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
Execution tracing
-----------------
The `trace' module contains the following notice:
portions copyright 2001, Autonomous Zones Industries, Inc., all rights...
err... reserved and offered to the public under the terms of the
Python 2.2 license.
Author: Zooko O'Whielacronx
http://zooko.com/
mailto:zooko@zooko.com
Copyright 2000, Mojam Media, Inc., all rights reserved.
Author: Skip Montanaro
Copyright 1999, Bioreason, Inc., all rights reserved.
Author: Andrew Dalke
Copyright 1995-1997, Automatrix, Inc., all rights reserved.
Author: Skip Montanaro
Copyright 1991-1995, Stichting Mathematisch Centrum, all rights reserved.
Permission to use, copy, modify, and distribute this Python software and
its associated documentation for any purpose without fee is hereby
granted, provided that the above copyright notice appears in all copies,
and that both that copyright notice and this permission notice appear in
supporting documentation, and that the name of neither Automatrix,
Bioreason or Mojam Media be used in advertising or publicity pertaining
to distribution of the software without specific, written prior
permission.
UUencode and UUdecode functions
-------------------------------
The `uu' module contains the following notice:
Copyright 1994 by Lance Ellinghouse
Cathedral City, California Republic, United States of America.
All Rights Reserved
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Lance Ellinghouse
not be used in advertising or publicity pertaining to distribution
of the software without specific, written prior permission.
LANCE ELLINGHOUSE DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL LANCE ELLINGHOUSE CENTRUM BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Modified by Jack Jansen, CWI, July 1995:
- Use binascii module to do the actual line-by-line conversion
between ascii and binary. This results in a 1000-fold speedup. The C
version is still 5 times faster, though.
- Arguments more compliant with python standard
XML Remote Procedure Calls
--------------------------
The `xmlrpclib' module contains the following notice:
The XML-RPC client interface is
Copyright (c) 1999-2002 by Secret Labs AB
Copyright (c) 1999-2002 by Fredrik Lundh
By obtaining, using, and/or copying this software and/or its
associated documentation, you agree that you have read, understood,
and will comply with the following terms and conditions:
Permission to use, copy, modify, and distribute this software and
its associated documentation for any purpose and without fee is
hereby granted, provided that the above copyright notice appears in
all copies, and that both that copyright notice and this permission
notice appear in supporting documentation, and that the name of
Secret Labs AB or the author not be used in advertising or publicity
pertaining to distribution of the software without specific, written
prior permission.
SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD
TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANT-
ABILITY AND FITNESS. IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR
BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY
DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
OF THIS SOFTWARE.
Licenses for Software linked to
===============================
Note that the choice of GPL compatibility outlined above doesn't extend
to modules linked to particular libraries, since they change the
effective License of the module binary.
GNU Readline
------------
The 'readline' module makes use of GNU Readline.
The GNU Readline Library is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2, or (at
your option) any later version.
On Debian systems, you can find the complete statement in
/usr/share/doc/readline-common/copyright'. A copy of the GNU General
Public License is available in /usr/share/common-licenses/GPL-2'.
OpenSSL
-------
The '_ssl' module makes use of OpenSSL.
The OpenSSL toolkit stays under a dual license, i.e. both the
conditions of the OpenSSL License and the original SSLeay license
apply to the toolkit. Actually both licenses are BSD-style Open
Source licenses. Note that both licenses are incompatible with
the GPL.
On Debian systems, you can find the complete license text in
/usr/share/doc/openssl/copyright'.
Files with other licenses than the Python License
-------------------------------------------------
Files: Include/dynamic_annotations.h
Files: Python/dynamic_annotations.c
Copyright: (c) 2008-2009, Google Inc.
License: Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Neither the name of Google Inc. nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Files: Include/unicodeobject.h
Copyright: (c) Corporation for National Research Initiatives.
Copyright: (c) 1999 by Secret Labs AB.
Copyright: (c) 1999 by Fredrik Lundh.
License: By obtaining, using, and/or copying this software and/or its
associated documentation, you agree that you have read, understood,
and will comply with the following terms and conditions:
Permission to use, copy, modify, and distribute this software and its
associated documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appears in all
copies, and that both that copyright notice and this permission notice
appear in supporting documentation, and that the name of Secret Labs
AB or the author not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS. IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR BE LIABLE FOR
ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Files: Lib/logging/*
Copyright: 2001-2010 by Vinay Sajip. All Rights Reserved.
License: Permission to use, copy, modify, and distribute this software and
its documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Vinay Sajip
not be used in advertising or publicity pertaining to distribution
of the software without specific, written prior permission.
VINAY SAJIP DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
VINAY SAJIP BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER
IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Files: Lib/multiprocessing/*
Files: Modules/_multiprocessing/*
Copyright: (c) 2006-2008, R Oudkerk. All rights reserved.
License: Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the name of author nor the names of any contributors may be
used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
Files: Lib/sqlite3/*
Files: Modules/_sqlite/*
Copyright: (C) 2004-2005 Gerhard Häring <gh@ghaering.de>
License: This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Files: Lib/async*
Copyright: Copyright 1996 by Sam Rushing
License: Permission to use, copy, modify, and distribute this software and
its documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appear in all
copies and that both that copyright notice and this permission
notice appear in supporting documentation, and that the name of Sam
Rushing not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
SAM RUSHING DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN
NO EVENT SHALL SAM RUSHING BE LIABLE FOR ANY SPECIAL, INDIRECT OR
CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Files: Lib/tarfile.py
Copyright: (C) 2002 Lars Gustaebel <lars@gustaebel.de>
License: Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
Files: Lib/turtle.py
Copyright: (C) 2006 - 2010 Gregor Lingl
License: This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
is copyright Gregor Lingl and licensed under a BSD-like license
Files: Modules/_ctypes/libffi/*
Copyright: Copyright (C) 1996-2011 Red Hat, Inc and others.
Copyright (C) 1996-2011 Anthony Green
Copyright (C) 1996-2010 Free Software Foundation, Inc
Copyright (c) 2003, 2004, 2006, 2007, 2008 Kaz Kojima
Copyright (c) 2010, 2011, Plausible Labs Cooperative, Inc.
Copyright (c) 2010 CodeSourcery
Copyright (c) 1998 Andreas Schwab
Copyright (c) 2000 Hewlett Packard Company
Copyright (c) 2009 Bradley Smith
Copyright (c) 2008 David Daney
Copyright (c) 2004 Simon Posnjak
Copyright (c) 2005 Axis Communications AB
Copyright (c) 1998 Cygnus Solutions
Copyright (c) 2004 Renesas Technology
Copyright (c) 2002, 2007 Bo Thorsen <bo@suse.de>
Copyright (c) 2002 Ranjit Mathew
Copyright (c) 2002 Roger Sayle
Copyright (c) 2000, 2007 Software AG
Copyright (c) 2003 Jakub Jelinek
Copyright (c) 2000, 2001 John Hornkvist
Copyright (c) 1998 Geoffrey Keating
Copyright (c) 2008 Björn König
License: Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
``Software''), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED ``AS IS'', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
Documentation:
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2, or (at your option) any
later version. A copy of the license is included in the
section entitled ``GNU General Public License''.
Files: Modules/_gestalt.c
Copyright: 1991-1997 by Stichting Mathematisch Centrum, Amsterdam.
License: Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the names of Stichting Mathematisch
Centrum or CWI not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Files: Modules/syslogmodule.c
Copyright: 1994 by Lance Ellinghouse
License: Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Lance Ellinghouse
not be used in advertising or publicity pertaining to distribution
of the software without specific, written prior permission.
LANCE ELLINGHOUSE DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL LANCE ELLINGHOUSE BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Files: Modules/zlib/*
Copyright: (C) 1995-2010 Jean-loup Gailly and Mark Adler
License: This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Jean-loup Gailly Mark Adler
jloup@gzip.org madler@alumni.caltech.edu
If you use the zlib library in a product, we would appreciate *not* receiving
lengthy legal documents to sign. The sources are provided for free but without
warranty of any kind. The library has been entirely written by Jean-loup
Gailly and Mark Adler; it does not include third-party code.
Files: Modules/expat/*
Copyright: Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd
and Clark Cooper
Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006 Expat maintainers
License: Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Files: Modules/_decimal/libmpdec/*
Copyright: Copyright (c) 2008-2012 Stefan Krah. All rights reserved.
License: Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
.
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
Files: Misc/python-mode.el
Copyright: Copyright (C) 1992,1993,1994 Tim Peters
License: This software is provided as-is, without express or implied
warranty. Permission to use, copy, modify, distribute or sell this
software, without fee, for any purpose and by any individual or
organization, is hereby granted, provided that the above copyright
notice and this paragraph appear in all copies.
Files: Python/dtoa.c
Copyright: (c) 1991, 2000, 2001 by Lucent Technologies.
License: Permission to use, copy, modify, and distribute this software for any
purpose without fee is hereby granted, provided that this entire notice
is included in all copies of any software which is or includes a copy
or modification of this software and in all copies of the supporting
documentation for such software.
THIS SOFTWARE IS BEING PROVIDED "AS IS", WITHOUT ANY EXPRESS OR IMPLIED
WARRANTY. IN PARTICULAR, NEITHER THE AUTHOR NOR LUCENT MAKES ANY
REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE MERCHANTABILITY
OF THIS SOFTWARE OR ITS FITNESS FOR ANY PARTICULAR PURPOSE.
Files: Python/getopt.c
Copyright: 1992-1994, David Gottner
License: Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice, this permission notice and
the following disclaimer notice appear unmodified in all copies.
I DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL I
BE LIABLE FOR ANY SPECIAL, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY
DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA, OR PROFITS, WHETHER
IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Files: PC/_subprocess.c
Copyright: Copyright (c) 2004 by Fredrik Lundh <fredrik@pythonware.com>
Copyright (c) 2004 by Secret Labs AB, http://www.pythonware.com
Copyright (c) 2004 by Peter Astrand <astrand@lysator.liu.se>
License:
* Permission to use, copy, modify, and distribute this software and
* its associated documentation for any purpose and without fee is
* hereby granted, provided that the above copyright notice appears in
* all copies, and that both that copyright notice and this permission
* notice appear in supporting documentation, and that the name of the
* authors not be used in advertising or publicity pertaining to
* distribution of the software without specific, written prior
* permission.
*
* THE AUTHORS DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
* INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.
* IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR
* CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
* OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
* NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
* WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Files: PC/winsound.c
Copyright: Copyright (c) 1999 Toby Dickenson
License: * Permission to use this software in any way is granted without
* fee, provided that the copyright notice above appears in all
* copies. This software is provided "as is" without any warranty.
*/
/* Modified by Guido van Rossum */
/* Beep added by Mark Hammond */
/* Win9X Beep and platform identification added by Uncle Timmy */
Files: Tools/pybench/*
Copyright: (c), 1997-2006, Marc-Andre Lemburg (mal@lemburg.com)
(c), 2000-2006, eGenix.com Software GmbH (info@egenix.com)
License: Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee or royalty is hereby
granted, provided that the above copyright notice appear in all copies
and that both that copyright notice and this permission notice appear
in supporting documentation or portions thereof, including
modifications, that you make.
THE AUTHOR MARC-ANDRE LEMBURG DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE USE OR PERFORMANCE OF THIS SOFTWARE !


@@ -0,0 +1,3 @@
opt/scylladb/python3/bin
opt/scylladb/python3/lib64
opt/scylladb/python3/libexec


@@ -0,0 +1,3 @@
bin/* opt/scylladb/python3/bin
lib64/* opt/scylladb/python3/lib64
libexec/* opt/scylladb/python3/libexec

dist/debian/python3/rules.mustache vendored Executable file

@@ -0,0 +1,22 @@
#!/usr/bin/make -f
export PYBUILD_DISABLE=1
override_dh_auto_configure:
override_dh_auto_build:
override_dh_strip:
override_dh_makeshlibs:
override_dh_shlibdeps:
override_dh_fixperms:
dh_fixperms
chmod 755 $(CURDIR)/debian/{{product}}-python3/opt/scylladb/python3/libexec/ld.so
override_dh_strip_nondeterminism:
%:
dh $@


@@ -9,12 +9,21 @@ override_dh_auto_build:
override_dh_auto_clean:
override_dh_auto_install:
dh_auto_install
override_dh_install:
dh_install
install -d $(CURDIR)/debian/scylla-server/usr/bin
for bin in debian/scylla-server/opt/scylladb/libexec/*; do debian/adjust_bin $(CURDIR)/debian/scylla-server "$${bin#*libexec/}"; done
ln -sf /opt/scylladb/bin/scylla $(CURDIR)/debian/scylla-server/usr/bin/scylla
ln -sf /opt/scylladb/bin/iotune $(CURDIR)/debian/scylla-server/usr/bin/iotune
ln -sf /usr/lib/scylla/scyllatop/scyllatop.py $(CURDIR)/debian/scylla-server/usr/bin/scyllatop
find ./dist/common/scripts -type f -exec ./relocate_python_scripts.py \
--installroot $(CURDIR)/debian/scylla-server/usr/lib/scylla/ --with-python3 "$(CURDIR)/debian/scylla-server/opt/scylladb/python3/bin/python3" {} +
./relocate_python_scripts.py \
--installroot $(CURDIR)/debian/scylla-server/usr/lib/scylla/ --with-python3 "$(CURDIR)/debian/scylla-server/opt/scylladb/python3/bin/python3" \
seastar/scripts/perftune.py seastar/scripts/seastar-addr2line seastar/scripts/perftune.py
./relocate_python_scripts.py \
--installroot $(CURDIR)/debian/scylla-server/usr/lib/scylla/scyllatop/ --with-python3 "$(CURDIR)/debian/scylla-server/opt/scylladb/python3/bin/python3" \
tools/scyllatop/scyllatop.py
override_dh_installinit:
{{#scylla}}
@@ -29,7 +38,9 @@ override_dh_installinit:
dh_installinit --no-start --name node-exporter
override_dh_strip:
dh_strip -Xlibprotobuf.so.15 -Xld.so --dbg-package={{product}}-server-dbg
# The binaries (ethtool...patchelf) don't pass dh_strip after going through patchelf. Since they are
# already stripped, nothing is lost if we exclude them, so that's what we do.
dh_strip -Xlibprotobuf.so.15 -Xld.so -Xethtool -Xgawk -Xgzip -Xhwloc-calc -Xhwloc-distrib -Xifconfig -Xlscpu -Xnetstat -Xpatchelf --dbg-package={{product}}-server-dbg
override_dh_makeshlibs:


@@ -1,14 +1,9 @@
dist/common/limits.d/scylla.conf etc/security/limits.d
dist/common/scylla.d/*.conf etc/scylla.d
seastar/dpdk/usertools/dpdk-devbind.py usr/lib/scylla
seastar/scripts/perftune.py usr/lib/scylla
seastar/scripts/seastar-addr2line usr/lib/scylla
seastar/scripts/seastar-cpu-map.sh usr/lib/scylla
dist/common/scripts/* usr/lib/scylla
tools/scyllatop usr/lib/scylla
swagger-ui/dist usr/lib/scylla/swagger-ui
api/api-doc usr/lib/scylla/api
bin/* opt/scylladb/bin
libreloc/* opt/scylladb/libreloc
libexec/* opt/scylladb/libexec
dist/common/sbin/* usr/sbin
@@ -20,3 +15,4 @@ dist/common/systemd/scylla-housekeeping-restart.timer /lib/systemd/system
dist/common/systemd/scylla-fstrim.timer /lib/systemd/system
dist/debian/scripts/scylla_save_coredump usr/lib/scylla
dist/debian/scripts/scylla_delay_fstrim usr/lib/scylla
tools/scyllatop usr/lib/scylla


@@ -28,7 +28,7 @@ ADD commandlineparser.py /commandlineparser.py
ADD docker-entrypoint.py /docker-entrypoint.py
ADD node_exporter_install /node_exporter_install
# Install Scylla:
RUN curl http://downloads.scylladb.com/rpm/unstable/centos/master/latest/scylla.repo -o /etc/yum.repos.d/scylla.repo && \
RUN curl http://downloads.scylladb.com/rpm/centos/scylla-3.1.repo -o /etc/yum.repos.d/scylla.repo && \
yum -y install epel-release && \
yum -y clean expire-cache && \
yum -y update && \


@@ -192,7 +192,11 @@ future<> verification_error(fs::path path, const char* fstr, Args&&... args) {
// No other file types may exist.
future<> distributed_loader::verify_owner_and_mode(fs::path path) {
return file_stat(path.string(), follow_symlink::no).then([path = std::move(path)] (stat_data sd) {
if (sd.uid != geteuid()) {
// Under docker, we run with euid 0 and there is no reasonable way to enforce that the
// in-container uid will have the same uid as files mounted from outside the container. So
// just allow euid 0 as a special case. It should survive the file_accessible() checks below.
// See #4823.
if (geteuid() != 0 && sd.uid != geteuid()) {
return verification_error(std::move(path), "File not owned by current euid: {}. Owner is: {}", geteuid(), sd.uid);
}
switch (sd.type) {
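The comment in the hunk above explains the docker special case: inside a container the process runs with euid 0, and there is no reasonable way to make in-container uids match the uids of files mounted from outside, so euid 0 bypasses the ownership check. A minimal Python sketch of the same rule (`verify_owner` is a hypothetical stand-in, not Scylla's function):

```python
def verify_owner(file_uid: int, euid: int) -> None:
    """Reject files not owned by the effective uid, except for euid 0:
    under docker we run as root, and host-mounted files cannot be
    expected to carry a matching in-container uid (see #4823)."""
    if euid != 0 and file_uid != euid:
        raise PermissionError(
            f"File not owned by current euid: {euid}. Owner is: {file_uid}")

verify_owner(file_uid=1000, euid=0)     # root: allowed as a special case
verify_owner(file_uid=1000, euid=1000)  # matching owner: allowed
```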


@@ -151,7 +151,7 @@ if __name__ == '__main__':
argp.add_argument('--user', '-u')
argp.add_argument('--password', '-p', default='none')
argp.add_argument('--node', default='127.0.0.1', help='Node to connect to.')
argp.add_argument('--port', default='9042', help='Port to connect to.')
argp.add_argument('--port', default=9042, help='Port to connect to.', type=int)
args = argp.parse_args()
res = validate_and_fix(args)
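The hunk above switches `--port` from a string default (`'9042'`) to an integer default with `type=int`; without `type=int`, argparse hands any user-supplied value to the script as a string even when the default is an int. A minimal standalone demonstration:

```python
import argparse

argp = argparse.ArgumentParser()
# type=int makes both the default and any user-supplied value integers;
# otherwise "--port 9142" would arrive as the string "9142".
argp.add_argument('--port', default=9042, help='Port to connect to.', type=int)

args = argp.parse_args(['--port', '9142'])
assert args.port == 9142 and isinstance(args.port, int)
```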


@@ -23,6 +23,7 @@
#include "mutation_reader.hh"
#include "seastar/util/reference_wrapper.hh"
#include "clustering_ranges_walker.hh"
#include "schema_upgrader.hh"
#include <algorithm>
#include <boost/range/adaptor/transformed.hpp>
@@ -908,3 +909,7 @@ public:
flat_mutation_reader make_generating_reader(schema_ptr s, std::function<future<mutation_fragment_opt> ()> get_next_fragment) {
return make_flat_mutation_reader<generating_reader>(std::move(s), std::move(get_next_fragment));
}
void flat_mutation_reader::do_upgrade_schema(const schema_ptr& s) {
*this = transform(std::move(*this), schema_upgrader(s));
}


@@ -326,6 +326,7 @@ private:
flat_mutation_reader() = default;
explicit operator bool() const noexcept { return bool(_impl); }
friend class optimized_optional<flat_mutation_reader>;
void do_upgrade_schema(const schema_ptr&);
public:
// Documented in mutation_reader::forwarding in mutation_reader.hh.
class partition_range_forwarding_tag;
@@ -474,6 +475,14 @@ public:
void move_buffer_content_to(impl& other) {
_impl->move_buffer_content_to(other);
}
// Causes this reader to conform to s.
// Multiple calls of upgrade_schema() compose, effects of prior calls on the stream are preserved.
void upgrade_schema(const schema_ptr& s) {
if (__builtin_expect(s != schema(), false)) {
do_upgrade_schema(s);
}
}
};
using flat_mutation_reader_opt = optimized_optional<flat_mutation_reader>;
@@ -576,8 +585,12 @@ class delegating_reader : public flat_mutation_reader::impl {
public:
delegating_reader(Underlying&& r) : impl(to_reference(r).schema()), _underlying(std::forward<Underlying>(r)) { }
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
return fill_buffer_from(to_reference(_underlying), timeout).then([this] (bool underlying_finished) {
_end_of_stream = underlying_finished;
if (is_buffer_full()) {
return make_ready_future<>();
}
return to_reference(_underlying).fill_buffer(timeout).then([this] {
_end_of_stream = to_reference(_underlying).is_end_of_stream();
to_reference(_underlying).move_buffer_content_to(*this);
});
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
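The rewritten `fill_buffer` above delegates to `fill_buffer_from`, which drains the underlying reader into this reader's buffer until the buffer is full or the underlying stream ends, and reports whether the underlying reader finished. A synchronous Python analogue of that contract (`Reader` and `fill_buffer_from` here are illustrative, not Scylla's API):

```python
class Reader:
    def __init__(self, items, max_buffer=4):
        self._items = list(items)   # fragments still "on disk"
        self.buffer = []            # fragments ready for the consumer
        self.max_buffer = max_buffer

    def is_buffer_full(self):
        return len(self.buffer) >= self.max_buffer

    @property
    def end_of_stream(self):
        return not self._items

    def fill_buffer(self):
        while self._items and not self.is_buffer_full():
            self.buffer.append(self._items.pop(0))


def fill_buffer_from(dest, underlying):
    """Fill dest's buffer from `underlying` until dest is full or the
    underlying reader is exhausted; return True when the underlying
    stream finished (mirrors the fill_buffer_from call in the hunk)."""
    while not dest.is_buffer_full():
        if underlying.buffer:
            dest.buffer.append(underlying.buffer.pop(0))
        elif underlying.end_of_stream:
            return True
        else:
            underlying.fill_buffer()
    return underlying.end_of_stream and not underlying.buffer


src = Reader(range(3))
dst = Reader([], max_buffer=8)
assert fill_buffer_from(dst, src) is True   # underlying exhausted
assert dst.buffer == [0, 1, 2]
```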


@@ -22,6 +22,7 @@
#pragma once
#include "clocks-impl.hh"
#include "hashing.hh"
#include <seastar/core/lowres_clock.hh>
@@ -71,3 +72,51 @@ using ttl_opt = std::optional<gc_clock::duration>;
static constexpr gc_clock::duration max_ttl = gc_clock::duration{20 * 365 * 24 * 60 * 60};
std::ostream& operator<<(std::ostream& os, gc_clock::time_point tp);
template<>
struct appending_hash<gc_clock::time_point> {
template<typename Hasher>
void operator()(Hasher& h, gc_clock::time_point t) const {
// Remain backwards-compatible with the 32-bit duration::rep (refs #4460).
uint64_t d64 = t.time_since_epoch().count();
feed_hash(h, uint32_t(d64 & 0xffff'ffff));
uint32_t msb = d64 >> 32;
if (msb) {
feed_hash(h, msb);
}
}
};
namespace ser {
// Forward-declaration - defined in serializer.hh, to avoid including it here.
template <typename Output>
void serialize_gc_clock_duration_value(Output& out, int64_t value);
template <typename Input>
int64_t deserialize_gc_clock_duration_value(Input& in);
template <typename T>
struct serializer;
template <>
struct serializer<gc_clock::duration> {
template <typename Input>
static gc_clock::duration read(Input& in) {
return gc_clock::duration(deserialize_gc_clock_duration_value(in));
}
template <typename Output>
static void write(Output& out, gc_clock::duration d) {
serialize_gc_clock_duration_value(out, d.count());
}
template <typename Input>
static void skip(Input& in) {
read(in);
}
};
}
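The `appending_hash` specialization in the hunk above keeps hashes stable for time points that fit the old 32-bit duration representation (refs #4460): it always feeds the low 32 bits and feeds the high 32 bits only when they are non-zero. A Python sketch of the fed sequence (the hash itself is elided; `feed_gc_clock_time` is a hypothetical name):

```python
def feed_gc_clock_time(feed, seconds_since_epoch):
    """Feed a 64-bit time point the way the C++ hunk does: low word
    always, high word only when non-zero, so values representable in
    the old 32-bit duration hash exactly as before (refs #4460)."""
    d64 = seconds_since_epoch & 0xFFFFFFFFFFFFFFFF
    feed(d64 & 0xFFFFFFFF)   # low 32 bits, always fed
    msb = d64 >> 32
    if msb:
        feed(msb)            # high 32 bits only when present

fed = []
feed_gc_clock_time(fed.append, 1_579_000_000)   # fits in 32 bits
assert fed == [1_579_000_000]                   # identical to the old hash input
```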


@@ -481,8 +481,7 @@ future<> gossiper::apply_state_locally(std::map<inet_address, endpoint_state> ma
int local_generation = local_ep_state_ptr.get_heart_beat_state().get_generation();
int remote_generation = remote_state.get_heart_beat_state().get_generation();
logger.trace("{} local generation {}, remote generation {}", ep, local_generation, remote_generation);
// A node was removed with nodetool removenode can have a generation of 2
if (local_generation > 2 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
if (remote_generation > service::get_generation_number() + MAX_GENERATION_DIFFERENCE) {
// assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",
ep, local_generation, remote_generation);
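The replacement check above bounds the remote generation against `service::get_generation_number()` — the generation a freshly restarted node would claim, conventionally the start time in epoch seconds — rather than against the peer's stored local generation, which can legitimately be tiny after `nodetool removenode`. A hedged Python sketch of the new rule (helper names hypothetical):

```python
import time

MAX_GENERATION_DIFFERENCE = 86400 * 365   # one year, as in the header hunk

def generation_number(now=None):
    """Stand-in for service::get_generation_number(): gossip generations
    are conventionally the node's start time in epoch seconds."""
    return int(now if now is not None else time.time())

def is_generation_believable(remote_generation, now=None):
    """Accept a remote generation only if it is not absurdly far ahead of
    what any freshly restarted node could legitimately broadcast."""
    return remote_generation <= generation_number(now) + MAX_GENERATION_DIFFERENCE

now = 1_579_000_000
assert is_generation_believable(now + 3600, now=now)        # normal restart
assert not is_generation_believable(now + 10 * 86400 * 365, now=now)  # corrupt peer
```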


@@ -160,7 +160,9 @@ public:
static constexpr std::chrono::milliseconds INTERVAL{1000};
static constexpr std::chrono::hours A_VERY_LONG_TIME{24 * 3};
/** Maximum difference in generation and version values we are willing to accept about a peer */
// Maximum difference between remote generation value and generation
// value this node would get if this node were restarted that we are
// willing to accept about a peer.
static constexpr int64_t MAX_GENERATION_DIFFERENCE = 86400 * 365;
std::chrono::milliseconds fat_client_timeout;


@@ -29,7 +29,7 @@ template <typename T> struct hasher_traits;
template <> struct hasher_traits<md5_hasher> { using impl_type = CryptoPP::Weak::MD5; };
template <> struct hasher_traits<sha256_hasher> { using impl_type = CryptoPP::SHA256; };
template <typename T, size_t size> struct hasher<T, size>::impl {
template <typename T, size_t size> struct cryptopp_hasher<T, size>::impl {
using impl_type = typename hasher_traits<T>::impl_type;
impl_type hash{};
@@ -53,35 +53,35 @@ template <typename T, size_t size> struct hasher<T, size>::impl {
}
};
template <typename T, size_t size> hasher<T, size>::hasher() : _impl(std::make_unique<impl>()) {}
template <typename T, size_t size> cryptopp_hasher<T, size>::cryptopp_hasher() : _impl(std::make_unique<impl>()) {}
template <typename T, size_t size> hasher<T, size>::~hasher() = default;
template <typename T, size_t size> cryptopp_hasher<T, size>::~cryptopp_hasher() = default;
template <typename T, size_t size> hasher<T, size>::hasher(hasher&& o) noexcept = default;
template <typename T, size_t size> cryptopp_hasher<T, size>::cryptopp_hasher(cryptopp_hasher&& o) noexcept = default;
template <typename T, size_t size> hasher<T, size>::hasher(const hasher& o) : _impl(std::make_unique<hasher<T, size>::impl>(*o._impl)) {}
template <typename T, size_t size> cryptopp_hasher<T, size>::cryptopp_hasher(const cryptopp_hasher& o) : _impl(std::make_unique<cryptopp_hasher<T, size>::impl>(*o._impl)) {}
template <typename T, size_t size> hasher<T, size>& hasher<T, size>::operator=(hasher&& o) noexcept = default;
template <typename T, size_t size> cryptopp_hasher<T, size>& cryptopp_hasher<T, size>::operator=(cryptopp_hasher&& o) noexcept = default;
template <typename T, size_t size> hasher<T, size>& hasher<T, size>::operator=(const hasher& o) {
_impl = std::make_unique<hasher<T, size>::impl>(*o._impl);
template <typename T, size_t size> cryptopp_hasher<T, size>& cryptopp_hasher<T, size>::operator=(const cryptopp_hasher& o) {
_impl = std::make_unique<cryptopp_hasher<T, size>::impl>(*o._impl);
return *this;
}
template <typename T, size_t size> bytes hasher<T, size>::finalize() { return _impl->finalize(); }
template <typename T, size_t size> bytes cryptopp_hasher<T, size>::finalize() { return _impl->finalize(); }
template <typename T, size_t size> std::array<uint8_t, size> hasher<T, size>::finalize_array() {
template <typename T, size_t size> std::array<uint8_t, size> cryptopp_hasher<T, size>::finalize_array() {
return _impl->finalize_array();
}
template <typename T, size_t size> void hasher<T, size>::update(const char* ptr, size_t length) { _impl->update(ptr, length); }
template <typename T, size_t size> void cryptopp_hasher<T, size>::update(const char* ptr, size_t length) { _impl->update(ptr, length); }
template <typename T, size_t size> bytes hasher<T, size>::calculate(const std::string_view& s) {
typename hasher<T, size>::impl::impl_type hash;
template <typename T, size_t size> bytes cryptopp_hasher<T, size>::calculate(const std::string_view& s) {
typename cryptopp_hasher<T, size>::impl::impl_type hash;
unsigned char digest[size];
hash.CalculateDigest(digest, reinterpret_cast<const unsigned char*>(s.data()), s.size());
return std::move(bytes{reinterpret_cast<const int8_t*>(digest), size});
}
template class hasher<md5_hasher, 16>;
template class hasher<sha256_hasher, 32>;
template class cryptopp_hasher<md5_hasher, 16>;
template class cryptopp_hasher<sha256_hasher, 32>;


@@ -22,29 +22,30 @@
#pragma once
#include "bytes.hh"
#include "hashing.hh"
class md5_hasher;
template <typename T, size_t size> class hasher {
template <typename T, size_t size> class cryptopp_hasher : public hasher {
struct impl;
std::unique_ptr<impl> _impl;
public:
hasher();
~hasher();
hasher(hasher&&) noexcept;
hasher(const hasher&);
hasher& operator=(hasher&&) noexcept;
hasher& operator=(const hasher&);
cryptopp_hasher();
~cryptopp_hasher();
cryptopp_hasher(cryptopp_hasher&&) noexcept;
cryptopp_hasher(const cryptopp_hasher&);
cryptopp_hasher& operator=(cryptopp_hasher&&) noexcept;
cryptopp_hasher& operator=(const cryptopp_hasher&);
bytes finalize();
std::array<uint8_t, size> finalize_array();
void update(const char* ptr, size_t length);
void update(const char* ptr, size_t length) override;
// Use update and finalize to compute the hash over the full view.
static bytes calculate(const std::string_view& s);
};
class md5_hasher : public hasher<md5_hasher, 16> {};
class md5_hasher final : public cryptopp_hasher<md5_hasher, 16> {};
class sha256_hasher : public hasher<sha256_hasher, 32> {};
class sha256_hasher final : public cryptopp_hasher<sha256_hasher, 32> {};


@@ -27,6 +27,7 @@
#include <seastar/core/byteorder.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include <seastar/util/gcc6-concepts.hh>
//
// This hashing differs from std::hash<> in that it decouples knowledge about
@@ -41,24 +42,38 @@
// appending_hash<T> is machine-independent.
//
// The Hasher concept
struct Hasher {
void update(const char* ptr, size_t size);
GCC6_CONCEPT(
template<typename H>
concept bool Hasher() {
return requires(H& h, const char* ptr, size_t size) {
{ h.update(ptr, size) } -> void
};
}
)
class hasher {
public:
virtual ~hasher() = default;
virtual void update(const char* ptr, size_t size) = 0;
};
GCC6_CONCEPT(static_assert(Hasher<hasher>());)
template<typename T, typename Enable = void>
struct appending_hash;
template<typename Hasher, typename T, typename... Args>
template<typename H, typename T, typename... Args>
GCC6_CONCEPT(requires Hasher<H>())
inline
void feed_hash(Hasher& h, const T& value, Args&&... args) {
void feed_hash(H& h, const T& value, Args&&... args) {
appending_hash<T>()(h, value, std::forward<Args>(args)...);
};
template<typename T>
struct appending_hash<T, std::enable_if_t<std::is_arithmetic<T>::value>> {
template<typename Hasher>
void operator()(Hasher& h, T value) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, T value) const {
auto value_le = cpu_to_le(value);
h.update(reinterpret_cast<const char*>(&value_le), sizeof(T));
}
@@ -66,24 +81,27 @@ struct appending_hash<T, std::enable_if_t<std::is_arithmetic<T>::value>> {
template<>
struct appending_hash<bool> {
template<typename Hasher>
void operator()(Hasher& h, bool value) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, bool value) const {
feed_hash(h, static_cast<uint8_t>(value));
}
};
template<typename T>
struct appending_hash<T, std::enable_if_t<std::is_enum<T>::value>> {
template<typename Hasher>
void operator()(Hasher& h, const T& value) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, const T& value) const {
feed_hash(h, static_cast<std::underlying_type_t<T>>(value));
}
};
template<typename T>
struct appending_hash<std::optional<T>> {
template<typename Hasher>
void operator()(Hasher& h, const std::optional<T>& value) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, const std::optional<T>& value) const {
if (value) {
feed_hash(h, true);
feed_hash(h, *value);
@@ -95,8 +113,9 @@ struct appending_hash<std::optional<T>> {
template<size_t N>
struct appending_hash<char[N]> {
template<typename Hasher>
void operator()(Hasher& h, const char (&value) [N]) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, const char (&value) [N]) const {
feed_hash(h, N);
h.update(value, N);
}
@@ -104,8 +123,9 @@ struct appending_hash<char[N]> {
template<typename T>
struct appending_hash<std::vector<T>> {
template<typename Hasher>
void operator()(Hasher& h, const std::vector<T>& value) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, const std::vector<T>& value) const {
feed_hash(h, value.size());
for (auto&& v : value) {
appending_hash<T>()(h, v);
@@ -115,8 +135,9 @@ struct appending_hash<std::vector<T>> {
template<typename K, typename V>
struct appending_hash<std::map<K, V>> {
template<typename Hasher>
void operator()(Hasher& h, const std::map<K, V>& value) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, const std::map<K, V>& value) const {
feed_hash(h, value.size());
for (auto&& e : value) {
appending_hash<K>()(h, e.first);
@@ -127,8 +148,9 @@ struct appending_hash<std::map<K, V>> {
template<>
struct appending_hash<sstring> {
template<typename Hasher>
void operator()(Hasher& h, const sstring& v) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, const sstring& v) const {
feed_hash(h, v.size());
h.update(reinterpret_cast<const char*>(v.cbegin()), v.size() * sizeof(sstring::value_type));
}
@@ -136,8 +158,9 @@ struct appending_hash<sstring> {
template<>
struct appending_hash<std::string> {
template<typename Hasher>
void operator()(Hasher& h, const std::string& v) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, const std::string& v) const {
feed_hash(h, v.size());
h.update(reinterpret_cast<const char*>(v.data()), v.size() * sizeof(std::string::value_type));
}
@@ -145,16 +168,18 @@ struct appending_hash<std::string> {
template<typename T, typename R>
struct appending_hash<std::chrono::duration<T, R>> {
template<typename Hasher>
void operator()(Hasher& h, std::chrono::duration<T, R> v) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, std::chrono::duration<T, R> v) const {
feed_hash(h, v.count());
}
};
template<typename Clock, typename Duration>
struct appending_hash<std::chrono::time_point<Clock, Duration>> {
template<typename Hasher>
void operator()(Hasher& h, std::chrono::time_point<Clock, Duration> v) const {
template<typename H>
GCC6_CONCEPT(requires Hasher<H>())
void operator()(H& h, std::chrono::time_point<Clock, Duration> v) const {
feed_hash(h, v.time_since_epoch().count());
}
};


@@ -26,6 +26,6 @@ class partition {
class reconcilable_result {
uint32_t row_count();
std::vector<partition> partitions();
utils::chunked_vector<partition> partitions();
query::short_read is_short_read() [[version 1.6]] = query::short_read::no;
};


@@ -51,4 +51,10 @@ enum class stream_reason : uint8_t {
repair,
};
enum class stream_mutation_fragments_cmd : uint8_t {
error,
mutation_fragment_data,
end_of_stream,
};
}


@@ -181,4 +181,10 @@ bool secondary_index_manager::is_index(const schema& s) const {
});
}
bool secondary_index_manager::is_global_index(const schema& s) const {
return boost::algorithm::any_of(_indices | boost::adaptors::map_values, [&s] (const index& i) {
return !i.metadata().local() && s.cf_name() == index_table_name(i.metadata().name());
});
}
}


@@ -77,6 +77,7 @@ public:
std::vector<index> list_indexes() const;
bool is_index(view_ptr) const;
bool is_index(const schema& s) const;
bool is_global_index(const schema& s) const;
private:
void add_index(const index_metadata& im);
};


@@ -155,6 +155,10 @@ void init_ms_fd_gossiper(sharded<gms::gossiper>& gossiper
to_string(seeds), listen_address_in, broadcast_address);
throw std::runtime_error("Use broadcast_address for seeds list");
}
if ((!cfg.replace_address_first_boot().empty() || !cfg.replace_address().empty()) && seeds.count(broadcast_address)) {
startlog.error("Bad configuration: replace-address and replace-address-first-boot are not allowed for seed nodes");
throw bad_configuration_error();
}
gossiper.local().set_seeds(seeds);
gossiper.invoke_on_all([cluster_name](gms::gossiper& g) {
g.set_cluster_name(cluster_name);


@@ -75,6 +75,29 @@ while [ $# -gt 0 ]; do
esac
done
patchelf() {
# patchelf comes from the build system, so it needs the build system's ld.so and
# shared libraries. We can't use patchelf on patchelf itself, so invoke it via
# ld.so.
LD_LIBRARY_PATH="$PWD/libreloc" libreloc/ld.so libexec/patchelf "$@"
}
adjust_bin() {
local bin="$1"
# We could add --set-rpath too, but then debugedit (called by rpmbuild) barfs
# on the result. So use LD_LIBRARY_PATH in the thunk, below.
patchelf \
--set-interpreter "/opt/scylladb/libreloc/ld.so" \
"$root/opt/scylladb/libexec/$bin"
cat > "$root/opt/scylladb/bin/$bin" <<EOF
#!/bin/bash -e
export GNUTLS_SYSTEM_PRIORITY_FILE="\${GNUTLS_SYSTEM_PRIORITY_FILE-/opt/scylladb/libreloc/gnutls.config}"
export LD_LIBRARY_PATH="/opt/scylladb/libreloc"
exec -a "\$0" "/opt/scylladb/libexec/$bin" "\$@"
EOF
chmod +x "$root/opt/scylladb/bin/$bin"
}
rprefix="$root/$prefix"
retc="$root/etc"
rdoc="$rprefix/share/doc"
@@ -105,16 +128,13 @@ install -m644 dist/common/systemd/*.service -Dt "$rprefix"/lib/systemd/system
install -m644 dist/common/systemd/*.timer -Dt "$rprefix"/lib/systemd/system
install -m755 seastar/scripts/seastar-cpu-map.sh -Dt "$rprefix"/lib/scylla/
install -m755 seastar/dpdk/usertools/dpdk-devbind.py -Dt "$rprefix"/lib/scylla/
install -m755 bin/* -Dt "$root/opt/scylladb/bin"
install -m755 libreloc/* -Dt "$root/opt/scylladb/libreloc"
# some files in libexec are symlinks, which "install" dereferences
# use cp -P for the symlinks instead.
install -m755 libexec/*.bin -Dt "$root/opt/scylladb/libexec"
for f in libexec/*; do
if [[ "$f" != *.bin ]]; then
cp -P "$f" "$root/opt/scylladb/libexec"
fi
install -m755 libexec/* -Dt "$root/opt/scylladb/libexec"
for bin in libexec/*; do
adjust_bin "${bin#libexec/}"
done
install -m755 libreloc/* -Dt "$root/opt/scylladb/libreloc"
ln -srf "$root/opt/scylladb/bin/scylla" "$rprefix/bin/scylla"
ln -srf "$root/opt/scylladb/bin/iotune" "$rprefix/bin/iotune"
ln -srf "$rprefix/lib/scylla/scyllatop/scyllatop.py" "$rprefix/bin/scyllatop"

main.cc

@@ -69,6 +69,7 @@
#include "sstables/sstables.hh"
#include "gms/feature_service.hh"
#include "distributed_loader.hh"
#include "serializer.hh"
namespace fs = std::filesystem;
@@ -340,15 +341,7 @@ int main(int ac, char** av) {
auto cfg = make_lw_shared<db::config>(ext);
auto init = app.get_options_description().add_options();
// If --version is requested, print it out and exit immediately to avoid
// Seastar-specific warnings that may occur when running the app
init("version", bpo::bool_switch(), "print version number and exit");
bpo::variables_map vm;
bpo::store(bpo::command_line_parser(ac, av).options(app.get_options_description()).allow_unregistered().run(), vm);
if (vm["version"].as<bool>()) {
fmt::print("{}\n", scylla_version());
return 0;
}
bpo::options_description deprecated("Deprecated options - ignored");
deprecated.add_options()
@@ -362,6 +355,15 @@ int main(int ac, char** av) {
configurable::append_all(*cfg, init);
cfg->add_options(init);
// If --version is requested, print it out and exit immediately to avoid
// Seastar-specific warnings that may occur when running the app
bpo::variables_map vm;
bpo::store(bpo::command_line_parser(ac, av).options(app.get_options_description()).allow_unregistered().run(), vm);
if (vm["version"].as<bool>()) {
fmt::print("{}\n", scylla_version());
return 0;
}
distributed<database> db;
seastar::sharded<service::cache_hitrate_calculator> cf_cache_hitrate_calculator;
debug::db = &db;
@@ -407,6 +409,11 @@ int main(int ac, char** av) {
read_config(opts, *cfg).get();
configurable::init_all(opts, *cfg, *ext).get();
// We're writing to a non-atomic variable here. But bool writes are atomic
// in all supported architectures, and some broadcast or other below
// will apply the required memory barriers anyway.
ser::gc_clock_using_3_1_0_serialization = cfg->enable_3_1_0_compatibility_mode();
logalloc::prime_segment_pool(memory::stats().total_memory(), memory::min_free_memory()).get();
logging::apply_settings(cfg->logging_settings(opts));
@@ -526,6 +533,9 @@ int main(int ac, char** av) {
if (opts.count("developer-mode")) {
smp::invoke_on_all([] { engine().set_strict_dma(false); }).get();
}
set_abort_on_internal_error(cfg->abort_on_internal_error());
supervisor::notify("creating tracing");
tracing::backend_registry tracing_backend_registry;
tracing::register_tracing_keyspace_backend(tracing_backend_registry);
@@ -916,8 +926,10 @@ int main(int ac, char** av) {
service::get_local_storage_service().drain_on_shutdown().get();
});
auto stop_view_builder = defer([] {
view_builder.stop().get();
auto stop_view_builder = defer([cfg] {
if (cfg->view_building()) {
view_builder.stop().get();
}
});
auto stop_compaction_manager = defer([&db] {


@@ -23,7 +23,6 @@
#include "database.hh"
#include "frozen_mutation.hh"
#include "partition_snapshot_reader.hh"
#include "schema_upgrader.hh"
#include "partition_builder.hh"
void memtable::memtable_encoding_stats_collector::update_timestamp(api::timestamp_type ts) {
@@ -429,11 +428,8 @@ public:
bool digest_requested = _slice.options.contains<query::partition_slice::option::with_digest>();
auto mpsr = make_partition_snapshot_flat_reader(snp_schema, std::move(key_and_snp->first), std::move(cr),
std::move(key_and_snp->second), digest_requested, region(), read_section(), mtbl(), streamed_mutation::forwarding::no);
if (snp_schema->version() != schema()->version()) {
_delegate = transform(std::move(mpsr), schema_upgrader(schema()));
} else {
_delegate = std::move(mpsr);
}
mpsr.upgrade_schema(schema());
_delegate = std::move(mpsr);
} else {
_end_of_stream = true;
}
@@ -588,11 +584,8 @@ private:
auto snp_schema = key_and_snp->second->schema();
auto mpsr = make_partition_snapshot_flat_reader<partition_snapshot_accounter>(snp_schema, std::move(key_and_snp->first), std::move(cr),
std::move(key_and_snp->second), false, region(), read_section(), mtbl(), streamed_mutation::forwarding::no, *snp_schema, _flushed_memory);
if (snp_schema->version() != schema()->version()) {
_partition_reader = transform(std::move(mpsr), schema_upgrader(schema()));
} else {
_partition_reader = std::move(mpsr);
}
mpsr.upgrade_schema(schema());
_partition_reader = std::move(mpsr);
}
}
public:
@@ -668,11 +661,8 @@ memtable::make_flat_reader(schema_ptr s,
bool digest_requested = slice.options.contains<query::partition_slice::option::with_digest>();
auto rd = make_partition_snapshot_flat_reader(snp_schema, std::move(dk), std::move(cr), std::move(snp), digest_requested,
*this, _read_section, shared_from_this(), fwd);
if (snp_schema->version() != s->version()) {
return transform(std::move(rd), schema_upgrader(s));
} else {
return rd;
}
rd.upgrade_schema(s);
return rd;
} else {
auto res = make_flat_mutation_reader<scanning_reader>(std::move(s), shared_from_this(), range, slice, pc, fwd_mr);
if (fwd == streamed_mutation::forwarding::yes) {
@@ -787,13 +777,19 @@ bool memtable::is_flushed() const {
return bool(_underlying);
}
void memtable_entry::upgrade_schema(const schema_ptr& s, mutation_cleaner& cleaner) {
if (_schema != s) {
partition().upgrade(_schema, s, cleaner, no_cache_tracker);
_schema = s;
}
}
void memtable::upgrade_entry(memtable_entry& e) {
if (e._schema != _schema) {
assert(!reclaiming_enabled());
with_allocator(allocator(), [this, &e] {
with_linearized_managed_bytes([&] {
e.partition().upgrade(e._schema, _schema, cleaner(), no_cache_tracker);
e._schema = _schema;
e.upgrade_schema(_schema, cleaner());
});
});
}


@@ -69,6 +69,10 @@ public:
schema_ptr& schema() { return _schema; }
partition_snapshot_ptr snapshot(memtable& mtbl);
// Makes the entry conform to given schema.
// Must be called under allocating section of the region which owns the entry.
void upgrade_schema(const schema_ptr&, mutation_cleaner&);
size_t external_memory_usage_without_rows() const {
return _key.key().external_memory_usage();
}


@@ -89,6 +89,7 @@
#include "frozen_mutation.hh"
#include "flat_mutation_reader.hh"
#include "streaming/stream_manager.hh"
#include "streaming/stream_mutation_fragments_cmd.hh"
namespace netw {
@@ -287,7 +288,6 @@ void messaging_service::start_listen() {
if (_compress_what != compress_what::none) {
so.compressor_factory = &compressor_factory;
}
so.streaming_domain = rpc::streaming_domain_type(0x55AA);
so.load_balancing_algorithm = server_socket::load_balancing_algorithm::port;
// FIXME: we don't set so.tcp_nodelay, because we can't tell at this point whether the connection will come from a
@@ -295,19 +295,21 @@ void messaging_service::start_listen() {
// the first by wrapping its server_socket, but not the second.
auto limits = rpc_resource_limits(_mcfg.rpc_memory_limit);
if (!_server[0]) {
auto listen = [&] (const gms::inet_address& a) {
auto listen = [&] (const gms::inet_address& a, rpc::streaming_domain_type sdomain) {
so.streaming_domain = sdomain;
auto addr = ipv4_addr{a.raw_addr(), _port};
return std::unique_ptr<rpc_protocol_server_wrapper>(new rpc_protocol_server_wrapper(*_rpc,
so, addr, limits));
};
_server[0] = listen(_listen_address);
_server[0] = listen(_listen_address, rpc::streaming_domain_type(0x55AA));
if (listen_to_bc) {
_server[1] = listen(utils::fb_utilities::get_broadcast_address());
_server[1] = listen(utils::fb_utilities::get_broadcast_address(), rpc::streaming_domain_type(0x66BB));
}
}
if (!_server_tls[0]) {
auto listen = [&] (const gms::inet_address& a) {
auto listen = [&] (const gms::inet_address& a, rpc::streaming_domain_type sdomain) {
so.streaming_domain = sdomain;
return std::unique_ptr<rpc_protocol_server_wrapper>(
[this, &so, &a, limits] () -> std::unique_ptr<rpc_protocol_server_wrapper>{
if (_encrypt_what == encrypt_what::none) {
@@ -321,9 +323,9 @@ void messaging_service::start_listen() {
so, seastar::tls::listen(_credentials, addr, lo), limits);
}());
};
_server_tls[0] = listen(_listen_address);
_server_tls[0] = listen(_listen_address, rpc::streaming_domain_type(0x77CC));
if (listen_to_bc) {
_server_tls[1] = listen(utils::fb_utilities::get_broadcast_address());
_server_tls[1] = listen(utils::fb_utilities::get_broadcast_address(), rpc::streaming_domain_type(0x88DD));
}
}
// Do this on just cpu 0, to avoid duplicate logs.
@@ -607,6 +609,7 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
opts.compressor_factory = &compressor_factory;
}
opts.tcp_nodelay = must_tcp_nodelay;
opts.reuseaddr = true;
auto client = must_encrypt ?
::make_shared<rpc_protocol_client_wrapper>(*_rpc, std::move(opts),
@@ -668,24 +671,24 @@ std::unique_ptr<messaging_service::rpc_protocol_wrapper>& messaging_service::rpc
return _rpc;
}
rpc::sink<int32_t> messaging_service::make_sink_for_stream_mutation_fragments(rpc::source<frozen_mutation_fragment>& source) {
rpc::sink<int32_t> messaging_service::make_sink_for_stream_mutation_fragments(rpc::source<frozen_mutation_fragment, rpc::optional<streaming::stream_mutation_fragments_cmd>>& source) {
return source.make_sink<netw::serializer, int32_t>();
}
future<rpc::sink<frozen_mutation_fragment>, rpc::source<int32_t>>
future<rpc::sink<frozen_mutation_fragment, streaming::stream_mutation_fragments_cmd>, rpc::source<int32_t>>
messaging_service::make_sink_and_source_for_stream_mutation_fragments(utils::UUID schema_id, utils::UUID plan_id, utils::UUID cf_id, uint64_t estimated_partitions, streaming::stream_reason reason, msg_addr id) {
auto rpc_client = get_rpc_client(messaging_verb::STREAM_MUTATION_FRAGMENTS, id);
return rpc_client->make_stream_sink<netw::serializer, frozen_mutation_fragment>().then([this, plan_id, schema_id, cf_id, estimated_partitions, reason, rpc_client] (rpc::sink<frozen_mutation_fragment> sink) mutable {
auto rpc_handler = rpc()->make_client<rpc::source<int32_t> (utils::UUID, utils::UUID, utils::UUID, uint64_t, streaming::stream_reason, rpc::sink<frozen_mutation_fragment>)>(messaging_verb::STREAM_MUTATION_FRAGMENTS);
return rpc_client->make_stream_sink<netw::serializer, frozen_mutation_fragment, streaming::stream_mutation_fragments_cmd>().then([this, plan_id, schema_id, cf_id, estimated_partitions, reason, rpc_client] (rpc::sink<frozen_mutation_fragment, streaming::stream_mutation_fragments_cmd> sink) mutable {
auto rpc_handler = rpc()->make_client<rpc::source<int32_t> (utils::UUID, utils::UUID, utils::UUID, uint64_t, streaming::stream_reason, rpc::sink<frozen_mutation_fragment, streaming::stream_mutation_fragments_cmd>)>(messaging_verb::STREAM_MUTATION_FRAGMENTS);
return rpc_handler(*rpc_client , plan_id, schema_id, cf_id, estimated_partitions, reason, sink).then_wrapped([sink, rpc_client] (future<rpc::source<int32_t>> source) mutable {
return (source.failed() ? sink.close() : make_ready_future<>()).then([sink = std::move(sink), source = std::move(source)] () mutable {
return make_ready_future<rpc::sink<frozen_mutation_fragment>, rpc::source<int32_t>>(std::move(sink), std::move(source.get0()));
return make_ready_future<rpc::sink<frozen_mutation_fragment, streaming::stream_mutation_fragments_cmd>, rpc::source<int32_t>>(std::move(sink), std::move(source.get0()));
});
});
});
}
void messaging_service::register_stream_mutation_fragments(std::function<future<rpc::sink<int32_t>> (const rpc::client_info& cinfo, UUID plan_id, UUID schema_id, UUID cf_id, uint64_t estimated_partitions, rpc::optional<streaming::stream_reason>, rpc::source<frozen_mutation_fragment> source)>&& func) {
void messaging_service::register_stream_mutation_fragments(std::function<future<rpc::sink<int32_t>> (const rpc::client_info& cinfo, UUID plan_id, UUID schema_id, UUID cf_id, uint64_t estimated_partitions, rpc::optional<streaming::stream_reason>, rpc::source<frozen_mutation_fragment, rpc::optional<streaming::stream_mutation_fragments_cmd>> source)>&& func) {
register_handler(this, messaging_verb::STREAM_MUTATION_FRAGMENTS, std::move(func));
}
@@ -1077,14 +1080,14 @@ future<> messaging_service::send_repair_put_row_diff(msg_addr id, uint32_t repai
}
// Wrapper for REPAIR_ROW_LEVEL_START
void messaging_service::register_repair_row_level_start(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name)>&& func) {
void messaging_service::register_repair_row_level_start(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version)>&& func) {
register_handler(this, messaging_verb::REPAIR_ROW_LEVEL_START, std::move(func));
}
void messaging_service::unregister_repair_row_level_start() {
_rpc->unregister_handler(messaging_verb::REPAIR_ROW_LEVEL_START);
}
future<> messaging_service::send_repair_row_level_start(msg_addr id, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name) {
return send_message<void>(this, messaging_verb::REPAIR_ROW_LEVEL_START, std::move(id), repair_meta_id, std::move(keyspace_name), std::move(cf_name), std::move(range), algo, max_row_buf_size, seed, remote_shard, remote_shard_count, remote_ignore_msb, std::move(remote_partitioner_name));
future<> messaging_service::send_repair_row_level_start(msg_addr id, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version) {
return send_message<void>(this, messaging_verb::REPAIR_ROW_LEVEL_START, std::move(id), repair_meta_id, std::move(keyspace_name), std::move(cf_name), std::move(range), algo, max_row_buf_size, seed, remote_shard, remote_shard_count, remote_ignore_msb, std::move(remote_partitioner_name), std::move(schema_version));
}
// Wrapper for REPAIR_ROW_LEVEL_STOP


@@ -36,6 +36,7 @@
#include "tracing/tracing.hh"
#include "digest_algorithm.hh"
#include "streaming/stream_reason.hh"
#include "streaming/stream_mutation_fragments_cmd.hh"
#include "cache_temperature.hh"
#include <list>
@@ -270,9 +271,9 @@ public:
// Wrapper for STREAM_MUTATION_FRAGMENTS
// The receiver of STREAM_MUTATION_FRAGMENTS sends status code to the sender to notify any error on the receiver side. The status code is of type int32_t. 0 means successful, -1 means error, other status code value are reserved for future use.
void register_stream_mutation_fragments(std::function<future<rpc::sink<int32_t>> (const rpc::client_info& cinfo, UUID plan_id, UUID schema_id, UUID cf_id, uint64_t estimated_partitions, rpc::optional<streaming::stream_reason> reason_opt, rpc::source<frozen_mutation_fragment> source)>&& func);
rpc::sink<int32_t> make_sink_for_stream_mutation_fragments(rpc::source<frozen_mutation_fragment>& source);
future<rpc::sink<frozen_mutation_fragment>, rpc::source<int32_t>> make_sink_and_source_for_stream_mutation_fragments(utils::UUID schema_id, utils::UUID plan_id, utils::UUID cf_id, uint64_t estimated_partitions, streaming::stream_reason reason, msg_addr id);
void register_stream_mutation_fragments(std::function<future<rpc::sink<int32_t>> (const rpc::client_info& cinfo, UUID plan_id, UUID schema_id, UUID cf_id, uint64_t estimated_partitions, rpc::optional<streaming::stream_reason> reason_opt, rpc::source<frozen_mutation_fragment, rpc::optional<streaming::stream_mutation_fragments_cmd>> source)>&& func);
rpc::sink<int32_t> make_sink_for_stream_mutation_fragments(rpc::source<frozen_mutation_fragment, rpc::optional<streaming::stream_mutation_fragments_cmd>>& source);
future<rpc::sink<frozen_mutation_fragment, streaming::stream_mutation_fragments_cmd>, rpc::source<int32_t>> make_sink_and_source_for_stream_mutation_fragments(utils::UUID schema_id, utils::UUID plan_id, utils::UUID cf_id, uint64_t estimated_partitions, streaming::stream_reason reason, msg_addr id);
void register_stream_mutation_done(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id)>&& func);
future<> send_stream_mutation_done(msg_addr id, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id);
@@ -311,9 +312,9 @@ public:
future<> send_repair_put_row_diff(msg_addr id, uint32_t repair_meta_id, repair_rows_on_wire row_diff);
// Wrapper for REPAIR_ROW_LEVEL_START
void register_repair_row_level_start(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name)>&& func);
void register_repair_row_level_start(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version)>&& func);
void unregister_repair_row_level_start();
future<> send_repair_row_level_start(msg_addr id, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name);
future<> send_repair_row_level_start(msg_addr id, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version);
// Wrapper for REPAIR_ROW_LEVEL_STOP
void register_repair_row_level_stop(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range)>&& func);


@@ -145,7 +145,14 @@ mutation_partition::mutation_partition(const schema& s, const mutation_partition
, _static_row(s, column_kind::static_column, x._static_row)
, _static_row_continuous(x._static_row_continuous)
, _rows()
, _row_tombstones(x._row_tombstones) {
, _row_tombstones(x._row_tombstones)
#ifdef SEASTAR_DEBUG
, _schema_version(s.version())
#endif
{
#ifdef SEASTAR_DEBUG
assert(x._schema_version == _schema_version);
#endif
auto cloner = [&s] (const auto& x) {
return current_allocator().construct<rows_entry>(s, x);
};
@@ -158,7 +165,14 @@ mutation_partition::mutation_partition(const mutation_partition& x, const schema
, _static_row(schema, column_kind::static_column, x._static_row)
, _static_row_continuous(x._static_row_continuous)
, _rows()
, _row_tombstones(x._row_tombstones, range_tombstone_list::copy_comparator_only()) {
, _row_tombstones(x._row_tombstones, range_tombstone_list::copy_comparator_only())
#ifdef SEASTAR_DEBUG
, _schema_version(schema.version())
#endif
{
#ifdef SEASTAR_DEBUG
assert(x._schema_version == _schema_version);
#endif
try {
for(auto&& r : ck_ranges) {
for (const rows_entry& e : x.range(schema, r)) {
@@ -181,7 +195,13 @@ mutation_partition::mutation_partition(mutation_partition&& x, const schema& sch
, _static_row_continuous(x._static_row_continuous)
, _rows(std::move(x._rows))
, _row_tombstones(std::move(x._row_tombstones))
#ifdef SEASTAR_DEBUG
, _schema_version(schema.version())
#endif
{
#ifdef SEASTAR_DEBUG
assert(x._schema_version == _schema_version);
#endif
{
auto deleter = current_deleter<rows_entry>();
auto it = _rows.begin();
@@ -221,6 +241,7 @@ mutation_partition::operator=(mutation_partition&& x) noexcept {
}
void mutation_partition::ensure_last_dummy(const schema& s) {
check_schema(s);
if (_rows.empty() || !_rows.rbegin()->is_last_dummy()) {
_rows.insert_before(_rows.end(),
*current_allocator().construct<rows_entry>(s, rows_entry::last_dummy_tag(), is_continuous::yes));
@@ -277,11 +298,16 @@ void deletable_row::apply(const schema& s, clustering_row cr) {
void
mutation_partition::apply(const schema& s, const mutation_fragment& mf) {
check_schema(s);
mutation_fragment_applier applier{s, *this};
mf.visit(applier);
}
stop_iteration mutation_partition::apply_monotonically(const schema& s, mutation_partition&& p, cache_tracker* tracker, is_preemptible preemptible) {
#ifdef SEASTAR_DEBUG
assert(s.version() == _schema_version);
assert(p._schema_version == _schema_version);
#endif
_tombstone.apply(p._tombstone);
_static_row.apply_monotonically(s, column_kind::static_column, std::move(p._static_row));
_static_row_continuous |= p._static_row_continuous;
@@ -387,6 +413,7 @@ void mutation_partition::apply_weak(const schema& s, mutation_partition&& p) {
tombstone
mutation_partition::range_tombstone_for_row(const schema& schema, const clustering_key& key) const {
check_schema(schema);
tombstone t = _tombstone;
if (!_row_tombstones.empty()) {
auto found = _row_tombstones.search_tombstone_covering(schema, key);
@@ -397,6 +424,7 @@ mutation_partition::range_tombstone_for_row(const schema& schema, const clusteri
row_tombstone
mutation_partition::tombstone_for_row(const schema& schema, const clustering_key& key) const {
check_schema(schema);
row_tombstone t = row_tombstone(range_tombstone_for_row(schema, key));
auto j = _rows.find(key, rows_entry::compare(schema));
@@ -409,6 +437,7 @@ mutation_partition::tombstone_for_row(const schema& schema, const clustering_key
row_tombstone
mutation_partition::tombstone_for_row(const schema& schema, const rows_entry& e) const {
check_schema(schema);
row_tombstone t = e.row().deleted_at();
t.apply(range_tombstone_for_row(schema, e.key()));
return t;
@@ -416,6 +445,7 @@ mutation_partition::tombstone_for_row(const schema& schema, const rows_entry& e)
void
mutation_partition::apply_row_tombstone(const schema& schema, clustering_key_prefix prefix, tombstone t) {
check_schema(schema);
assert(!prefix.is_full(schema));
auto start = prefix;
_row_tombstones.apply(schema, {std::move(start), std::move(prefix), std::move(t)});
@@ -423,11 +453,13 @@ mutation_partition::apply_row_tombstone(const schema& schema, clustering_key_pre
void
mutation_partition::apply_row_tombstone(const schema& schema, range_tombstone rt) {
check_schema(schema);
_row_tombstones.apply(schema, std::move(rt));
}
void
mutation_partition::apply_delete(const schema& schema, const clustering_key_prefix& prefix, tombstone t) {
check_schema(schema);
if (prefix.is_empty(schema)) {
apply(t);
} else if (prefix.is_full(schema)) {
@@ -439,6 +471,7 @@ mutation_partition::apply_delete(const schema& schema, const clustering_key_pref
void
mutation_partition::apply_delete(const schema& schema, range_tombstone rt) {
check_schema(schema);
if (range_tombstone::is_single_clustering_row_tombstone(schema, rt.start, rt.start_kind, rt.end, rt.end_kind)) {
apply_delete(schema, std::move(rt.start), std::move(rt.tomb));
return;
@@ -448,6 +481,7 @@ mutation_partition::apply_delete(const schema& schema, range_tombstone rt) {
void
mutation_partition::apply_delete(const schema& schema, clustering_key&& prefix, tombstone t) {
check_schema(schema);
if (prefix.is_empty(schema)) {
apply(t);
} else if (prefix.is_full(schema)) {
@@ -459,6 +493,7 @@ mutation_partition::apply_delete(const schema& schema, clustering_key&& prefix,
void
mutation_partition::apply_delete(const schema& schema, clustering_key_prefix_view prefix, tombstone t) {
check_schema(schema);
if (prefix.is_empty(schema)) {
apply(t);
} else if (prefix.is_full(schema)) {
@@ -484,6 +519,7 @@ void mutation_partition::insert_row(const schema& s, const clustering_key& key,
}
void mutation_partition::insert_row(const schema& s, const clustering_key& key, const deletable_row& row) {
check_schema(s);
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(s, key, row));
_rows.insert(_rows.end(), *e, rows_entry::compare(s));
@@ -492,6 +528,7 @@ void mutation_partition::insert_row(const schema& s, const clustering_key& key,
const row*
mutation_partition::find_row(const schema& s, const clustering_key& key) const {
check_schema(s);
auto i = _rows.find(key, rows_entry::compare(s));
if (i == _rows.end()) {
return nullptr;
@@ -501,6 +538,7 @@ mutation_partition::find_row(const schema& s, const clustering_key& key) const {
deletable_row&
mutation_partition::clustered_row(const schema& s, clustering_key&& key) {
check_schema(s);
auto i = _rows.find(key, rows_entry::compare(s));
if (i == _rows.end()) {
auto e = alloc_strategy_unique_ptr<rows_entry>(
@@ -513,6 +551,7 @@ mutation_partition::clustered_row(const schema& s, clustering_key&& key) {
deletable_row&
mutation_partition::clustered_row(const schema& s, const clustering_key& key) {
check_schema(s);
auto i = _rows.find(key, rows_entry::compare(s));
if (i == _rows.end()) {
auto e = alloc_strategy_unique_ptr<rows_entry>(
@@ -525,6 +564,7 @@ mutation_partition::clustered_row(const schema& s, const clustering_key& key) {
deletable_row&
mutation_partition::clustered_row(const schema& s, clustering_key_view key) {
check_schema(s);
auto i = _rows.find(key, rows_entry::compare(s));
if (i == _rows.end()) {
auto e = alloc_strategy_unique_ptr<rows_entry>(
@@ -537,6 +577,7 @@ mutation_partition::clustered_row(const schema& s, clustering_key_view key) {
deletable_row&
mutation_partition::clustered_row(const schema& s, position_in_partition_view pos, is_dummy dummy, is_continuous continuous) {
check_schema(s);
auto i = _rows.find(pos, rows_entry::compare(s));
if (i == _rows.end()) {
auto e = alloc_strategy_unique_ptr<rows_entry>(
@@ -549,6 +590,7 @@ mutation_partition::clustered_row(const schema& s, position_in_partition_view po
mutation_partition::rows_type::const_iterator
mutation_partition::lower_bound(const schema& schema, const query::clustering_range& r) const {
check_schema(schema);
if (!r.start()) {
return std::cbegin(_rows);
}
@@ -557,6 +599,7 @@ mutation_partition::lower_bound(const schema& schema, const query::clustering_ra
mutation_partition::rows_type::const_iterator
mutation_partition::upper_bound(const schema& schema, const query::clustering_range& r) const {
check_schema(schema);
if (!r.end()) {
return std::cend(_rows);
}
@@ -565,6 +608,7 @@ mutation_partition::upper_bound(const schema& schema, const query::clustering_ra
boost::iterator_range<mutation_partition::rows_type::const_iterator>
mutation_partition::range(const schema& schema, const query::clustering_range& r) const {
check_schema(schema);
return boost::make_iterator_range(lower_bound(schema, r), upper_bound(schema, r));
}
@@ -601,6 +645,7 @@ mutation_partition::upper_bound(const schema& schema, const query::clustering_ra
template<typename Func>
void mutation_partition::for_each_row(const schema& schema, const query::clustering_range& row_range, bool reversed, Func&& func) const
{
check_schema(schema);
auto r = range(schema, row_range);
if (!reversed) {
for (const auto& e : r) {
@@ -817,6 +862,7 @@ bool has_any_live_data(const schema& s, column_kind kind, const row& cells, tomb
void
mutation_partition::query_compacted(query::result::partition_writer& pw, const schema& s, uint32_t limit) const {
check_schema(s);
const query::partition_slice& slice = pw.slice();
max_timestamp max_ts{pw.last_modified()};
@@ -1049,6 +1095,10 @@ bool mutation_partition::equal(const schema& s, const mutation_partition& p) con
}
bool mutation_partition::equal(const schema& this_schema, const mutation_partition& p, const schema& p_schema) const {
#ifdef SEASTAR_DEBUG
assert(_schema_version == this_schema.version());
assert(p._schema_version == p_schema.version());
#endif
if (_tombstone != p._tombstone) {
return false;
}
@@ -1177,6 +1227,7 @@ row::apply_monotonically(const column_definition& column, atomic_cell_or_collect
void
row::append_cell(column_id id, atomic_cell_or_collection value) {
if (_type == storage_type::vector && id < max_vector_size) {
assert(_storage.vector.v.size() <= id);
_storage.vector.v.resize(id);
_storage.vector.v.emplace_back(cell_and_hash{std::move(value), cell_hash_opt()});
_storage.vector.present.set(id);
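The `append_cell` path above grows the dense cell vector up to the new column id and flags the slot in a presence bitmap. A stripped-down sketch of that storage scheme, with a fixed capacity and `std::string` cells (all names here are illustrative, not Scylla's types):

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stripped-down sketch of row storage as a dense vector plus a presence
// bitmap. Appending at column id `id` requires the vector to currently
// end at or before `id`, mirroring the assert added in the diff above.
constexpr std::size_t max_vector_size = 32;

struct toy_row {
    std::vector<std::string> cells;
    std::bitset<max_vector_size> present;

    void append_cell(std::size_t id, std::string value) {
        assert(cells.size() <= id);   // may only append at or past the end
        cells.resize(id);             // pad unset slots up to `id`
        cells.emplace_back(std::move(value));
        present.set(id);
    }

    bool has_cell(std::size_t id) const {
        return id < cells.size() && present.test(id);
    }
};
```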
@@ -1241,6 +1292,7 @@ size_t rows_entry::memory_usage(const schema& s) const {
}
size_t mutation_partition::external_memory_usage(const schema& s) const {
check_schema(s);
size_t sum = 0;
sum += static_row().external_memory_usage(s, column_kind::static_column);
for (auto& clr : clustered_rows()) {
@@ -1259,6 +1311,7 @@ void mutation_partition::trim_rows(const schema& s,
const std::vector<query::clustering_range>& row_ranges,
Func&& func)
{
check_schema(s);
static_assert(std::is_same<stop_iteration, std::result_of_t<Func(rows_entry&)>>::value, "Bad func signature");
stop_iteration stop = stop_iteration::no;
@@ -1303,6 +1356,7 @@ uint32_t mutation_partition::do_compact(const schema& s,
uint32_t row_limit,
can_gc_fn& can_gc)
{
check_schema(s);
assert(row_limit > 0);
auto gc_before = saturating_subtract(query_time, s.gc_grace_seconds());
@@ -1368,12 +1422,14 @@ mutation_partition::compact_for_query(
bool reverse,
uint32_t row_limit)
{
check_schema(s);
return do_compact(s, query_time, row_ranges, reverse, row_limit, always_gc);
}
void mutation_partition::compact_for_compaction(const schema& s,
can_gc_fn& can_gc, gc_clock::time_point compaction_time)
{
check_schema(s);
static const std::vector<query::clustering_range> all_rows = {
query::clustering_range::make_open_ended_both_sides()
};
@@ -1407,11 +1463,13 @@ row::is_live(const schema& s, column_kind kind, tombstone base_tombstone, gc_clo
bool
mutation_partition::is_static_row_live(const schema& s, gc_clock::time_point query_time) const {
check_schema(s);
return has_any_live_data(s, column_kind::static_column, static_row(), _tombstone, query_time);
}
size_t
mutation_partition::live_row_count(const schema& s, gc_clock::time_point query_time) const {
check_schema(s);
size_t count = 0;
for (const rows_entry& e : non_dummy_rows()) {
@@ -1757,6 +1815,7 @@ row row::difference(const schema& s, column_kind kind, const row& other) const
mutation_partition mutation_partition::difference(schema_ptr s, const mutation_partition& other) const
{
check_schema(*s);
mutation_partition mp(s);
if (_tombstone > other._tombstone) {
mp.apply(_tombstone);
@@ -1787,6 +1846,7 @@ mutation_partition mutation_partition::difference(schema_ptr s, const mutation_p
}
void mutation_partition::accept(const schema& s, mutation_partition_visitor& v) const {
check_schema(s);
v.accept_partition_tombstone(_tombstone);
_static_row.for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
const column_definition& def = s.static_column_at(id);
@@ -2200,6 +2260,9 @@ mutation_partition::mutation_partition(mutation_partition::incomplete_tag, const
, _static_row_continuous(!s.has_static_columns())
, _rows()
, _row_tombstones(s)
#ifdef SEASTAR_DEBUG
, _schema_version(s.version())
#endif
{
_rows.insert_before(_rows.end(),
*current_allocator().construct<rows_entry>(s, rows_entry::last_dummy_tag(), is_continuous::no));
@@ -2265,6 +2328,7 @@ void mutation_partition::set_continuity(const schema& s, const position_range& p
}
clustering_interval_set mutation_partition::get_continuity(const schema& s, is_continuous cont) const {
check_schema(s);
clustering_interval_set result;
auto i = _rows.begin();
auto prev_pos = position_in_partition::before_all_clustered_rows();
@@ -2314,6 +2378,7 @@ stop_iteration mutation_partition::clear_gently(cache_tracker* tracker) noexcept
bool
mutation_partition::check_continuity(const schema& s, const position_range& r, is_continuous cont) const {
check_schema(s);
auto less = rows_entry::compare(s);
auto i = _rows.lower_bound(r.start(), less);
auto end = _rows.lower_bound(r.end(), less);

View File

@@ -397,7 +397,7 @@ public:
if (is_missing() || _ttl == dead) {
return false;
}
- if (_ttl != no_ttl && _expiry < now) {
+ if (_ttl != no_ttl && _expiry <= now) {
return false;
}
return _timestamp > t.timestamp;
@@ -407,7 +407,7 @@ public:
if (_ttl == dead) {
return true;
}
- return _ttl != no_ttl && _expiry < now;
+ return _ttl != no_ttl && _expiry <= now;
}
// Can be called only when is_live().
bool is_expiring() const {
@@ -447,7 +447,7 @@ public:
_timestamp = api::missing_timestamp;
return false;
}
- if (_ttl > no_ttl && _expiry < now) {
+ if (_ttl > no_ttl && _expiry <= now) {
_expiry -= _ttl;
_ttl = dead;
}
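The three hunks above all change the expiry comparison from `<` to `<=`: a cell whose expiry timestamp equals the query time is already dead. A minimal sketch of the boundary condition, with plain integers standing in for `gc_clock` time points (`is_dead_at` and the `no_ttl` sentinel are illustrative, not Scylla's API):

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the boundary condition fixed above; names are
// illustrative. Plain integers stand in for gc_clock time points.
using time_point = std::int64_t;
constexpr std::int64_t no_ttl = 0; // sentinel: no TTL set

// Dead if the cell has a TTL and its expiry time has been reached.
// Note `<=`: a cell expiring exactly at `now` no longer counts as live.
inline bool is_dead_at(std::int64_t ttl, time_point expiry, time_point now) {
    return ttl != no_ttl && expiry <= now;
}
```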
@@ -940,6 +940,9 @@ private:
// Contains only strict prefixes so that we don't have to lookup full keys
// in both _row_tombstones and _rows.
range_tombstone_list _row_tombstones;
#ifdef SEASTAR_DEBUG
table_schema_version _schema_version;
#endif
friend class mutation_partition_applier;
friend class converting_mutation_partition_applier;
@@ -954,10 +957,16 @@ public:
mutation_partition(schema_ptr s)
: _rows()
, _row_tombstones(*s)
#ifdef SEASTAR_DEBUG
, _schema_version(s->version())
#endif
{ }
mutation_partition(mutation_partition& other, copy_comparators_only)
: _rows()
, _row_tombstones(other._row_tombstones, range_tombstone_list::copy_comparator_only())
#ifdef SEASTAR_DEBUG
, _schema_version(other._schema_version)
#endif
{ }
mutation_partition(mutation_partition&&) = default;
mutation_partition(const schema& s, const mutation_partition&);
@@ -1181,6 +1190,12 @@ private:
template<typename Func>
void for_each_row(const schema& schema, const query::clustering_range& row_range, bool reversed, Func&& func) const;
friend class counter_write_query_result_builder;
void check_schema(const schema& s) const {
#ifdef SEASTAR_DEBUG
assert(s.version() == _schema_version);
#endif
}
};
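The pattern added throughout this file, stash the schema version at construction and call `check_schema()` at every entry point, is a debug-only invariant check that compiles away in release builds. A self-contained sketch of the idiom (the `DEBUG_CHECKS` macro, `schema_t`, and `versioned_container` are illustrative stand-ins for `SEASTAR_DEBUG` and the real types):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#define DEBUG_CHECKS 1 // stand-in for SEASTAR_DEBUG

struct schema_t { std::uint64_t version; };

// Remember which schema version this object was built against, and
// assert that every caller passes the same one. With DEBUG_CHECKS off,
// both the field and the check vanish.
class versioned_container {
#if DEBUG_CHECKS
    std::uint64_t _schema_version;
#endif
public:
    explicit versioned_container(const schema_t& s)
#if DEBUG_CHECKS
        : _schema_version(s.version)
#endif
    { }

    void check_schema(const schema_t& s) const {
#if DEBUG_CHECKS
        assert(s.version == _schema_version);
#endif
        (void)s;
    }

    std::size_t size(const schema_t& s) const {
        check_schema(s); // every schema-taking method validates first
        return 0;
    }
};
```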
inline

View File

@@ -31,7 +31,7 @@ reconcilable_result::reconcilable_result()
: _row_count(0)
{ }
- reconcilable_result::reconcilable_result(uint32_t row_count, std::vector<partition> p, query::short_read short_read,
+ reconcilable_result::reconcilable_result(uint32_t row_count, utils::chunked_vector<partition> p, query::short_read short_read,
query::result_memory_tracker memory_tracker)
: _row_count(row_count)
, _short_read(short_read)
@@ -39,11 +39,11 @@ reconcilable_result::reconcilable_result(uint32_t row_count, std::vector<partiti
, _partitions(std::move(p))
{ }
- const std::vector<partition>& reconcilable_result::partitions() const {
+ const utils::chunked_vector<partition>& reconcilable_result::partitions() const {
return _partitions;
}
- std::vector<partition>& reconcilable_result::partitions() {
+ utils::chunked_vector<partition>& reconcilable_result::partitions() {
return _partitions;
}

View File

@@ -27,6 +27,7 @@
#include "frozen_mutation.hh"
#include "db/timeout_clock.hh"
#include "querier.hh"
#include "utils/chunked_vector.hh"
#include <seastar/core/execution_stage.hh>
class reconcilable_result;
@@ -72,17 +73,17 @@ class reconcilable_result {
uint32_t _row_count;
query::short_read _short_read;
query::result_memory_tracker _memory_tracker;
- std::vector<partition> _partitions;
+ utils::chunked_vector<partition> _partitions;
public:
~reconcilable_result();
reconcilable_result();
reconcilable_result(reconcilable_result&&) = default;
reconcilable_result& operator=(reconcilable_result&&) = default;
- reconcilable_result(uint32_t row_count, std::vector<partition> partitions, query::short_read short_read,
+ reconcilable_result(uint32_t row_count, utils::chunked_vector<partition> partitions, query::short_read short_read,
query::result_memory_tracker memory_tracker = { });
- const std::vector<partition>& partitions() const;
- std::vector<partition>& partitions();
+ const utils::chunked_vector<partition>& partitions() const;
+ utils::chunked_vector<partition>& partitions();
uint32_t row_count() const {
return _row_count;
@@ -112,7 +113,7 @@ class reconcilable_result_builder {
const schema& _schema;
const query::partition_slice& _slice;
- std::vector<partition> _result;
+ utils::chunked_vector<partition> _result;
uint32_t _live_rows{};
bool _has_ck_selector{};

View File

@@ -910,9 +910,10 @@ class shard_reader : public enable_lw_shared_from_this<shard_reader>, public fla
bool _reader_created = false;
bool _drop_partition_start = false;
bool _drop_static_row = false;
+ position_in_partition::tri_compare _tri_cmp;
std::optional<dht::decorated_key> _last_pkey;
- std::optional<position_in_partition> _last_position_in_partition;
+ position_in_partition _next_position_in_partition = position_in_partition::for_partition_start();
// These are used when the reader has to be recreated (after having been
// evicted while paused) and the range and/or slice it is recreated with
// differs from the original ones.
@@ -920,13 +921,13 @@ class shard_reader : public enable_lw_shared_from_this<shard_reader>, public fla
std::optional<query::partition_slice> _slice_override;
private:
- void update_last_position(const circular_buffer<mutation_fragment>& buffer);
+ void update_next_position(flat_mutation_reader& reader, circular_buffer<mutation_fragment>& buffer);
void adjust_partition_slice();
flat_mutation_reader recreate_reader();
flat_mutation_reader resume_or_create_reader();
bool should_drop_fragment(const mutation_fragment& mf);
future<> do_fill_buffer(flat_mutation_reader& reader, db::timeout_clock::time_point timeout);
- future<> ensure_buffer_contains_all_fragments_for_last_pos(flat_mutation_reader& reader, circular_buffer<mutation_fragment>& buffer,
- db::timeout_clock::time_point timeout);
+ future<> fill_buffer(flat_mutation_reader& reader, circular_buffer<mutation_fragment>& buffer, db::timeout_clock::time_point timeout);
public:
remote_reader(
@@ -1024,7 +1025,7 @@ void shard_reader::stop() noexcept {
}).finally([zis = shared_from_this()] {}));
}
- void shard_reader::remote_reader::update_last_position(const circular_buffer<mutation_fragment>& buffer) {
+ void shard_reader::remote_reader::update_next_position(flat_mutation_reader& reader, circular_buffer<mutation_fragment>& buffer) {
if (buffer.empty()) {
return;
}
@@ -1035,7 +1036,31 @@ void shard_reader::remote_reader::update_last_position(const circular_buffer<mut
_last_pkey = pk_it->as_partition_start().key();
}
- _last_position_in_partition.emplace(buffer.back().position());
+ const auto last_pos = buffer.back().position();
+ switch (last_pos.region()) {
+ case partition_region::partition_start:
+ _next_position_in_partition = position_in_partition::for_static_row();
+ break;
+ case partition_region::static_row:
+ _next_position_in_partition = position_in_partition::before_all_clustered_rows();
+ break;
+ case partition_region::clustered:
+ if (reader.is_buffer_empty()) {
+ _next_position_in_partition = position_in_partition::after_key(last_pos);
+ } else {
+ const auto& next_frag = reader.peek_buffer();
+ if (next_frag.is_end_of_partition()) {
+ buffer.emplace_back(reader.pop_mutation_fragment());
+ _next_position_in_partition = position_in_partition::for_partition_start();
+ } else {
+ _next_position_in_partition = position_in_partition(next_frag.position());
+ }
+ }
+ break;
+ case partition_region::partition_end:
+ _next_position_in_partition = position_in_partition::for_partition_start();
+ break;
+ }
}
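`update_next_position` boils down to a small state machine over `partition_region`: given the region of the last buffered fragment, where does the stream resume? A toy transition table for that logic (the clustered case is simplified, since the real code also peeks at the next fragment to choose between `after_key` and the peeked position):

```cpp
#include <cassert>

// Toy transition table for the "resume position" bookkeeping above.
// The clustered case is simplified: the real reader distinguishes
// between resuming after the last key and adopting the peeked
// fragment's position.
enum class partition_region { partition_start, static_row, clustered, partition_end };

inline partition_region next_region(partition_region last) {
    switch (last) {
    case partition_region::partition_start: return partition_region::static_row;
    case partition_region::static_row:      return partition_region::clustered; // before_all_clustered_rows
    case partition_region::clustered:       return partition_region::clustered; // after_key(last)
    case partition_region::partition_end:   return partition_region::partition_start;
    }
    return partition_region::partition_start; // unreachable
}
```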
void shard_reader::remote_reader::adjust_partition_slice() {
@@ -1043,9 +1068,8 @@ void shard_reader::remote_reader::adjust_partition_slice() {
_slice_override = _ps;
}
- auto& last_ckey = _last_position_in_partition->key();
auto ranges = _slice_override->default_row_ranges();
- query::trim_clustering_row_ranges_to(*_schema, ranges, last_ckey);
+ query::trim_clustering_row_ranges_to(*_schema, ranges, _next_position_in_partition);
_slice_override->clear_ranges();
_slice_override->set_range(*_schema, _last_pkey->key(), std::move(ranges));
@@ -1058,25 +1082,22 @@ flat_mutation_reader shard_reader::remote_reader::recreate_reader() {
if (_last_pkey) {
bool partition_range_is_inclusive = true;
- if (_last_position_in_partition) {
- switch (_last_position_in_partition->region()) {
- case partition_region::partition_start:
- _drop_partition_start = true;
- break;
- case partition_region::static_row:
- _drop_partition_start = true;
- _drop_static_row = true;
- break;
- case partition_region::clustered:
- _drop_partition_start = true;
- _drop_static_row = true;
- adjust_partition_slice();
- slice = &*_slice_override;
- break;
- case partition_region::partition_end:
- partition_range_is_inclusive = false;
- break;
- }
+ switch (_next_position_in_partition.region()) {
+ case partition_region::partition_start:
+ partition_range_is_inclusive = false;
+ break;
+ case partition_region::static_row:
+ _drop_partition_start = true;
+ break;
+ case partition_region::clustered:
+ _drop_partition_start = true;
+ _drop_static_row = true;
+ adjust_partition_slice();
+ slice = &*_slice_override;
+ break;
+ case partition_region::partition_end:
+ partition_range_is_inclusive = false;
+ break;
}
// The original range contained a single partition and we've read it
@@ -1115,62 +1136,83 @@ flat_mutation_reader shard_reader::remote_reader::resume_or_create_reader() {
return recreate_reader();
}
bool shard_reader::remote_reader::should_drop_fragment(const mutation_fragment& mf) {
if (_drop_partition_start && mf.is_partition_start()) {
_drop_partition_start = false;
return true;
}
if (_drop_static_row && mf.is_static_row()) {
_drop_static_row = false;
return true;
}
return false;
}
future<> shard_reader::remote_reader::do_fill_buffer(flat_mutation_reader& reader, db::timeout_clock::time_point timeout) {
if (!_drop_partition_start && !_drop_static_row) {
return reader.fill_buffer(timeout);
}
return repeat([this, &reader, timeout] {
return reader.fill_buffer(timeout).then([this, &reader] {
- const auto eos = reader.is_end_of_stream();
- if (reader.is_buffer_empty()) {
- return stop_iteration(eos);
+ while (!reader.is_buffer_empty() && should_drop_fragment(reader.peek_buffer())) {
+ reader.pop_mutation_fragment();
}
- if (_drop_partition_start) {
- _drop_partition_start = false;
- if (reader.peek_buffer().is_partition_start()) {
- reader.pop_mutation_fragment();
- }
- }
- if (reader.is_buffer_empty()) {
- return stop_iteration(eos);
- }
- if (_drop_static_row) {
- _drop_static_row = false;
- if (reader.peek_buffer().is_static_row()) {
- reader.pop_mutation_fragment();
- }
- }
- return stop_iteration(reader.is_buffer_full() || eos);
+ return stop_iteration(reader.is_buffer_full() || reader.is_end_of_stream());
});
});
}
- future<> shard_reader::remote_reader::ensure_buffer_contains_all_fragments_for_last_pos(flat_mutation_reader& reader,
- circular_buffer<mutation_fragment>& buffer, db::timeout_clock::time_point timeout) {
- if (buffer.empty() || !buffer.back().is_range_tombstone()) {
- return make_ready_future<>();
- }
- auto stop = [this, &reader, &buffer] {
+ future<> shard_reader::remote_reader::fill_buffer(flat_mutation_reader& reader, circular_buffer<mutation_fragment>& buffer,
+ db::timeout_clock::time_point timeout) {
+ return do_fill_buffer(reader, timeout).then([this, &reader, &buffer, timeout] {
if (reader.is_buffer_empty()) {
- return reader.is_end_of_stream();
+ return make_ready_future<>();
}
- const auto& next_pos = reader.peek_buffer().position();
- if (next_pos.region() != partition_region::clustered) {
- return true;
- }
- return !next_pos.key().equal(*_schema, buffer.back().position().key());
- };
- return do_until(stop, [this, &reader, &buffer, timeout] {
- if (reader.is_buffer_empty()) {
- return do_fill_buffer(reader, timeout);
- }
- buffer.emplace_back(reader.pop_mutation_fragment());
- return make_ready_future<>();
+ buffer = reader.detach_buffer();
+ auto stop = [this, &reader, &buffer] {
+ // The only problematic fragment kind is the range tombstone.
+ // All other fragment kinds are safe to end the buffer on, and
+ // are guaranteed to represent progress vs. the last buffer fill.
+ if (!buffer.back().is_range_tombstone()) {
+ return true;
+ }
+ if (reader.is_buffer_empty()) {
+ return reader.is_end_of_stream();
+ }
+ const auto& next_pos = reader.peek_buffer().position();
+ // To ensure safe progress we have to ensure the following:
+ //
+ // _next_position_in_partition < buffer.back().position() < next_pos
+ //
+ // * The first condition ensures we made progress since the
+ // last buffer fill. Otherwise we might get into an endless loop if
+ // the reader is recreated after each `fill_buffer()` call.
+ // * The second condition ensures we have seen all fragments
+ // with the same position. Otherwise we might jump over the
+ // remaining fragments with the same position as the last
+ // fragment in the buffer when the reader is recreated.
+ return _tri_cmp(_next_position_in_partition, buffer.back().position()) < 0 && _tri_cmp(buffer.back().position(), next_pos) < 0;
+ };
+ // Read additional fragments until it is safe to stop, if needed.
+ // We have to ensure we stop at a fragment such that if the reader is
+ // evicted and recreated later, we won't be skipping any fragments.
+ // Practically, range tombstones are the only ones that are
+ // problematic to end the buffer on. This is because a range
+ // tombstone can have the same position as multiple following range
+ // tombstones, or as a single following clustering row in the stream.
+ // When a range tombstone is the last in the buffer, we have to continue
+ // to read until we are sure we've read all fragments sharing the same
+ // position, so that we can safely continue reading from after said
+ // position.
+ return do_until(stop, [this, &reader, &buffer, timeout] {
+ if (reader.is_buffer_empty()) {
+ return do_fill_buffer(reader, timeout);
+ }
+ buffer.emplace_back(reader.pop_mutation_fragment());
+ return make_ready_future<>();
+ });
});
+ }).then([this, &reader, &buffer] {
+ update_next_position(reader, buffer);
+ });
}
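The stop condition's two strict comparisons can be seen in isolation: stopping on the last buffered position is only safe if we moved past the previous resume position (so a recreated reader cannot loop forever) and the upcoming fragment's position is strictly greater (so every fragment sharing the last position is already buffered and a recreated reader cannot skip one). A toy version with integer positions (`safe_to_stop` and the flat `tri_cmp` are illustrative stand-ins for the real comparator):

```cpp
#include <cassert>

// Toy version of the buffer stop condition above, with plain ints for
// positions and a three-way compare standing in for
// position_in_partition::tri_compare.
inline int tri_cmp(int a, int b) { return (a > b) - (a < b); }

// prev: resume position recorded after the previous fill
// last: position of the last fragment currently in the buffer
// next: position of the upcoming (peeked) fragment
inline bool safe_to_stop(int prev, int last, int next) {
    return tri_cmp(prev, last) < 0 && tri_cmp(last, next) < 0;
}
```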
@@ -1188,7 +1230,8 @@ shard_reader::remote_reader::remote_reader(
, _ps(ps)
, _pc(pc)
, _trace_state(std::move(trace_state))
, _fwd_mr(fwd_mr) {
, _fwd_mr(fwd_mr)
, _tri_cmp(*_schema) {
}
future<shard_reader::fill_buffer_result> shard_reader::remote_reader::fill_buffer(const dht::partition_range& pr, bool pending_next_partition,
@@ -1196,7 +1239,7 @@ future<shard_reader::fill_buffer_result> shard_reader::remote_reader::fill_buffe
// We could have missed a `fast_forward_to()` if the reader wasn't created yet.
_pr = &pr;
if (pending_next_partition) {
- _last_position_in_partition = position_in_partition(position_in_partition::end_of_partition_tag_t{});
+ _next_position_in_partition = position_in_partition::for_partition_start();
}
return do_with(resume_or_create_reader(), circular_buffer<mutation_fragment>{},
[this, pending_next_partition, timeout] (flat_mutation_reader& reader, circular_buffer<mutation_fragment>& buffer) mutable {
@@ -1204,22 +1247,8 @@ future<shard_reader::fill_buffer_result> shard_reader::remote_reader::fill_buffe
reader.next_partition();
}
- return do_fill_buffer(reader, timeout).then([this, &reader, &buffer, timeout] {
- buffer = reader.detach_buffer();
- // When the reader is recreated (after having been evicted) we
- // recreate it such that it starts reading from *after* the last
- // seen fragment's position. If the last seen fragment is a range
- // tombstone it is *not* guaranteed that the next fragments in the
- // data stream have positions strictly greater than the range
- // tombstone's. If the reader is evicted and has to be recreated,
- // these fragments would be then skipped as the read would continue
- // after their position.
- // To avoid this ensure that the buffer contains *all* fragments for
- // the last seen position.
- return ensure_buffer_contains_all_fragments_for_last_pos(reader, buffer, timeout);
- }).then([this, &reader, &buffer] {
+ return fill_buffer(reader, buffer, timeout).then([this, &reader, &buffer] {
const auto eos = reader.is_end_of_stream() && reader.is_buffer_empty();
- update_last_position(buffer);
_irh = _lifecycle_policy.pause(std::move(reader));
return fill_buffer_result(std::move(buffer), eos);
});
@@ -1229,7 +1258,7 @@ future<shard_reader::fill_buffer_result> shard_reader::remote_reader::fill_buffe
future<> shard_reader::remote_reader::fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) {
_pr = &pr;
_last_pkey.reset();
- _last_position_in_partition.reset();
+ _next_position_in_partition = position_in_partition::for_partition_start();
if (!_reader_created || !_irh) {
return make_ready_future<>();

View File

@@ -172,6 +172,9 @@ tombstone partition_entry::partition_tombstone() const {
partition_snapshot::~partition_snapshot() {
with_allocator(region().allocator(), [this] {
if (_locked) {
touch();
}
if (_version && _version.is_unique_owner()) {
auto v = &*_version;
_version = {};
@@ -268,6 +271,7 @@ partition_entry::~partition_entry() {
return;
}
if (_snapshot) {
assert(!_snapshot->is_locked());
_snapshot->_version = std::move(_version);
_snapshot->_version.mark_as_unique_owner();
_snapshot->_entry = nullptr;
@@ -284,6 +288,7 @@ stop_iteration partition_entry::clear_gently(cache_tracker* tracker) noexcept {
}
if (_snapshot) {
assert(!_snapshot->is_locked());
_snapshot->_version = std::move(_version);
_snapshot->_version.mark_as_unique_owner();
_snapshot->_entry = nullptr;
@@ -311,6 +316,7 @@ stop_iteration partition_entry::clear_gently(cache_tracker* tracker) noexcept {
void partition_entry::set_version(partition_version* new_version)
{
if (_snapshot) {
assert(!_snapshot->is_locked());
_snapshot->_version = std::move(_version);
_snapshot->_entry = nullptr;
}
@@ -338,7 +344,7 @@ partition_version& partition_entry::add_version(const schema& s, cache_tracker*
void partition_entry::apply(const schema& s, const mutation_partition& mp, const schema& mp_schema)
{
- apply(s, mutation_partition(s, mp), mp_schema);
+ apply(s, mutation_partition(mp_schema, mp), mp_schema);
}
void partition_entry::apply(const schema& s, mutation_partition&& mp, const schema& mp_schema)
@@ -459,7 +465,6 @@ public:
coroutine partition_entry::apply_to_incomplete(const schema& s,
partition_entry&& pe,
const schema& pe_schema,
- mutation_cleaner& pe_cleaner,
logalloc::allocating_section& alloc,
logalloc::region& reg,
@@ -479,10 +484,6 @@ coroutine partition_entry::apply_to_incomplete(const schema& s,
// partitions where I saw 40% slow down.
const bool preemptible = s.clustering_key_size() > 0;
- if (s.version() != pe_schema.version()) {
- pe.upgrade(pe_schema.shared_from_this(), s.shared_from_this(), pe_cleaner, no_cache_tracker);
- }
// When preemptible, later memtable reads could start using the snapshot before
// snapshot's writes are made visible in cache, which would cause them to miss those writes.
// So we cannot allow erasing when preemptible.
@@ -496,6 +497,7 @@ coroutine partition_entry::apply_to_incomplete(const schema& s,
prev_snp = read(reg, tracker.cleaner(), s.shared_from_this(), &tracker, phase - 1);
}
auto dst_snp = read(reg, tracker.cleaner(), s.shared_from_this(), &tracker, phase);
dst_snp->lock();
// Once we start updating the partition, we must keep all snapshots until the update completes,
// otherwise partial writes would be published. So the scope of snapshots must enclose the scope
@@ -570,6 +572,7 @@ coroutine partition_entry::apply_to_incomplete(const schema& s,
auto has_next = src_cur.erase_and_advance();
acc.unpin_memory(size);
if (!has_next) {
dst_snp->unlock();
return stop_iteration::yes;
}
} while (!preemptible || !need_preempt());
@@ -661,6 +664,18 @@ partition_snapshot::range_tombstones()
position_in_partition_view::after_all_clustered_rows());
}
void partition_snapshot::touch() noexcept {
// Eviction assumes that older versions are evicted before newer so only the latest snapshot
// can be touched.
if (_tracker && at_latest_version()) {
auto&& rows = version()->partition().clustered_rows();
assert(!rows.empty());
rows_entry& last_dummy = *rows.rbegin();
assert(last_dummy.is_last_dummy());
_tracker->touch(last_dummy);
}
}
std::ostream& operator<<(std::ostream& out, const partition_entry::printer& p) {
auto& e = p._partition_entry;
out << "{";
@@ -688,6 +703,7 @@ void partition_entry::evict(mutation_cleaner& cleaner) noexcept {
return;
}
if (_snapshot) {
assert(!_snapshot->is_locked());
_snapshot->_version = std::move(_version);
_snapshot->_version.mark_as_unique_owner();
_snapshot->_entry = nullptr;
@@ -707,3 +723,18 @@ partition_snapshot_ptr::~partition_snapshot_ptr() {
}
}
}
void partition_snapshot::lock() noexcept {
// partition_entry::is_locked() assumes that if there is a locked snapshot,
// it can be found attached directly to it.
assert(at_latest_version());
_locked = true;
}
void partition_snapshot::unlock() noexcept {
// Locked snapshots must always be latest, is_locked() assumes that.
// Also, touch() is only effective when this snapshot is latest.
assert(at_latest_version());
_locked = false;
touch(); // Make the entry evictable again in case it was fully unlinked by eviction attempt.
}

View File

@@ -303,6 +303,7 @@ private:
mutation_cleaner* _cleaner;
cache_tracker* _tracker;
boost::intrusive::slist_member_hook<> _cleaner_hook;
bool _locked = false;
friend class partition_entry;
friend class mutation_cleaner_impl;
public:
@@ -318,6 +319,22 @@ public:
partition_snapshot& operator=(const partition_snapshot&) = delete;
partition_snapshot& operator=(partition_snapshot&&) = delete;
// Makes the snapshot locked.
// See is_locked() for meaning.
// Can be called only when at_latest_version(). The snapshot must remain latest as long as it's locked.
void lock() noexcept;
// Makes the snapshot no longer locked.
// See is_locked() for meaning.
void unlock() noexcept;
// Tells whether the snapshot is locked.
// Locking the snapshot prevents it from getting detached from the partition entry.
// It also prevents the partition entry from being evicted.
bool is_locked() const {
return _locked;
}
static partition_snapshot& container_of(partition_version_ref* ref) {
return *boost::intrusive::get_parent_from_member(ref, &partition_snapshot::_version);
}
@@ -344,6 +361,9 @@ public:
// to the latest version.
stop_iteration slide_to_oldest() noexcept;
// Brings the snapshot to the front of the LRU.
void touch() noexcept;
// Must be called after snapshot's original region is merged into a different region
// before the original region is destroyed, unless the snapshot is destroyed earlier.
void migrate(logalloc::region* region, mutation_cleaner* cleaner) noexcept {
@@ -503,9 +523,18 @@ public:
return _version->all_elements_reversed();
}
// Tells whether this entry is locked.
// Locked entries are undergoing an update and should not have their snapshots
// detached from the entry.
// Certain methods can only be called when !is_locked().
bool is_locked() const {
return _snapshot && _snapshot->is_locked();
}
// Strong exception guarantees.
// Assumes this instance and mp are fully continuous.
// Use only on non-evictable entries.
// Must not be called when is_locked().
void apply(const schema& s, const mutation_partition& mp, const schema& mp_schema);
void apply(const schema& s, mutation_partition&& mp, const schema& mp_schema);
@@ -526,11 +555,14 @@ public:
// such that if the operation is retried (possibly many times) and eventually
// succeeds the result will be as if the first attempt didn't fail.
//
// The schema of pe must conform to s.
//
// Returns a coroutine object representing the operation.
// The coroutine must be resumed with the region being unlocked.
//
// The coroutine cannot run concurrently with other apply() calls.
coroutine apply_to_incomplete(const schema& s,
partition_entry&& pe,
const schema& pe_schema,
- mutation_cleaner& pe_cleaner,
logalloc::allocating_section&,
logalloc::region&,
@@ -539,6 +571,7 @@ public:
real_dirty_memory_accounter&);
// If this entry is evictable, cache_tracker must be provided.
// Must not be called when is_locked().
partition_version& add_version(const schema& s, cache_tracker*);
// Returns a reference to existing version with an active snapshot of given phase
@@ -568,9 +601,11 @@ public:
tombstone partition_tombstone() const;
// needs to be called with reclaiming disabled
// Must not be called when is_locked().
void upgrade(schema_ptr from, schema_ptr to, mutation_cleaner&, cache_tracker*);
// Snapshots with different values of phase will point to different partition_version objects.
// When is_locked(), read() can only be called with a phase which is <= the phase of the current snapshot.
partition_snapshot_ptr read(logalloc::region& region,
mutation_cleaner&,
schema_ptr entry_schema,

View File

@@ -129,6 +129,8 @@ public:
: _type(partition_region::clustered), _ck(&ck) { }
position_in_partition_view(range_tag_t, bound_view bv)
: _type(partition_region::clustered), _bound_weight(position_weight(bv.kind())), _ck(&bv.prefix()) { }
position_in_partition_view(const clustering_key_prefix& ck, bound_weight w)
: _type(partition_region::clustered), _bound_weight(w), _ck(&ck) { }
static position_in_partition_view for_range_start(const query::clustering_range& r) {
return {position_in_partition_view::range_tag_t(), bound_view::from_range_start(r)};
@@ -159,6 +161,7 @@ public:
}
partition_region region() const { return _type; }
bound_weight get_bound_weight() const { return _bound_weight; }
bool is_partition_start() const { return _type == partition_region::partition_start; }
bool is_partition_end() const { return _type == partition_region::partition_end; }
bool is_static_row() const { return _type == partition_region::static_row; }
@@ -271,6 +274,10 @@ public:
return {clustering_row_tag_t(), std::move(ck)};
}
static position_in_partition for_partition_start() {
return position_in_partition{partition_start_tag_t()};
}
static position_in_partition for_static_row() {
return position_in_partition{static_row_tag_t()};
}


@@ -286,11 +286,11 @@ static void insert_querier(
auto& e = entries.emplace_back(key, std::move(q), expires);
e.set_pos(--entries.end());
++stats.population;
if (auto irh = sem.register_inactive_read(std::make_unique<querier_inactive_read>(entries, e.pos(), stats))) {
e.set_inactive_handle(std::move(irh));
index.insert(e);
++stats.population;
}
}


@@ -31,6 +31,8 @@
#include "tracing/tracing.hh"
#include "utils/small_vector.hh"
class position_in_partition_view;
namespace query {
using column_id_vector = utils::small_vector<column_id, 8>;
@@ -58,10 +60,20 @@ typedef std::vector<clustering_range> clustering_row_ranges;
/// Trim the clustering ranges.
///
/// Equivalent of intersecting each range with [key, +inf), or (-inf, key] if
/// Equivalent of intersecting each clustering range with [pos, +inf) position
/// in partition range, or (-inf, pos] position in partition range if
/// reversed == true. Ranges that do not intersect are dropped. Ranges that
/// partially overlap are trimmed.
/// Result: each range will overlap fully with [key, +inf), or (-inf, key] if
/// Result: each range will overlap fully with [pos, +inf), or (-inf, pos] if
/// reversed is true.
void trim_clustering_row_ranges_to(const schema& s, clustering_row_ranges& ranges, position_in_partition_view pos, bool reversed = false);
/// Trim the clustering ranges.
///
/// Equivalent of intersecting each clustering range with (key, +inf) clustering
/// range, or (-inf, key) clustering range if reversed == true. Ranges that do
/// not intersect are dropped. Ranges that partially overlap are trimmed.
/// Result: each range will overlap fully with (key, +inf), or (-inf, key) if
/// reversed is true.
void trim_clustering_row_ranges_to(const schema& s, clustering_row_ranges& ranges, const clustering_key& key, bool reversed = false);


@@ -71,34 +71,38 @@ std::ostream& operator<<(std::ostream& out, const specific_ranges& s) {
return out << "{" << s._pk << " : " << join(", ", s._ranges) << "}";
}
void trim_clustering_row_ranges_to(const schema& s, clustering_row_ranges& ranges, const clustering_key& key, bool reversed) {
auto cmp = [reversed, bv_cmp = bound_view::compare(s)] (const auto& a, const auto& b) {
return reversed ? bv_cmp(b, a) : bv_cmp(a, b);
void trim_clustering_row_ranges_to(const schema& s, clustering_row_ranges& ranges, position_in_partition_view pos, bool reversed) {
auto cmp = [reversed, cmp = position_in_partition::composite_tri_compare(s)] (const auto& a, const auto& b) {
return reversed ? cmp(b, a) : cmp(a, b);
};
auto start_bound = [reversed] (const auto& range) -> const bound_view& {
return reversed ? range.second : range.first;
auto start_bound = [reversed] (const auto& range) -> position_in_partition_view {
return reversed ? position_in_partition_view::for_range_end(range) : position_in_partition_view::for_range_start(range);
};
auto end_bound = [reversed] (const auto& range) -> const bound_view& {
return reversed ? range.first : range.second;
auto end_bound = [reversed] (const auto& range) -> position_in_partition_view {
return reversed ? position_in_partition_view::for_range_start(range) : position_in_partition_view::for_range_end(range);
};
clustering_key_prefix::equality eq(s);
auto it = ranges.begin();
while (it != ranges.end()) {
auto range = bound_view::from_range(*it);
if (cmp(end_bound(range), key) || eq(end_bound(range).prefix(), key)) {
if (cmp(end_bound(*it), pos) <= 0) {
it = ranges.erase(it);
continue;
} else if (cmp(start_bound(range), key)) {
assert(cmp(key, end_bound(range)));
auto r = reversed ? clustering_range(it->start(), clustering_range::bound { key, false })
: clustering_range(clustering_range::bound { key, false }, it->end());
} else if (cmp(start_bound(*it), pos) <= 0) {
assert(cmp(pos, end_bound(*it)) < 0);
auto r = reversed ?
clustering_range(it->start(), clustering_range::bound(pos.key(), pos.get_bound_weight() != bound_weight::before_all_prefixed)) :
clustering_range(clustering_range::bound(pos.key(), pos.get_bound_weight() != bound_weight::after_all_prefixed), it->end());
*it = std::move(r);
}
++it;
}
}
void trim_clustering_row_ranges_to(const schema& s, clustering_row_ranges& ranges, const clustering_key& key, bool reversed) {
return trim_clustering_row_ranges_to(s, ranges,
position_in_partition_view(key, reversed ? bound_weight::before_all_prefixed : bound_weight::after_all_prefixed), reversed);
}
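The trimming semantics above can be sketched without the Scylla types: a minimal, hypothetical model where keys are plain ints and each range is a closed interval. Trimming to position `key` (exclusive, i.e. intersecting with `(key, +inf)`) drops ranges that end at or before `key` and shortens ranges that straddle it; `bound_weight` and reversed ranges are deliberately left out.

```cpp
#include <utility>
#include <vector>

// Hypothetical simplification of trim_clustering_row_ranges_to: keys are
// plain ints and ranges are closed intervals [first, second]. Trimming to
// `key` keeps only the part of each range lying in (key, +inf).
using range = std::pair<int, int>;

void trim_ranges_to(std::vector<range>& ranges, int key) {
    auto it = ranges.begin();
    while (it != ranges.end()) {
        if (it->second <= key) {   // range ends at or before key: drop it
            it = ranges.erase(it);
            continue;
        }
        if (it->first <= key) {    // range straddles key: trim the front
            it->first = key + 1;
        }
        ++it;
    }
}
```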
partition_slice::partition_slice(clustering_row_ranges row_ranges,
query::column_id_vector static_columns,
query::column_id_vector regular_columns,


@@ -187,7 +187,7 @@ public:
const dht::decorated_key& key() const { return *_key; }
void on_underlying_created() { ++_underlying_created; }
bool digest_requested() const { return _slice.options.contains<query::partition_slice::option::with_digest>(); }
private:
public:
future<> ensure_underlying(db::timeout_clock::time_point timeout) {
if (_underlying_snapshot) {
return create_underlying(true, timeout);
@@ -206,18 +206,6 @@ public:
_underlying_snapshot = {};
_key = dk;
}
// Fast forwards the underlying streamed_mutation to given range.
future<> fast_forward_to(position_range range, db::timeout_clock::time_point timeout) {
return ensure_underlying(timeout).then([this, range = std::move(range), timeout] {
return _underlying.underlying().fast_forward_to(std::move(range), timeout);
});
}
// Gets the next fragment from the underlying reader
future<mutation_fragment_opt> get_next_fragment(db::timeout_clock::time_point timeout) {
return ensure_underlying(timeout).then([this, timeout] {
return _underlying.underlying()(timeout);
});
}
};
}


@@ -8,7 +8,6 @@ print_usage() {
echo " --clean clean build directory"
echo " --compiler C++ compiler path"
echo " --c-compiler C compiler path"
echo " --nodeps skip installing dependencies"
exit 1
}
@@ -16,7 +15,6 @@ JOBS=
CLEAN=
COMPILER=
CCOMPILER=
NODEPS=
while [ $# -gt 0 ]; do
case "$1" in
"--jobs")
@@ -36,7 +34,6 @@ while [ $# -gt 0 ]; do
shift 2
;;
"--nodeps")
NODEPS=yes
shift 1
;;
*)
@@ -66,10 +63,6 @@ if [ -f build/release/scylla-package.tar.gz ]; then
rm build/release/scylla-package.tar.gz
fi
if [ -z "$NODEPS" ]; then
sudo ./install-dependencies.sh
fi
NINJA=$(which ninja-build) &&:
if [ -z "$NINJA" ]; then
NINJA=$(which ninja) &&:

reloc/python3/build_deb.sh Executable file

@@ -0,0 +1,37 @@
#!/bin/bash -e
. /etc/os-release
print_usage() {
echo "build_deb.sh --reloc-pkg build/release/scylla-python3-package.tar.gz"
echo " --reloc-pkg specify relocatable package path"
exit 1
}
RELOC_PKG=build/release/scylla-python3-package.tar.gz
OPTS=""
while [ $# -gt 0 ]; do
case "$1" in
"--reloc-pkg")
OPTS="$OPTS $1 $(readlink -f $2)"
RELOC_PKG=$2
shift 2
;;
*)
print_usage
;;
esac
done
if [ ! -e $RELOC_PKG ]; then
echo "$RELOC_PKG does not exist."
echo "Run ./reloc/python3/build_reloc.sh first."
exit 1
fi
RELOC_PKG=$(readlink -f $RELOC_PKG)
if [[ ! $OPTS =~ --reloc-pkg ]]; then
OPTS="$OPTS --reloc-pkg $RELOC_PKG"
fi
mkdir -p build/debian/scylla-python3-package
tar -C build/debian/scylla-python3-package -xpf $RELOC_PKG
cd build/debian/scylla-python3-package
exec ./dist/debian/python3/build_deb.sh $OPTS


@@ -780,8 +780,10 @@ static future<> repair_cf_range(repair_info& ri,
// still do our best to repair available replicas.
std::vector<gms::inet_address> live_neighbors;
std::vector<partition_checksum> live_neighbors_checksum;
bool local_checksum_failed = false;
for (unsigned i = 0; i < checksums.size(); i++) {
if (checksums[i].failed()) {
local_checksum_failed |= (i == 0);
rlogger.warn(
"Checksum of ks={}, table={}, range={} on {} failed: {}",
ri.keyspace, cf, range,
@@ -797,7 +799,7 @@ static future<> repair_cf_range(repair_info& ri,
live_neighbors_checksum.push_back(checksums[i].get0());
}
}
if (checksums[0].failed() || live_neighbors.empty()) {
if (local_checksum_failed || live_neighbors.empty()) {
return make_ready_future<>();
}
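The `local_checksum_failed` change above can be condensed into one predicate: index 0 holds the local node's checksum, the rest belong to peers; a failed peer checksum only excludes that peer, while a failed local checksum (or no live peer checksum at all) abandons the range. This is a sketch with a plain bool vector standing in for the real checksum futures.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical condensation of the repair_cf_range() early-exit logic.
// checksum_failed[0] is the local node; the rest are peers.
bool should_skip_range(const std::vector<bool>& checksum_failed) {
    bool local_checksum_failed = checksum_failed.at(0);
    std::size_t live_neighbors = 0;
    for (std::size_t i = 1; i < checksum_failed.size(); ++i) {
        if (!checksum_failed[i]) {
            ++live_neighbors;   // this peer's checksum is usable
        }
    }
    // Skip when the local checksum failed or no peer checksum survived.
    return local_checksum_failed || live_neighbors == 0;
}
```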
// If one of the available checksums is different, repair
@@ -940,8 +942,20 @@ static future<> repair_cf_range(repair_info& ri,
// Comparable to RepairSession in Origin
static future<> repair_range(repair_info& ri, const dht::token_range& range) {
auto id = utils::UUID_gen::get_time_UUID();
return do_with(get_neighbors(ri.db.local(), ri.keyspace, range, ri.data_centers, ri.hosts), [&ri, range, id] (const auto& neighbors) {
rlogger.debug("[repair #{}] new session: will sync {} on range {} for {}.{}", id, neighbors, range, ri.keyspace, ri.cfs);
return do_with(get_neighbors(ri.db.local(), ri.keyspace, range, ri.data_centers, ri.hosts), [&ri, range, id] (std::vector<gms::inet_address>& neighbors) {
auto live_neighbors = boost::copy_range<std::vector<gms::inet_address>>(neighbors |
boost::adaptors::filtered([] (const gms::inet_address& node) { return gms::get_local_gossiper().is_alive(node); }));
if (live_neighbors.size() != neighbors.size()) {
ri.nr_failed_ranges++;
auto status = live_neighbors.empty() ? "skipped" : "partial";
rlogger.warn("Repair {} out of {} ranges, id={}, shard={}, keyspace={}, table={}, range={}, peers={}, live_peers={}, status={}",
ri.ranges_index, ri.ranges.size(), ri.id, ri.shard, ri.keyspace, ri.cfs, range, neighbors, live_neighbors, status);
if (live_neighbors.empty()) {
return make_ready_future<>();
}
neighbors.swap(live_neighbors);
}
return ::service::get_local_migration_manager().sync_schema(ri.db.local(), neighbors).then([&neighbors, &ri, range, id] {
return do_for_each(ri.cfs.begin(), ri.cfs.end(), [&ri, &neighbors, range] (auto&& cf) {
ri._sub_ranges_nr++;
if (ri.row_level_repair()) {
@@ -950,6 +964,7 @@ static future<> repair_range(repair_info& ri, const dht::token_range& range) {
return repair_cf_range(ri, cf, range, neighbors);
}
});
});
});
}
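The liveness filtering that `repair_range()` now performs up front can be sketched in isolation: dead peers are dropped, repair proceeds with the survivors, and the range is reported as "partial" (some peers dead) or "skipped" (all peers dead). The alive-set here stands in for `gossiper::is_alive()`, which this sketch does not have access to.

```cpp
#include <set>
#include <string>
#include <vector>

// Sketch of the live-neighbor filtering added to repair_range().
struct filter_result {
    std::vector<std::string> live;
    std::string status;  // "ok", "partial" or "skipped"
};

filter_result filter_live_neighbors(const std::vector<std::string>& neighbors,
                                    const std::set<std::string>& alive) {
    filter_result r;
    for (const auto& n : neighbors) {
        if (alive.count(n)) {
            r.live.push_back(n);  // keep only peers the gossiper sees as up
        }
    }
    r.status = r.live.size() == neighbors.size() ? "ok"
             : r.live.empty() ? "skipped" : "partial";
    return r;
}
```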


@@ -295,6 +295,7 @@ public:
void push_mutation_fragment(frozen_mutation_fragment mf) { _mfs.push_back(std::move(mf)); }
};
using repair_row_on_wire = partition_key_and_mutation_fragments;
using repair_rows_on_wire = std::list<partition_key_and_mutation_fragments>;
enum class row_level_diff_detect_algorithm : uint8_t {


@@ -152,8 +152,8 @@ class fragment_hasher {
xx_hasher& _hasher;
private:
void consume_cell(const column_definition& col, const atomic_cell_or_collection& cell) {
feed_hash(_hasher, col.name());
feed_hash(_hasher, col.type->name());
feed_hash(_hasher, col.kind);
feed_hash(_hasher, col.id);
feed_hash(_hasher, cell, col);
}
public:
@@ -220,43 +220,62 @@ private:
};
class repair_row {
frozen_mutation_fragment _fm;
std::optional<frozen_mutation_fragment> _fm;
lw_shared_ptr<const decorated_key_with_hash> _dk_with_hash;
repair_sync_boundary _boundary;
repair_hash _hash;
std::optional<repair_sync_boundary> _boundary;
std::optional<repair_hash> _hash;
lw_shared_ptr<mutation_fragment> _mf;
public:
repair_row() = delete;
repair_row(frozen_mutation_fragment fm,
position_in_partition pos,
repair_row(std::optional<frozen_mutation_fragment> fm,
std::optional<position_in_partition> pos,
lw_shared_ptr<const decorated_key_with_hash> dk_with_hash,
repair_hash hash,
std::optional<repair_hash> hash,
lw_shared_ptr<mutation_fragment> mf = {})
: _fm(std::move(fm))
, _dk_with_hash(std::move(dk_with_hash))
, _boundary({_dk_with_hash->dk, std::move(pos)})
, _boundary(pos ? std::optional<repair_sync_boundary>(repair_sync_boundary{_dk_with_hash->dk, std::move(*pos)}) : std::nullopt)
, _hash(std::move(hash))
, _mf(std::move(mf)) {
}
mutation_fragment& get_mutation_fragment() {
if (!_mf) {
throw std::runtime_error("get empty mutation_fragment");
throw std::runtime_error("empty mutation_fragment");
}
return *_mf;
}
frozen_mutation_fragment& get_frozen_mutation() { return _fm; }
const frozen_mutation_fragment& get_frozen_mutation() const { return _fm; }
frozen_mutation_fragment& get_frozen_mutation() {
if (!_fm) {
throw std::runtime_error("empty frozen_mutation_fragment");
}
return *_fm;
}
const frozen_mutation_fragment& get_frozen_mutation() const {
if (!_fm) {
throw std::runtime_error("empty frozen_mutation_fragment");
}
return *_fm;
}
const lw_shared_ptr<const decorated_key_with_hash>& get_dk_with_hash() const {
return _dk_with_hash;
}
size_t size() const {
return _fm.representation().size();
if (!_fm) {
throw std::runtime_error("empty size due to empty frozen_mutation_fragment");
}
return _fm->representation().size();
}
const repair_sync_boundary& boundary() const {
return _boundary;
if (!_boundary) {
throw std::runtime_error("empty repair_sync_boundary");
}
return *_boundary;
}
const repair_hash& hash() const {
return _hash;
if (!_hash) {
throw std::runtime_error("empty hash");
}
return *_hash;
}
};
@@ -284,13 +303,14 @@ public:
repair_reader(
seastar::sharded<database>& db,
column_family& cf,
schema_ptr s,
dht::token_range range,
dht::i_partitioner& local_partitioner,
dht::i_partitioner& remote_partitioner,
unsigned remote_shard,
uint64_t seed,
is_local_reader local_reader)
: _schema(cf.schema())
: _schema(s)
, _range(dht::to_partition_range(range))
, _sharder(remote_partitioner, range, remote_shard)
, _seed(seed)
@@ -351,6 +371,10 @@ class repair_writer {
std::vector<std::optional<seastar::queue<mutation_fragment_opt>>> _mq;
// Current partition written to disk
std::vector<lw_shared_ptr<const decorated_key_with_hash>> _current_dk_written_to_sstable;
// Whether the current partition is still open. A partition is opened
// when a partition_start is written and closed when a partition_end
// is written.
std::vector<bool> _partition_opened;
public:
repair_writer(
schema_ptr schema,
@@ -365,10 +389,13 @@ public:
future<> write_start_and_mf(lw_shared_ptr<const decorated_key_with_hash> dk, mutation_fragment mf, unsigned node_idx) {
_current_dk_written_to_sstable[node_idx] = dk;
if (mf.is_partition_start()) {
return _mq[node_idx]->push_eventually(mutation_fragment_opt(std::move(mf)));
return _mq[node_idx]->push_eventually(mutation_fragment_opt(std::move(mf))).then([this, node_idx] {
_partition_opened[node_idx] = true;
});
} else {
auto start = mutation_fragment(partition_start(dk->dk, tombstone()));
return _mq[node_idx]->push_eventually(mutation_fragment_opt(std::move(start))).then([this, node_idx, mf = std::move(mf)] () mutable {
_partition_opened[node_idx] = true;
return _mq[node_idx]->push_eventually(mutation_fragment_opt(std::move(mf)));
});
}
@@ -378,6 +405,7 @@ public:
_writer_done.resize(_nr_peer_nodes);
_mq.resize(_nr_peer_nodes);
_current_dk_written_to_sstable.resize(_nr_peer_nodes);
_partition_opened.resize(_nr_peer_nodes, false);
}
void create_writer(unsigned node_idx) {
@@ -414,12 +442,21 @@ public:
t.stream_in_progress());
}
future<> write_partition_end(unsigned node_idx) {
if (_partition_opened[node_idx]) {
return _mq[node_idx]->push_eventually(mutation_fragment(partition_end())).then([this, node_idx] {
_partition_opened[node_idx] = false;
});
}
return make_ready_future<>();
}
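The `_partition_opened` bookkeeping above is the fix for the duplicated partition_end described in the commit message: partition_end is emitted only when a matching partition_start was written, so tearing down mid-partition can never produce an unpaired fragment. A minimal model, with strings standing in for the real mutation fragment types:

```cpp
#include <string>
#include <vector>

// Minimal model of repair_writer's paired-write tracking. Fragment names
// are illustrative, not the real mutation_fragment kinds.
class writer_model {
    std::vector<std::string> _out;
    bool _partition_opened = false;
public:
    void write_partition_start() {
        _out.push_back("partition_start");
        _partition_opened = true;
    }
    void write_row() { _out.push_back("clustering_row"); }
    // Idempotent: writes partition_end only while a partition is open,
    // so the teardown path cannot append a duplicate.
    void write_partition_end() {
        if (_partition_opened) {
            _out.push_back("partition_end");
            _partition_opened = false;
        }
    }
    const std::vector<std::string>& output() const { return _out; }
};
```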
future<> do_write(unsigned node_idx, lw_shared_ptr<const decorated_key_with_hash> dk, mutation_fragment mf) {
if (_current_dk_written_to_sstable[node_idx]) {
if (_current_dk_written_to_sstable[node_idx]->dk.equal(*_schema, dk->dk)) {
return _mq[node_idx]->push_eventually(mutation_fragment_opt(std::move(mf)));
} else {
return _mq[node_idx]->push_eventually(mutation_fragment(partition_end())).then([this,
return write_partition_end(node_idx).then([this,
node_idx, dk = std::move(dk), mf = std::move(mf)] () mutable {
return write_start_and_mf(std::move(dk), std::move(mf), node_idx);
});
@@ -433,7 +470,7 @@ public:
return parallel_for_each(boost::irange(unsigned(0), unsigned(_nr_peer_nodes)), [this] (unsigned node_idx) {
if (_writer_done[node_idx] && _mq[node_idx]) {
// Partition_end is never sent on wire, so we have to write one ourselves.
return _mq[node_idx]->push_eventually(mutation_fragment(partition_end())).then([this, node_idx] () mutable {
return write_partition_end(node_idx).then([this, node_idx] () mutable {
// Empty mutation_fragment_opt means no more data, so the writer can seal the sstables.
return _mq[node_idx]->push_eventually(mutation_fragment_opt()).then([this, node_idx] () mutable {
return (*_writer_done[node_idx]).then([] (uint64_t partitions) {
@@ -458,8 +495,8 @@ public:
private:
seastar::sharded<database>& _db;
column_family& _cf;
dht::token_range _range;
schema_ptr _schema;
dht::token_range _range;
repair_sync_boundary::tri_compare _cmp;
// The algorithm used to find the row difference
row_level_diff_detect_algorithm _algo;
@@ -519,6 +556,7 @@ public:
repair_meta(
seastar::sharded<database>& db,
column_family& cf,
schema_ptr s,
dht::token_range range,
row_level_diff_detect_algorithm algo,
size_t max_row_buf_size,
@@ -529,8 +567,8 @@ public:
size_t nr_peer_nodes = 1)
: _db(db)
, _cf(cf)
, _schema(s)
, _range(range)
, _schema(cf.schema())
, _cmp(repair_sync_boundary::tri_compare(*_schema))
, _algo(algo)
, _max_row_buf_size(max_row_buf_size)
@@ -545,6 +583,7 @@ public:
, _repair_reader(
_db,
_cf,
_schema,
_range,
dht::global_partitioner(),
*_remote_partitioner,
@@ -577,35 +616,45 @@ public:
}
}
static void
static future<>
insert_repair_meta(const gms::inet_address& from,
uint32_t src_cpu_id,
uint32_t repair_meta_id,
sstring ks_name,
sstring cf_name,
dht::token_range range,
row_level_diff_detect_algorithm algo,
uint64_t max_row_buf_size,
uint64_t seed,
shard_config master_node_shard_config) {
node_repair_meta_id id{from, repair_meta_id};
auto& db = service::get_local_storage_proxy().get_db();
auto& cf = db.local().find_column_family(ks_name, cf_name);
auto rm = make_lw_shared<repair_meta>(db,
cf,
shard_config master_node_shard_config,
table_schema_version schema_version) {
return service::get_schema_for_write(schema_version, {from, src_cpu_id}).then([from,
repair_meta_id,
range,
algo,
max_row_buf_size,
seed,
repair_meta::repair_master::no,
repair_meta_id,
std::move(master_node_shard_config));
bool insertion = repair_meta_map().emplace(id, rm).second;
if (!insertion) {
rlogger.warn("insert_repair_meta: repair_meta_id {} for node {} already exists, replace existing one", id.repair_meta_id, id.ip);
repair_meta_map()[id] = rm;
} else {
rlogger.debug("insert_repair_meta: Inserted repair_meta_id {} for node {}", id.repair_meta_id, id.ip);
}
master_node_shard_config,
schema_version] (schema_ptr s) {
auto& db = service::get_local_storage_proxy().get_db();
auto& cf = db.local().find_column_family(s->id());
node_repair_meta_id id{from, repair_meta_id};
auto rm = make_lw_shared<repair_meta>(db,
cf,
s,
range,
algo,
max_row_buf_size,
seed,
repair_meta::repair_master::no,
repair_meta_id,
std::move(master_node_shard_config));
bool insertion = repair_meta_map().emplace(id, rm).second;
if (!insertion) {
rlogger.warn("insert_repair_meta: repair_meta_id {} for node {} already exists, replace existing one", id.repair_meta_id, id.ip);
repair_meta_map()[id] = rm;
} else {
rlogger.debug("insert_repair_meta: Inserted repair_meta_id {} for node {}", id.repair_meta_id, id.ip);
}
});
}
static future<>
@@ -642,7 +691,11 @@ public:
}
}
return parallel_for_each(*repair_metas, [repair_metas] (auto& rm) {
return rm->stop();
return rm->stop().then([&rm] {
rm = {};
});
}).then([repair_metas, from] {
rlogger.debug("Removed all repair_meta for single node {}", from);
});
}
@@ -654,7 +707,11 @@ public:
| boost::adaptors::map_values));
repair_meta_map().clear();
return parallel_for_each(*repair_metas, [repair_metas] (auto& rm) {
return rm->stop();
return rm->stop().then([&rm] {
rm = {};
});
}).then([repair_metas] {
rlogger.debug("Removed all repair_meta for all nodes");
});
}
@@ -952,12 +1009,12 @@ private:
}
return to_repair_rows_list(rows).then([this, from, node_idx, update_buf, update_hash_set] (std::list<repair_row> row_diff) {
return do_with(std::move(row_diff), [this, from, node_idx, update_buf, update_hash_set] (std::list<repair_row>& row_diff) {
auto sz = get_repair_rows_size(row_diff);
stats().rx_row_bytes += sz;
stats().rx_row_nr += row_diff.size();
stats().rx_row_nr_peer[from] += row_diff.size();
_metrics.rx_row_nr += row_diff.size();
_metrics.rx_row_bytes += sz;
if (_repair_master) {
auto sz = get_repair_rows_size(row_diff);
stats().rx_row_bytes += sz;
stats().rx_row_nr += row_diff.size();
stats().rx_row_nr_peer[from] += row_diff.size();
}
if (update_buf) {
std::list<repair_row> tmp;
tmp.swap(_working_row_buf);
@@ -993,11 +1050,16 @@ private:
return do_with(repair_rows_on_wire(), std::move(row_list), [this] (repair_rows_on_wire& rows, std::list<repair_row>& row_list) {
return do_for_each(row_list, [this, &rows] (repair_row& r) {
auto pk = r.get_dk_with_hash()->dk.key();
auto it = std::find_if(rows.begin(), rows.end(), [&pk, s=_schema] (partition_key_and_mutation_fragments& row) { return pk.legacy_equal(*s, row.get_key()); });
if (it == rows.end()) {
rows.push_back(partition_key_and_mutation_fragments(std::move(pk), {std::move(r.get_frozen_mutation())}));
// No need to search from the beginning of the rows. Looking at the end of repair_rows_on_wire is enough.
if (rows.empty()) {
rows.push_back(repair_row_on_wire(std::move(pk), {std::move(r.get_frozen_mutation())}));
} else {
it->push_mutation_fragment(std::move(r.get_frozen_mutation()));
auto& row = rows.back();
if (pk.legacy_equal(*_schema, row.get_key())) {
row.push_mutation_fragment(std::move(r.get_frozen_mutation()));
} else {
rows.push_back(repair_row_on_wire(std::move(pk), {std::move(r.get_frozen_mutation())}));
}
}
}).then([&rows] {
return std::move(rows);
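The tail-only lookup above relies on the repair rows arriving ordered by partition key: instead of scanning the whole list for a matching partition per row, it is enough to compare against the last entry and either append the fragment there or open a new partition. A sketch with types simplified to ints and strings:

```cpp
#include <list>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for repair_rows_on_wire: key -> fragments.
using partition = std::pair<int, std::vector<std::string>>;

void add_row(std::list<partition>& rows, int pk, std::string fragment) {
    if (!rows.empty() && rows.back().first == pk) {
        // Same partition as the tail: append the fragment there.
        rows.back().second.push_back(std::move(fragment));
    } else {
        // New partition key: start a new entry at the tail.
        rows.push_back({pk, {std::move(fragment)}});
    }
}
```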
@@ -1006,23 +1068,47 @@ private:
};
future<std::list<repair_row>> to_repair_rows_list(repair_rows_on_wire rows) {
return do_with(std::move(rows), std::list<repair_row>(), lw_shared_ptr<const decorated_key_with_hash>(),
[this] (repair_rows_on_wire& rows, std::list<repair_row>& row_list, lw_shared_ptr<const decorated_key_with_hash>& dk_ptr) mutable {
return do_for_each(rows, [this, &dk_ptr, &row_list] (partition_key_and_mutation_fragments& x) mutable {
return do_with(std::move(rows), std::list<repair_row>(), lw_shared_ptr<const decorated_key_with_hash>(), lw_shared_ptr<mutation_fragment>(), position_in_partition::tri_compare(*_schema),
[this] (repair_rows_on_wire& rows, std::list<repair_row>& row_list, lw_shared_ptr<const decorated_key_with_hash>& dk_ptr, lw_shared_ptr<mutation_fragment>& last_mf, position_in_partition::tri_compare& cmp) mutable {
return do_for_each(rows, [this, &dk_ptr, &row_list, &last_mf, &cmp] (partition_key_and_mutation_fragments& x) mutable {
dht::decorated_key dk = dht::global_partitioner().decorate_key(*_schema, x.get_key());
if (!(dk_ptr && dk_ptr->dk.equal(*_schema, dk))) {
dk_ptr = make_lw_shared<const decorated_key_with_hash>(*_schema, dk, _seed);
}
return do_for_each(x.get_mutation_fragments(), [this, &dk_ptr, &row_list] (frozen_mutation_fragment& fmf) mutable {
// Keep the mutation_fragment in repair_row as an
// optimization to avoid unfreezing again when the
// mutation_fragment is needed by _repair_writer.do_write()
// to apply the repair_row to disk
auto mf = make_lw_shared<mutation_fragment>(fmf.unfreeze(*_schema));
auto hash = do_hash_for_mf(*dk_ptr, *mf);
position_in_partition pos(mf->position());
row_list.push_back(repair_row(std::move(fmf), std::move(pos), dk_ptr, std::move(hash), std::move(mf)));
});
if (_repair_master) {
return do_for_each(x.get_mutation_fragments(), [this, &dk_ptr, &row_list] (frozen_mutation_fragment& fmf) mutable {
_metrics.rx_row_nr += 1;
_metrics.rx_row_bytes += fmf.representation().size();
// Keep the mutation_fragment in repair_row as an
// optimization to avoid unfreezing again when the
// mutation_fragment is needed by _repair_writer.do_write()
// to apply the repair_row to disk
auto mf = make_lw_shared<mutation_fragment>(fmf.unfreeze(*_schema));
auto hash = do_hash_for_mf(*dk_ptr, *mf);
position_in_partition pos(mf->position());
row_list.push_back(repair_row(std::move(fmf), std::move(pos), dk_ptr, std::move(hash), std::move(mf)));
});
} else {
last_mf = {};
return do_for_each(x.get_mutation_fragments(), [this, &dk_ptr, &row_list, &last_mf, &cmp] (frozen_mutation_fragment& fmf) mutable {
_metrics.rx_row_nr += 1;
_metrics.rx_row_bytes += fmf.representation().size();
auto mf = make_lw_shared<mutation_fragment>(fmf.unfreeze(*_schema));
position_in_partition pos(mf->position());
// If the mutation_fragment has the same position as
// the last mutation_fragment, it means they are the
// same row with different contents. We cannot feed
// such rows into the sstable writer. Instead we apply
// the mutation_fragment to the previous one.
if (last_mf && cmp(last_mf->position(), pos) == 0 && last_mf->mergeable_with(*mf)) {
last_mf->apply(*_schema, std::move(*mf));
} else {
last_mf = mf;
// On repair follower node, only decorated_key_with_hash and the mutation_fragment inside repair_row are used.
row_list.push_back(repair_row({}, {}, dk_ptr, {}, std::move(mf)));
}
});
}
}).then([&row_list] {
return std::move(row_list);
});
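The follower-side deduplication in `to_repair_rows_list()` above merges consecutive fragments that land on the same position into the previous one, since the sstable writer cannot accept two rows at one position. A sketch where a "fragment" is just a (position, content) pair and merging is content concatenation, purely for illustration:

```cpp
#include <string>
#include <utility>
#include <vector>

using fragment = std::pair<int, std::string>;  // (position, content)

// Merge consecutive fragments that share a position, keeping the stream
// strictly increasing in position, as the sstable writer requires.
std::vector<fragment> merge_same_position(const std::vector<fragment>& in) {
    std::vector<fragment> out;
    for (const auto& f : in) {
        if (!out.empty() && out.back().first == f.first) {
            out.back().second += f.second;  // apply onto the previous one
        } else {
            out.push_back(f);
        }
    }
    return out;
}
```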
@@ -1084,29 +1170,28 @@ public:
// RPC API
future<>
repair_row_level_start(gms::inet_address remote_node, sstring ks_name, sstring cf_name, dht::token_range range) {
repair_row_level_start(gms::inet_address remote_node, sstring ks_name, sstring cf_name, dht::token_range range, table_schema_version schema_version) {
if (remote_node == _myip) {
return make_ready_future<>();
}
stats().rpc_call_nr++;
return netw::get_local_messaging_service().send_repair_row_level_start(msg_addr(remote_node),
_repair_meta_id, std::move(ks_name), std::move(cf_name), std::move(range), _algo, _max_row_buf_size, _seed,
_master_node_shard_config.shard, _master_node_shard_config.shard_count, _master_node_shard_config.ignore_msb, _master_node_shard_config.partitioner_name);
_master_node_shard_config.shard, _master_node_shard_config.shard_count, _master_node_shard_config.ignore_msb, _master_node_shard_config.partitioner_name, std::move(schema_version));
}
// RPC handler
static future<>
repair_row_level_start_handler(gms::inet_address from, uint32_t repair_meta_id, sstring ks_name, sstring cf_name,
repair_row_level_start_handler(gms::inet_address from, uint32_t src_cpu_id, uint32_t repair_meta_id, sstring ks_name, sstring cf_name,
dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size,
uint64_t seed, shard_config master_node_shard_config) {
uint64_t seed, shard_config master_node_shard_config, table_schema_version schema_version) {
if (!_sys_dist_ks->local_is_initialized() || !_view_update_generator->local_is_initialized()) {
return make_exception_future<>(std::runtime_error(format("Node {} is not fully initialized for repair, try again later",
utils::fb_utilities::get_broadcast_address())));
}
rlogger.debug(">>> Started Row Level Repair (Follower): local={}, peers={}, repair_meta_id={}, keyspace={}, cf={}, range={}",
utils::fb_utilities::get_broadcast_address(), from, repair_meta_id, ks_name, cf_name, range);
insert_repair_meta(from, repair_meta_id, std::move(ks_name), std::move(cf_name), std::move(range), algo, max_row_buf_size, seed, std::move(master_node_shard_config));
return make_ready_future<>();
rlogger.debug(">>> Started Row Level Repair (Follower): local={}, peers={}, repair_meta_id={}, keyspace={}, cf={}, schema_version={}, range={}",
utils::fb_utilities::get_broadcast_address(), from, repair_meta_id, ks_name, cf_name, schema_version, range);
return insert_repair_meta(from, src_cpu_id, repair_meta_id, std::move(range), algo, max_row_buf_size, seed, std::move(master_node_shard_config), std::move(schema_version));
}
// RPC API
@@ -1313,14 +1398,15 @@ future<> repair_init_messaging_service_handler(repair_service& rs, distributed<d
});
ms.register_repair_row_level_start([] (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring ks_name,
sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed,
unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name) {
unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version) {
auto src_cpu_id = cinfo.retrieve_auxiliary<uint32_t>("src_cpu_id");
auto from = cinfo.retrieve_auxiliary<gms::inet_address>("baddr");
return smp::submit_to(src_cpu_id % smp::count, [from, repair_meta_id, ks_name, cf_name,
range, algo, max_row_buf_size, seed, remote_shard, remote_shard_count, remote_ignore_msb, remote_partitioner_name] () mutable {
return repair_meta::repair_row_level_start_handler(from, repair_meta_id, std::move(ks_name),
return smp::submit_to(src_cpu_id % smp::count, [from, src_cpu_id, repair_meta_id, ks_name, cf_name,
range, algo, max_row_buf_size, seed, remote_shard, remote_shard_count, remote_ignore_msb, remote_partitioner_name, schema_version] () mutable {
return repair_meta::repair_row_level_start_handler(from, src_cpu_id, repair_meta_id, std::move(ks_name),
std::move(cf_name), std::move(range), algo, max_row_buf_size, seed,
shard_config{remote_shard, remote_shard_count, remote_ignore_msb, std::move(remote_partitioner_name)});
shard_config{remote_shard, remote_shard_count, remote_ignore_msb, std::move(remote_partitioner_name)},
schema_version);
});
});
ms.register_repair_row_level_stop([] (const rpc::client_info& cinfo, uint32_t repair_meta_id,
@@ -1608,8 +1694,12 @@ public:
dht::global_partitioner().sharding_ignore_msb(),
dht::global_partitioner().name()
};
auto s = _cf.schema();
auto schema_version = s->version();
repair_meta master(_ri.db,
_cf,
s,
_range,
algorithm,
_max_row_buf_size,
@@ -1622,12 +1712,13 @@ public:
// All nodes including the node itself.
_all_nodes.insert(_all_nodes.begin(), master.myip());
rlogger.debug(">>> Started Row Level Repair (Master): local={}, peers={}, repair_meta_id={}, keyspace={}, cf={}, range={}, seed={}",
master.myip(), _all_live_peer_nodes, master.repair_meta_id(), _ri.keyspace, _cf_name, _range, _seed);
rlogger.debug(">>> Started Row Level Repair (Master): local={}, peers={}, repair_meta_id={}, keyspace={}, cf={}, schema_version={}, range={}, seed={}",
master.myip(), _all_live_peer_nodes, master.repair_meta_id(), _ri.keyspace, _cf_name, schema_version, _range, _seed);
try {
parallel_for_each(_all_nodes, [&, this] (const gms::inet_address& node) {
return master.repair_row_level_start(node, _ri.keyspace, _cf_name, _range).then([&] () {
return master.repair_row_level_start(node, _ri.keyspace, _cf_name, _range, schema_version).then([&] () {
return master.repair_get_estimated_partitions(node).then([this, node] (uint64_t partitions) {
rlogger.trace("Get repair_get_estimated_partitions for node={}, estimated_partitions={}", node, partitions);
_estimated_partitions += partitions;
@@ -1677,19 +1768,7 @@ public:
future<> repair_cf_range_row_level(repair_info& ri,
sstring cf_name, dht::token_range range,
const std::vector<gms::inet_address>& all_peer_nodes) {
auto all_live_peer_nodes = boost::copy_range<std::vector<gms::inet_address>>(all_peer_nodes |
boost::adaptors::filtered([] (const gms::inet_address& node) { return gms::get_local_gossiper().is_alive(node); }));
if (all_live_peer_nodes.size() != all_peer_nodes.size()) {
rlogger.warn("Repair for range={} is partial, peer nodes={}, live peer nodes={}",
range, all_peer_nodes, all_live_peer_nodes);
ri.nr_failed_ranges++;
}
if (all_live_peer_nodes.empty()) {
rlogger.info(">>> Skipped Row Level Repair (Master): local={}, peers={}, keyspace={}, cf={}, range={}",
utils::fb_utilities::get_broadcast_address(), all_peer_nodes, ri.keyspace, cf_name, range);
return make_ready_future<>();
}
return do_with(row_level_repair(ri, std::move(cf_name), std::move(range), std::move(all_live_peer_nodes)), [] (row_level_repair& repair) {
return do_with(row_level_repair(ri, std::move(cf_name), std::move(range), all_peer_nodes), [] (row_level_repair& repair) {
return repair.run();
});
}
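The `repair_cf_range_row_level` logic above — filter the peer list down to live nodes, count the range as partially failed if any peer is down, and skip the repair entirely when no peer is reachable — can be sketched in Python. This is an illustration only; the names (`select_live_peers`, the `ri` dict, `is_alive`) are hypothetical stand-ins for Scylla's C++ internals, not its API.

```python
# Illustrative sketch (not Scylla code) of the live-peer filtering in
# repair_cf_range_row_level: keep only live peers, flag a partial repair
# when some peers are down, and signal a skip when none are reachable.

def select_live_peers(all_peers, is_alive, ri, log=print):
    live = [p for p in all_peers if is_alive(p)]
    if len(live) != len(all_peers):
        # Some peers are down: the repair of this range is only partial.
        log("Repair is partial, peers={}, live peers={}".format(all_peers, live))
        ri["nr_failed_ranges"] += 1
    if not live:
        # No live peers: the caller returns a ready future, i.e. does nothing.
        log("Skipped row-level repair: no live peers")
        return None
    return live
```

In the C++ code the same decision is expressed as returning `make_ready_future<>()` early; here `None` plays that role.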


@@ -31,7 +31,6 @@
#include <boost/version.hpp>
#include <sys/sdt.h>
#include "read_context.hh"
#include "schema_upgrader.hh"
#include "dirty_memory_manager.hh"
#include "cache_flat_mutation_reader.hh"
#include "real_dirty_memory_accounter.hh"
@@ -349,9 +348,7 @@ future<> read_context::create_underlying(bool skip_first_fragment, db::timeout_c
static flat_mutation_reader read_directly_from_underlying(read_context& reader) {
flat_mutation_reader res = make_delegating_reader(reader.underlying().underlying());
if (reader.schema()->version() != reader.underlying().underlying().schema()->version()) {
res = transform(std::move(res), schema_upgrader(reader.schema()));
}
res.upgrade_schema(reader.schema());
return make_nonforwardable(std::move(res), true);
}
@@ -928,7 +925,6 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
});
return seastar::async([this, &m, updater = std::move(updater), real_dirty_acc = std::move(real_dirty_acc)] () mutable {
coroutine update;
size_t size_entry;
// In case updater fails, we must bring the cache to consistency without deferring.
auto cleanup = defer([&m, this] {
@@ -936,6 +932,7 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
_prev_snapshot_pos = {};
_prev_snapshot = {};
});
coroutine update; // Destroy before cleanup to release snapshots before invalidating.
partition_presence_checker is_present = _prev_snapshot->make_partition_presence_checker();
while (!m.partitions.empty()) {
with_allocator(_tracker.allocator(), [&] () {
@@ -1007,8 +1004,10 @@ future<> row_cache::update(external_updater eu, memtable& m) {
if (cache_i != partitions_end() && cache_i->key().equal(*_schema, mem_e.key())) {
cache_entry& entry = *cache_i;
upgrade_entry(entry);
assert(entry._schema == _schema);
_tracker.on_partition_merge();
return entry.partition().apply_to_incomplete(*_schema, std::move(mem_e.partition()), *mem_e.schema(), _tracker.memtable_cleaner(),
mem_e.upgrade_schema(_schema, _tracker.memtable_cleaner());
return entry.partition().apply_to_incomplete(*_schema, std::move(mem_e.partition()), _tracker.memtable_cleaner(),
alloc, _tracker.region(), _tracker, _underlying_phase, acc);
} else if (cache_i->continuous()
|| with_allocator(standard_allocator(), [&] { return is_present(mem_e.key()); })
@@ -1020,7 +1019,8 @@ future<> row_cache::update(external_updater eu, memtable& m) {
entry->set_continuous(cache_i->continuous());
_tracker.insert(*entry);
_partitions.insert_before(cache_i, *entry);
return entry->partition().apply_to_incomplete(*_schema, std::move(mem_e.partition()), *mem_e.schema(), _tracker.memtable_cleaner(),
mem_e.upgrade_schema(_schema, _tracker.memtable_cleaner());
return entry->partition().apply_to_incomplete(*_schema, std::move(mem_e.partition()), _tracker.memtable_cleaner(),
alloc, _tracker.region(), _tracker, _underlying_phase, acc);
} else {
return make_empty_coroutine();
@@ -1117,8 +1117,8 @@ future<> row_cache::invalidate(external_updater eu, dht::partition_range_vector&
});
}
void row_cache::evict(const dht::partition_range& range) {
invalidate_unwrapped(range);
void row_cache::evict() {
while (_tracker.region().evict_some() == memory::reclaiming_result::reclaimed_something) {}
}
void row_cache::invalidate_unwrapped(const dht::partition_range& range) {
@@ -1205,8 +1205,11 @@ void rows_entry::on_evicted(cache_tracker& tracker) noexcept {
partition_version& pv = partition_version::container_of(mutation_partition::container_of(
mutation_partition::rows_type::container_of_only_member(*it)));
if (pv.is_referenced_from_entry()) {
cache_entry& ce = cache_entry::container_of(partition_entry::container_of(pv));
ce.on_evicted(tracker);
partition_entry& pe = partition_entry::container_of(pv);
if (!pe.is_locked()) {
cache_entry& ce = cache_entry::container_of(pe);
ce.on_evicted(tracker);
}
}
}
}
@@ -1227,9 +1230,8 @@ flat_mutation_reader cache_entry::do_read(row_cache& rc, read_context& reader) {
auto snp = _pe.read(rc._tracker.region(), rc._tracker.cleaner(), _schema, &rc._tracker, reader.phase());
auto ckr = query::clustering_key_filter_ranges::get_ranges(*_schema, reader.slice(), _key.key());
auto r = make_cache_flat_mutation_reader(_schema, _key, std::move(ckr), rc, reader.shared_from_this(), std::move(snp));
if (reader.schema()->version() != _schema->version()) {
r = transform(std::move(r), schema_upgrader(reader.schema()));
}
r.upgrade_schema(rc.schema());
r.upgrade_schema(reader.schema());
return r;
}
@@ -1238,7 +1240,7 @@ const schema_ptr& row_cache::schema() const {
}
void row_cache::upgrade_entry(cache_entry& e) {
if (e._schema != _schema) {
if (e._schema != _schema && !e.partition().is_locked()) {
auto& r = _tracker.region();
assert(!r.reclaiming_enabled());
with_allocator(r.allocator(), [this, &e] {


@@ -549,12 +549,12 @@ public:
future<> invalidate(external_updater, const dht::partition_range& = query::full_partition_range);
future<> invalidate(external_updater, dht::partition_range_vector&&);
// Evicts entries from given range in cache.
// Evicts entries from cache.
//
// Note that this does not synchronize with the underlying source,
// it is assumed that the underlying source didn't change.
// If it did, use invalidate() instead.
void evict(const dht::partition_range& = query::full_partition_range);
void evict();
size_t partitions() const {
return _partitions.size();


@@ -69,19 +69,30 @@ table_schema_version schema_mutations::digest() const {
}
md5_hasher h;
db::schema_tables::feed_hash_for_schema_digest(h, _columnfamilies);
db::schema_tables::feed_hash_for_schema_digest(h, _columns);
db::schema_features sf = db::schema_features::full();
// Disable this feature so that the digest remains compatible with Scylla
// versions prior to this feature.
// This digest affects the table schema version calculation and it's important
// that all nodes arrive at the same table schema version to avoid needless schema version
// pulls. Table schema versions are calculated on boot when we don't yet
// know all the cluster features, so we could get different table versions after reboot
// in an already upgraded cluster.
sf.remove<db::schema_feature::DIGEST_INSENSITIVE_TO_EXPIRY>();
db::schema_tables::feed_hash_for_schema_digest(h, _columnfamilies, sf);
db::schema_tables::feed_hash_for_schema_digest(h, _columns, sf);
if (_view_virtual_columns && !_view_virtual_columns->partition().empty()) {
db::schema_tables::feed_hash_for_schema_digest(h, *_view_virtual_columns);
db::schema_tables::feed_hash_for_schema_digest(h, *_view_virtual_columns, sf);
}
if (_indices && !_indices->partition().empty()) {
db::schema_tables::feed_hash_for_schema_digest(h, *_indices);
db::schema_tables::feed_hash_for_schema_digest(h, *_indices, sf);
}
if (_dropped_columns && !_dropped_columns->partition().empty()) {
db::schema_tables::feed_hash_for_schema_digest(h, *_dropped_columns);
db::schema_tables::feed_hash_for_schema_digest(h, *_dropped_columns, sf);
}
if (_scylla_tables) {
db::schema_tables::feed_hash_for_schema_digest(h, *_scylla_tables);
db::schema_tables::feed_hash_for_schema_digest(h, *_scylla_tables, sf);
}
return utils::UUID_gen::get_name_UUID(h.finalize());
}
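The hunk above hashes the schema mutations with a feature set from which `DIGEST_INSENSITIVE_TO_EXPIRY` has been removed, so that an upgraded node computes the same table schema version as a node that has never heard of the feature. A minimal Python sketch of that masking pattern, assuming hypothetical `feed_hash`/`schema_digest` helpers (the real serialization is far more involved):

```python
# Illustrative sketch (not Scylla code): mask a digest-changing feature
# before hashing so nodes on different versions derive the same digest.
import hashlib

DIGEST_INSENSITIVE_TO_EXPIRY = "digest_insensitive_to_expiry"

def feed_hash(h, mutation_rows, features):
    # Stand-in for feed_hash_for_schema_digest: when the feature is
    # enabled, expiry info (everything after '@' here) is excluded.
    for row in mutation_rows:
        payload = row.split("@")[0] if DIGEST_INSENSITIVE_TO_EXPIRY in features else row
        h.update(payload.encode())

def schema_digest(tables_rows, columns_rows):
    sf = {"full", DIGEST_INSENSITIVE_TO_EXPIRY}   # all known features
    # Remove the feature so the digest stays stable across versions,
    # mirroring sf.remove<db::schema_feature::DIGEST_INSENSITIVE_TO_EXPIRY>().
    sf.discard(DIGEST_INSENSITIVE_TO_EXPIRY)
    h = hashlib.md5()
    feed_hash(h, tables_rows, sf)
    feed_hash(h, columns_rows, sf)
    return h.hexdigest()
```

With the feature masked, the new node hashes exactly the bytes an old node would, which is the property the comment in the diff is after.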


@@ -263,11 +263,9 @@ global_schema_ptr::global_schema_ptr(const global_schema_ptr& o)
: global_schema_ptr(o.get())
{ }
global_schema_ptr::global_schema_ptr(global_schema_ptr&& o) {
global_schema_ptr::global_schema_ptr(global_schema_ptr&& o) noexcept {
auto current = engine().cpu_id();
if (o._cpu_of_origin != current) {
throw std::runtime_error("Attempted to move global_schema_ptr across shards");
}
assert(o._cpu_of_origin == current);
_ptr = std::move(o._ptr);
_cpu_of_origin = current;
}


@@ -173,7 +173,7 @@ public:
// The other may come from a different shard.
global_schema_ptr(const global_schema_ptr& other);
// The other must come from current shard.
global_schema_ptr(global_schema_ptr&& other);
global_schema_ptr(global_schema_ptr&& other) noexcept;
// May be invoked across shards. Always returns an engaged pointer.
schema_ptr get() const;
operator schema_ptr() const { return get(); }


@@ -231,9 +231,15 @@ ar = tarfile.open(args.output, mode='w|gz')
pathlib.Path('build/SCYLLA-RELOCATABLE-FILE').touch()
ar.add('build/SCYLLA-RELOCATABLE-FILE', arcname='SCYLLA-RELOCATABLE-FILE')
ar.add('dist/redhat/python3')
ar.add('dist/debian/python3')
ar.add('build/python3/SCYLLA-RELEASE-FILE', arcname='SCYLLA-RELEASE-FILE')
ar.add('build/python3/SCYLLA-VERSION-FILE', arcname='SCYLLA-VERSION-FILE')
ar.add('build/SCYLLA-PRODUCT-FILE', arcname='SCYLLA-PRODUCT-FILE')
for p in ['python3-libs'] + packages:
pdir = pathlib.Path('/usr/share/licenses/{}/'.format(p))
if pdir.exists():
for f in pdir.glob('*'):
ar.add(f, arcname='licenses/{}/{}'.format(p, f.name))
for f in file_list:
copy_file_to_python_env(ar, f)


@@ -61,6 +61,7 @@ args = ap.parse_args()
executables = ['build/{}/scylla'.format(args.mode),
'build/{}/iotune'.format(args.mode),
'/usr/bin/patchelf',
'/usr/bin/lscpu',
'/usr/bin/gawk',
'/usr/bin/gzip',
@@ -76,6 +77,9 @@ libs = {}
for exe in executables:
libs.update(ldd(exe))
# manually add libthread_db for debugging thread
libs.update({'libthread_db.so.1': '/lib64/libthread_db-1.0.so'})
ld_so = libs['ld.so']
have_gnutls = any([lib.startswith('libgnutls.so')
@@ -93,56 +97,9 @@ ar = tarfile.open(fileobj=gzip_process.stdin, mode='w|')
pathlib.Path('build/SCYLLA-RELOCATABLE-FILE').touch()
ar.add('build/SCYLLA-RELOCATABLE-FILE', arcname='SCYLLA-RELOCATABLE-FILE')
# This thunk is a shell script that arranges for the executable to be invoked,
# under the following conditions:
#
# - the same argument vector is passed to the executable, including argv[0]
# - the executable name (/proc/pid/comm, shown in top(1)) is the same
# - the dynamic linker is taken from this package rather than the executable's
# default (which is hardcoded to point to /lib64/ld-linux-x86_64.so or similar)
# - LD_LIBRARY_PATH points to the lib/ directory so shared library dependencies
# are satisfied from there rather than the system default (e.g. /lib64)
# To do that, the dynamic linker is invoked using a symbolic link named after the
# executable, not its standard name. We use bash's "exec -a" to set argv[0].
# The full tangled web looks like:
#
# foobar/bin/scylla a shell script invoking everything
# foobar/libexec/scylla.bin the real binary
# foobar/libexec/scylla a symlink to ../libreloc/ld.so
# foobar/libreloc/ld.so the dynamic linker
# foobar/libreloc/lib... all the other libraries
# the transformations (done by the thunk and symlinks) are:
#
# bin/scylla args -> libexec/scylla libexec/scylla.bin args -> libreloc/ld.so libexec/scylla.bin args
thunk = b'''\
#!/bin/bash
x="$(readlink -f "$0")"
b="$(basename "$x")"
d="$(dirname "$x")/.."
ldso="$d/libexec/$b"
realexe="$d/libexec/$b.bin"
export GNUTLS_SYSTEM_PRIORITY_FILE="${GNUTLS_SYSTEM_PRIORITY_FILE-$d/libreloc/gnutls.config}"
LD_LIBRARY_PATH="$d/libreloc" exec -a "$0" "$ldso" "$realexe" "$@"
'''
for exe in executables:
basename = os.path.basename(exe)
ar.add(exe, arcname='libexec/' + basename + '.bin')
ti = tarfile.TarInfo(name='bin/' + basename)
ti.size = len(thunk)
ti.mode = 0o755
ti.mtime = os.stat(exe).st_mtime
ar.addfile(ti, fileobj=io.BytesIO(thunk))
ti = tarfile.TarInfo(name='libexec/' + basename)
ti.type = tarfile.SYMTYPE
ti.linkname = '../libreloc/ld.so'
ti.mtime = os.stat(exe).st_mtime
ar.addfile(ti)
ar.add(exe, arcname='libexec/' + basename)
for lib, libfile in libs.items():
ar.add(libfile, arcname='libreloc/' + lib)
if have_gnutls:

Some files were not shown because too many files have changed in this diff.