Commit Graph

7595 Commits

Author SHA1 Message Date
Glauber Costa
3c988e8240 perf_sstable: use current scylla default directory
When this tool was written, we were still using /var/lib/cassandra as a default
location. We should update it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2015-12-09 17:46:31 +02:00
Avi Kivity
01c3670def Merge seastar upstream
* seastar 5dc22fa...c5e595b (3):
  > memory: be less strict about NUMA bindings
  > reactor: let the resource code specify the default memory reserve
  > resource: reserve even more memory when hwloc is compiled in

Fixes #642
2015-12-09 16:47:47 +02:00
Asias He
66938ac129 streaming: Add retransmit logic for streaming verbs
Retransmit streaming related verbs and give up in 5 minutes.

Tested with:

  lein test :only cassandra.batch-test/batch-halves-decommission

Fixes #568.
2015-12-09 15:12:36 +02:00
Avi Kivity
14794af260 Merge seastar upstream
* seastar 9f9182e...5dc22fa (1):
  > future: add repeat_until_value(): repeat an action until it returns a value
2015-12-09 15:11:59 +02:00
Avi Kivity
213700e42f Merge seastar upstream
* seastar d40453b...9f9182e (5):
  > Merge "Sleep mode support"
  > future: add futurize<T>::from_tuple(tuple<T>)
  > tls: Add missing destructor for dh_params::impl, fixes ASAN error
  > tls/socket fix: Add missing noexcept to constructor/move
  > Merge "Initial SSL/TLS socket support" from Calle
2015-12-09 11:01:13 +02:00
Avi Kivity
204610ac61 Merge "Make LSA more large-allocation-friendly" from Paweł
"This series attempts to make LSA more friendly for large (i.e. bigger
than LSA segment) allocations. It is achieved by introducing segment
zones – large, contiguous areas of segments and using them to allocate
segments instead of calling malloc() directly.
Zones can be shrunk when needed to reclaim memory and segments can be
migrated either to reduce number of zone or to defragment one in order
to be able to shrink it. LSA tries to keep all segments at the lower
addresses and reclaims memory starting from the zones in the highest
parts of the address space."
2015-12-09 10:49:23 +02:00
Avi Kivity
883074e936 Merge "Fix replace_node support" from Asias
Also:

[PATCH scylla v1 0/7] gossip mark node down fix + cleanup
[PATCH scylla v1 0/2] Refuse decommissioned node to rejoin
[PATCH scylla] storage_service: Fix added node not showing up in nodetool in status joining
2015-12-09 10:42:52 +02:00
Paweł Dziepak
8ba66bb75d managed_bytes: fix copy size in move constructor
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-09 10:38:28 +02:00
Asias He
b63d49c773 storage_service: Log removing replaced endpoint from system.peers
This info is important when replacing a node. Useful for debugging.
2015-12-09 12:30:52 +08:00
Asias He
d26c7e671d storage_service: Enable commented out code in handle_state_normal
Add current_owner to endpoints_to_remove if endpoint and current_owner
have the same token and endpoint is newer than current_owner.
2015-12-09 12:30:52 +08:00
Asias He
3793bb7be1 token_metadata: Add get_endpoint_to_token_map_for_reading 2015-12-09 12:30:52 +08:00
Asias He
1cc7887ffb token_metadata: Do nothing if tokens is empty.
When replacing a node, we might ignore the tokens so that the tokens is
empty. In this case, we will have

   std::unordered_map<inet_address, std::unordered_set<token>> = {ip, {}}

passed to token_metadata::update_normal_tokens(std::unordered_map<inet_address,
std::unordered_set<token>>& endpoint_tokens)

and hit the assert

   assert(!tokens.empty());
2015-12-09 12:30:52 +08:00
Asias He
e79c85964f system_keyspace: Flush system.peers in remove_endpoint
1) Start node 1, node 2, node 3
2) Stop  node 3
3) Start node 4 to replace node 3
4) Kill  node 4 (removal of node 3 in system.peers is not flushed to disk)
5) Start node 4 (will load node 3's token and host_id info in bootup)

This makes

   "Token .* changing ownership from 127.0.0.3 to 127.0.0.4"

messages printed again in step 5) which are not expected, which fails the dtest

   FAIL: replace_first_boot_test (replace_address_test.TestReplaceAddress)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "scylla-dtest/replace_address_test.py",
   line 220, in replace_first_boot_test
       self.assertEqual(len(movedTokensList), numNodes)
   AssertionError: 512 != 256
2015-12-09 12:30:52 +08:00
Asias He
110a18987e token_metadata: Print Token changing ownership from
Needed by test.
2015-12-09 12:30:52 +08:00
Asias He
906f670a86 gossip: Print node status in handle_major_state_change
It is useful to know the STATUS value when debugging.
2015-12-09 12:29:15 +08:00
Asias He
a0325a5528 gossip: Simplify is_shutdown and friends.
Use the newly added helper get_gossip_status.
2015-12-09 12:29:15 +08:00
Asias He
9d4382c626 gossip: Introduce get_gossip_status
Get value of application_state::STATUS.
2015-12-09 12:29:15 +08:00
Asias He
5a65d8bcdd gossip: Fix endless marking a node down
In commit 56df32ba56 (gossip: Mark node as
dead even if already left). A node liveness check is missed.

Fix it up.

Before: (mark a node down multiple times)

[Tue Dec  8 12:16:33 2015] INFO  [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec  8 12:16:33 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec  8 12:16:34 2015] INFO  [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec  8 12:16:34 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec  8 12:16:35 2015] INFO  [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec  8 12:16:35 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec  8 12:16:36 2015] INFO  [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec  8 12:16:36 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead

After: (mark a node down only one time)

[Tue Dec  8 12:28:36 2015] INFO  [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec  8 12:28:36 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
2015-12-09 12:29:15 +08:00
Asias He
fa3c84db10 gossip: Kill default constructor for versioned_value
The only reason we needed it is to make
   _application_state[key] = value
work.

With the current default constructor, we increase the version number
needlessly. To fix and to be safe, remove the default constructor
completely.
2015-12-09 12:29:15 +08:00
Asias He
52a5e954f9 gossip: Pass const ref for versioned_value in on_change and before_change 2015-12-09 12:29:15 +08:00
Asias He
3308430343 storage_service: Make before_change and on_change log print more informative
- Make before_change and on_change print the versioned_value
- Print endpoint address first in handle_state_* and
  on_change and friends.
2015-12-09 12:29:15 +08:00
Asias He
ccbd801f40 storage_service: Fix decommissioned nodes are willing to rejoin the cluster if restarted
Backport: CASSANDRA-8801

a53a6ce Decommissioned nodes will not rejoin the cluster.

Tested with:
topology_test.py:TestTopology.decommissioned_node_cant_rejoin_test
2015-12-09 10:43:51 +08:00
Asias He
b3dd2d976a storage_service: Simplify prepare_to_join with seastar thread 2015-12-09 10:43:51 +08:00
Asias He
e9a4d93d1b storage_service: Fix added node not showing up in nodetool in status joining
The get_token_endpoint API should return a map of tokens to endpoints,
including the bootstrapping ones.

Use get_local_storage_service().get_token_to_endpoint_map() for it.

$ nodetool -p 7100 status

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns    Host ID Rack
UN  127.0.0.1  12645      256     ?  eac5b6cf-5fda-4447-8104-a7bf3b773aba  rack1
UN  127.0.0.2  12635      256     ?  2ad1b7df-c8ad-4cbc-b1f1-059121d2f0c7  rack1
UN  127.0.0.3  12624      256     ?  61f82ea7-637d-4083-acc9-567e0c01b490  rack1
UJ  127.0.0.4  ?          256     ?  ced2725e-a5a4-4ac3-86de-e1c66cecfb8d  rack1

Fixes #617
2015-12-09 10:43:51 +08:00
Paweł Dziepak
63bdf52803 tests/lsa: add large allocations test
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 23:56:46 +01:00
Tomasz Grabiec
d68a8b5349 Merge branch 'dev/amnon/index_summary_size_v2' from seastar-dev.git
API for getting sstable index summary memory footprint from Amnon
2015-12-08 20:03:39 +01:00
Paweł Dziepak
73a1213160 scylla-gdb.py: print lsa zones
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 19:31:40 +01:00
Paweł Dziepak
0d66300d43 lsa: add more counters
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 19:31:40 +01:00
Paweł Dziepak
83b004b2fb lsa: avoid fragmenting memory
Originally, lsa allocated each segment independently what could result
in high memory fragmentation. As a result many compaction and eviction
passes may be needed to release a sufficiently big contiguous memory
block.

These problems are solved by introduction of segment zones, contiguous
groups of segments. All segments are allocated from zones and the
algorithm tries to keep the number of zones to a minimum. Moreover,
segments can be migrated between zones or inside a zone in order to deal
with fragmentation inside zone.

Segment zones can be shrunk but cannot grow. Segment pool keeps a tree
containing all zones ordered by their base addresses. This tree is used
only by the memory reclamer. There is also a list of zones that have
at least one free segments that is used during allocation.

Segment allocation doesn't have any preferences which segment (and zone)
to choose. Each zone contains a free list of unused segments. If there
are no zones with free segments a new one is created.

Segment reclamation migrates segments from the zones higher in memory
to the ones at lower addresses. The remaining zones are shrunk until the
requested number of segments is reclaimed.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 19:31:40 +01:00
Paweł Dziepak
6c4a54fb0b tests: add tests for utils::dynamic_bitset
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 19:31:40 +01:00
Paweł Dziepak
2fb14a10b6 utils: add dynamic_bitset
A dynamic bitset implementation that provides functions to search for
both set and cleared bits in both directions.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 19:31:40 +01:00
Paweł Dziepak
40dda261f2 lsa: maintain segment to region mapping
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 19:31:40 +01:00
Paweł Dziepak
c4e71bac7f tests/row_cache_alloc_stress: make sure that allocation fails
Currently test case "Testing reading when memory can't be reclaimed."
assumes that the allocation section used by row cache upon entering
will require more free memory than there is available (inc. evictable).
However, the reserves used by allocation section are adjusted
dynamically and depend solely on previous events. In other words there
is no guarantee that the reserve would be increased so much that the
allocation will fail.

The problem is solved by adding another allocation that is guaranteed
to be bigger than all evictable and free memory.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 19:31:40 +01:00
Paweł Dziepak
2e94086a2c lsa: use bi::list to implement segment_stack
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-12-08 19:31:40 +01:00
Tomasz Grabiec
6ead7a0ec5 Merge tag 'large-blobs/v3' from git@github.com:avikivity/scylla.git
Scattering of blobs from Avi:

This patchset converts the stack to scatter managed_bytes in lsa memory,
allowing large blobs (and collections) to be stored in memtable and cache.
Outside memtable/cache, they are still stored sequentially, but it is assumed
that the number of transient objects is bounded.

The approach taken here is to scatter managed_bytes data in multiple
blob_storage objects, but to linearize them back when accessing (for
example, to merge cells).  This allows simple access through the normal
bytes_view.  It causes an extra two copies, but copying a megabyte twice
is cheap compared to accessing a megabyte's worth of small cells, so
per-byte throughput is increased.

Testing show that lsa large object space is kept at zero, but throughput
is bad because Scylla easily overwhelms the disk with large blobs; we'll
need Glauber's throttling patches or a really fast disk to see good
throughput with this.
2015-12-08 19:15:13 +01:00
Avi Kivity
5c5331d910 tests: test large blobs in memtables 2015-12-08 15:17:09 +02:00
Avi Kivity
0c2fba7e0b lsa: advertize our preferred maximum allocation size
Let managed_bytes know that allocating below a tenth of the segment size is
the right thing to do.
2015-12-08 15:17:09 +02:00
Avi Kivity
f9e2a9a086 mutation_partition: work on linearized atomic_cell_or_mutation objects
Ensure that when we examine atomic_cell_or_mutation objects for merging,
that they are contiguous in memory.  When we are done we scatter them again.
2015-12-08 15:17:09 +02:00
Avi Kivity
ad975ad629 atomic_cell_or_collection: linearize(), unlinearize()
Add linearize() and unlinearize() methods that allow making an
atomic_cell_or_collection object temporarily contiguous, so we can examine
it as a bytes_view.
2015-12-08 15:17:09 +02:00
Avi Kivity
13324607e6 managed_bytes: conform to allocation_strategy's max_preferred_allocation_size
Instead of allocating a single blob_storage, chain multiple blob_storage
objects in a list, each limited not to exceed the allocation_strategy's
max_preferred_allocation_size.  This allows lsa to allocate each blob_storage
object as an lsa managed object that can be migrated in memory.

Also provide linearize()/scatter() methods that can be used to temporarily
consolidate the storage into a single blob_storage.  This makes the data
contiguous, so we can use a regular bytes_view to examine it.
2015-12-08 15:17:08 +02:00
Takuya ASADA
8c98e239d0 dist: use /etc/scylla as SCYLLA_CONF directory on AMI
We don't need copy /var/lib/scylla/conf on RAID anymore, it moved to /etc/scylla.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2015-12-08 11:09:12 +02:00
Avi Kivity
098136f4ab Merge "Convert serialization of query::result to use db::serializer<>" from Tomasz
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2015-12-07 16:53:34 +02:00
Amnon Heiman
3ce7fa181c API: Add the implementation for index_summary_off_heap_memory
This adds the implementation for the index_summary_off_heap_memory for a
single column family and for all of them.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2015-12-07 15:15:39 +02:00
Amnon Heiman
e786f1d02f sstable: Add get_summary function
The get_summary method returns a const reference to the summary object.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2015-12-07 14:52:18 +02:00
Amnon Heiman
bae286a5b4 Add memory_footprint method to summary_ka
Similiar to origin, off heap memory, memory_footprint is the size of
queus multiply by the structure size.

memory_footprint is used by the API to report the memory that is taken
by the summary.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2015-12-07 14:52:18 +02:00
Amnon Heiman
2086c651ba column_family: get_snapshot_details should return empty map for no snapshots
If there is no snapshot directory for the specific column family,
get_snapshot_details should return an empty map.

This patch check that a directory exists before trying to iterate over
it.

Fixes #619

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2015-12-07 12:51:04 +01:00
Tomasz Grabiec
b43b5af894 Merge tag 'tgrabiec/make-future-values-nothrow-move-constructible-v3' from seastar-dev.git
Seastar's future<> now requires types to be nothrow move
constructible. This series makes Scylla code comply.
2015-12-07 10:43:18 +01:00
Tomasz Grabiec
95f515a6bd Move seastar submodule head
Scylla changes:
  sstable.cc: Remove file_exists() function which conflicts with seastar's

Amnon Heiman (2):
      reactor: Add file_exists method
      Add a wrapper for file_exists

Avi Kivity (2):
      Merge "Introduce shared_future" from Tomasz
      Merge ""scripts: a few fixes in posix_net_conf.sh" from Vlad

Gleb Natapov (3):
      rpc: not stop client in error state
      avoid allocation in parallel_for_each is there is nothing to do
      memory: fix size_to_idx calculation

Nadav Har'El (1):
      test: fix use-after-free in timertest

Pawe�� Dziepak (1):
      memory: use size instead of old_size to shrink memory block

Tomasz Grabiec (7):
      file: Mark move constructor as noexcept
      core: future: Add static asserts about type's noexcept guarantees
      core: future: Drop now redundant move_noexcept flag
      core: future_state: Make state getters non-destructive for non-rvalue-refs
      core: future: Make get_available_state() noexcept
      core: Introduce shared_future
      Make json_return_type movable

Vlad Zolotarov (8):
      scripts: posix_net_conf.sh: ban NIC IRQs from being moved by irqbalance
      scripts: posix_net_conf.sh: exclude CPU0 siblings from RPS
      scripts: posix_net_conf.sh: Configure XPS
      scripts: posix_net_conf.sh: Add a new mode for MQ NICs
      scripts: posix_net_conf.sh: increase some backlog sizes
      core: to_sstring(): cleanup
      core: to_sstring_strintf(): always use %g(or %lg) format for floating point values
      core: prevent explicit calls for to_sstring_sprintf()
2015-12-07 10:41:39 +01:00
Glauber Costa
79e70568d7 scylla-setup: do not add discard to the command line
In a recent discussion with the XFS developers, Dave Chinner recommended
us *not* to use discard, but rather issue fstrims explicitly. In machines
like Amazon's c3-class, the situation is made worse by the fact that discard
is not supported by the disk. Contrary to my intuition, adding the discard
mount option in such situation is *not* a nop and will just create load
for no reason.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-12-07 11:22:27 +02:00
Tomasz Grabiec
934d3f06d1 api: Make histogram reduction work on domain value instead of json objects
Objects extending json_base are not movable, so we won't be able to
pass them via future<>, which will assert that types are nothrow move
constructible.

This problem only affects httpd::utils_json::histogram, which is used
in map-reduce. This patch changes the aggregation to work on domain
value (utils::ihistrogram) instead of json objects.
2015-12-07 09:50:28 +01:00