Where all the other documentation is, and hence where people would be looking for it. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20191128144612.378244-1-bdenes@scylladb.com>
504 lines
16 KiB
Markdown
504 lines
16 KiB
Markdown
# In-Memory Representation
|
|
|
|
## Introduction
|
|
|
|
Instances of all C++ objects need to have their size fixed and known at compile
|
|
time. This is a significant restriction when it comes to minimising the memory
|
|
footprint of objects that are kept alive for a long amount of time.
|
|
|
|
Introducing a serialisation format that allows objects to be of a variable size
|
|
enables a range of possible optimisations, but manually writing serialisation
|
|
and deserialisation code for all required types would not only be tedious, but
|
|
also quite quickly result in an unmaintainable code.
|
|
|
|
IMR attempts to solve this problem by providing a general infrastructure for
|
|
serialisation and deserialisation with a goal of imposing as little limitations
|
|
as possible and generating as efficient code as possible.
|
|
|
|
## Basic concepts
|
|
|
|
### Contexts
|
|
|
|
The IMR expects that in many cases multiple objects will share certain
|
|
properties in a very predictable way. This can be taken advantage of by not
|
|
storing all the information necessary to correctly read IMR objects inside them,
|
|
but moving the shared part to an external state.
|
|
|
|
Fundamental and compound IMR types are designed with this in mind and many of
|
|
them require an external context to provide additional information necessary
|
|
during deserialisation.
|
|
|
|
*For example, `imr::buffer<Tag>` doesn't store its size anywhere, but requires
|
|
to be provided at deserialisation with a context object that implements
|
|
`size_of<Tag>()` member function.*
|
|
|
|
The IMR itself doesn't care about the source of the information the context
|
|
provide. It can come either from an external state that describes properties of
|
|
multiple instances of a particular IMR type or the context logic may read parts
|
|
of an IMR object itself. It is also worth noting that objects are not tied to
|
|
contexts in any way. The user code could use different context object (with
|
|
different internal logic) as long as the decisions it makes are the same and
|
|
sufficient for that particular use.
|
|
|
|
*For example, in a structure a `imr::buffer<Tag>` could be preceded by an
|
|
integer which determines its size. When a context is created it takes a pointer
|
|
to the structure, reads the size field and returns it when asked by the buffer
|
|
deserialiser.*
|
|
|
|
#### Context factories
|
|
|
|
In many cases a context is going to combine some external state and data stored
|
|
in an IMR object to provide information necessary to deserialise it. In order
|
|
to make creation of context more convenient context factories were introduced
|
|
which are objects that keep the external state and create an actual context
|
|
instance when given a pointer to an IMR object.
|
|
|
|
```c++
|
|
context_factory<ContextType, ExternalState> factory(external_state);
|
|
auto context = factory.create(pointer_to_an_imr_object);
|
|
```
|
|
|
|
### Views & mutable views
|
|
|
|
The interface for deserialising an IMR object is built around views. An IMR type
|
|
given context and a pointer to an IMR object returns a view object that
|
|
provides methods specific for that type. For fundamental types these functions
|
|
will usually provide a way of creating an equivalent C++ type. In case of
|
|
compound types their views will allow accessing members in a checked or
|
|
unchecked manner.
|
|
|
|
Some IMR types can be updated in place. A mutable view of their instances can
|
|
be created which aside from providing an interface for reading the object will
|
|
also allow it to be updated.
|
|
|
|
### Serialisers
|
|
|
|
IMR objects are serialised in three phases:
|
|
1. Computing the object sizes and recording any necessary memory allocations.
|
|
2. Allocating memory. This is the only phase that can fail.
|
|
3. Writing serialised data.
|
|
|
|
It is imperative that the decisions made in the first (sizing) and the third
|
|
(writing) phase are exactly the same. In order to make that easier to guarantee
|
|
the serialisation interfaces of many compound types accept a lambda template
|
|
which is first invoked with a sizer and later with a writer helper objects.
|
|
|
|
#### Placeholders
|
|
|
|
If all instances of an IMR type are guaranteed to be the same size it is
|
|
possible to defer their serialisation. The serialisation function can be given
|
|
a placeholder object which would later be set to an actual value.
|
|
|
|
#### Continuation hooks
|
|
|
|
Compound types can be nested, e.g. a member of a structure can be another
|
|
structure. However, as mentioned in the [Serialisers](#serialisers) section the
|
|
basic interface for serialisation is to accept lambda objects. In case of
|
|
nested structures this would require creating a nested lambdas which can be
|
|
inconvenient and result in a code that is hard to follow.
|
|
|
|
The solution is to introduce continuation hooks to serialisation helpers. The
|
|
outer structure serialisation helper provides a `serialize_nested()` member
|
|
function for serialising the inner one. The function returns a serialisation
|
|
helper for the nested structure which behaves as the stand-alone one except
|
|
that when `done()` is called it returns back a serialisation helper for the
|
|
outer structure.
|
|
|
|
```c++
|
|
return serializer
|
|
.serialize(...)
|
|
.serialize_nested()
|
|
.serialize(...)
|
|
.done()
|
|
.serialize(...)
|
|
.done();
|
|
```
|
|
|
|
### IMR types
|
|
|
|
#### Tags
|
|
|
|
IMR extensively uses tags to name members of a compound type or match an
|
|
instance of a type with an appropriate member function of the deserialisation
|
|
context.
|
|
|
|
Any type can be a tag, but it is customary to use incomplete classes with a
|
|
meaningful names for this purpose.
|
|
|
|
Reusing tags should be avoided in order to avoid clashes.
|
|
|
|
#### Type concept
|
|
|
|
Most IMR types follow a similar interface for serialisation and
|
|
deserialisation. There may be, however, some differences specific to a
|
|
particular type and its purpose.
|
|
|
|
```c++
|
|
struct imr_type {
|
|
static auto make_view(const uint_t* imr_object) noexcept;
|
|
static auto make_view(uint_t* imr_object) noexcept;
|
|
|
|
// Returns the size of a serialized IMR object
|
|
template<typename Context>
|
|
static size_t serialized_object_size(const uint8_t* imr_object, const Context& context) noexcept;
|
|
|
|
// Returns the size of a serialized object without actually serializing it.
|
|
// Arguments depend on the IMR type.
|
|
static size_t size_when_serialized(...) noexcept;
|
|
|
|
// Returns the size of a serialized object. Arguments depend on the actual
|
|
// IMR type.
|
|
static size_t serialize(uint8_t* out, ...) noexcept;
|
|
};
|
|
```
|
|
|
|
## Fundamental types
|
|
|
|
Type | Fixed size | Needs context | In-place update | Placeholders |
|
|
----------------------|:----------:|:-------------:|:--------------------------:|:------------:|
|
|
`flags` | Yes | No | Yes | Yes |
|
|
`pod` | Yes | No | Yes | Yes |
|
|
`buffer` | No | Yes | Yes (length cannot change) | No |
|
|
|
|
### `flags`
|
|
|
|
```c++
|
|
template<typename Tag>
|
|
struct imr::set_flag {
|
|
set_flag(bool v = true) noexcept;
|
|
};
|
|
|
|
template<typename Tags...>
|
|
struct imr::flags {
|
|
struct view {
|
|
template<typename Tag>
|
|
bool get() const noexcept;
|
|
|
|
template<typename Tag>
|
|
bool set(bool value = true) noexcept;
|
|
};
|
|
|
|
// Sets flags which tags are present in Tags1, the remaining flags are
|
|
// cleared.
|
|
template<typename... Tags1>
|
|
void serialize(imr::set_flag<Tags1>...) noexcept;
|
|
};
|
|
```
|
|
|
|
`flags` is a bitfield of named flags. Each declared flag uses exactly one bit
|
|
and the whole bitfield uses the smalles number of bytes possible to fit all
|
|
required bits.
|
|
|
|
`flags` is a fixed-size type that can be updated in place and doesn't require
|
|
and external context for deserialisation.
|
|
|
|
### `pod`
|
|
|
|
```c++
|
|
template<typename PODType>
|
|
struct imr::pod {
|
|
struct view {
|
|
PODType load() const noexcept;
|
|
void store(PODType object) noexcept;
|
|
};
|
|
|
|
void serialize(PODType object) noexcept;
|
|
};
|
|
```
|
|
|
|
`pod` is a POD object using the same representation as that defined by the C++
|
|
ABI including all the internal padding in case of compound types. It is
|
|
therefore recommended that `pod` is used exclusively to represent C++
|
|
fundamental types.
|
|
|
|
`pod` is a fixed-size type which can be updated in place.
|
|
It does not require an external context for deserialisation.
|
|
|
|
### `buffer`
|
|
|
|
```c++
|
|
template<typename Tag>
|
|
struct imr::buffer {
|
|
using view = bytes_view;
|
|
|
|
void serialize(bytes_view) noexcept;
|
|
};
|
|
```
|
|
|
|
`buffer` is a type representing an array of bytes of an arbitrary length.
|
|
|
|
`buffer` is a variable-size type which can be updated in-place as long as the
|
|
length of the buffer remains unchanged.
|
|
|
|
#### Context requirements
|
|
|
|
Deserialisation of `buffer` objects requires an external context providing the
|
|
following member:
|
|
|
|
```c++
|
|
size_t size_of<Tag>() const noexcept;
|
|
```
|
|
|
|
Where `Tag` is the same type as the one used as a template parameter for
|
|
`buffer`. `size_of()` is must return the length of the buffer.
|
|
|
|
|
|
## Compound types
|
|
|
|
### `tagged_type`
|
|
|
|
```c++
|
|
template<typename Tag, typename IMRType>
|
|
struct imr::tagged_type : IMRType { };
|
|
```
|
|
|
|
`tagged_type` is a thin wrapper around an IMR type that allows annotating it
|
|
with a tag without changing its views or serialisers in any way.
|
|
|
|
The main purpose of adding tags to the type name is to allow defining custom
|
|
methods for only some usage of a particular IMR type (see section
|
|
[Methods](#methods)).
|
|
|
|
### `optional`
|
|
|
|
```c++
|
|
template<typename Tag, typename IMRType>
|
|
struct imr::optional {
|
|
struct view {
|
|
template<typename Context>
|
|
IMRType::view get(const Context&);
|
|
};
|
|
|
|
template<typename... Args>
|
|
void serialize(Args&&...) noexcept;
|
|
};
|
|
```
|
|
|
|
`optional` represents an object that may not be present in some circumstances.
|
|
Information whether it does is not stored in any sort of metadata, but the
|
|
deserialisation context is expected to provide it.
|
|
|
|
#### Context requirements
|
|
|
|
```c++
|
|
bool is_present<Tag>() const noexcept;
|
|
```
|
|
|
|
Optionals require the deserialisation context to provide information whether
|
|
the underlying object is present.
|
|
|
|
### `variant`
|
|
|
|
```c++
|
|
template<typename Tag, typename IMRType>
|
|
struct imr::alternative;
|
|
|
|
template<typename Tag, typename... Alternatives>
|
|
struct imr::variant {
|
|
struct view {
|
|
template<typename AlternativeTag>
|
|
auto as() const noexcept;
|
|
|
|
template<typename Visitor>
|
|
decltype(auto) accept(Visitor&&) const;
|
|
};
|
|
|
|
template<typename AlternativeTag, typename... Args>
|
|
void serialize(Args&&...) noexcept;
|
|
};
|
|
```
|
|
|
|
`variant` is a compound type that has multiple alternatives, each being an
|
|
IMR type. At any time there is exactly one active alternative and it is the
|
|
only one that can be read. Variants do not store information about the active
|
|
alternative by themselves but rely on the external context to provide that
|
|
information.
|
|
|
|
#### Context requirements
|
|
|
|
```c++
|
|
imr::variant::alternative_index active_alternative_of<Tag>() const noexcept;
|
|
|
|
template<typename AlternativeTag>
|
|
auto context_for() const noexcept;
|
|
```
|
|
|
|
Variant requires deserialisation context to provide information which one of
|
|
its alternatives is active as well as context for deserialisation of that
|
|
alternative.
|
|
|
|
### `structure`
|
|
|
|
```c++
|
|
template<typename Tag, typename IMRType>
|
|
struct imr::member;
|
|
|
|
template<typename Tag, typename Members>
|
|
struct imr::structure {
|
|
struct view {
|
|
template<typename MemberTag>
|
|
auto get() noexcept;
|
|
|
|
template<typename MemberTag>
|
|
size_t offset_of() const noexcept;
|
|
};
|
|
|
|
template<typename Writer, typename... Args>
|
|
void serialize(Writer&&, Args&&...) noexcept;
|
|
|
|
template<typename Tag, typename Context>
|
|
auto get_member(uint8_t*, const Context&) const noexcept;
|
|
};
|
|
|
|
struct imr::structure_writer {
|
|
// For imr::optional:
|
|
template<typename... Args>
|
|
auto serialize(Args&&...);
|
|
|
|
auto skip();
|
|
|
|
// For imr::variant:
|
|
template<typename AlternativeTag, typename... Args>
|
|
auto serialize_as(Args&&...);
|
|
|
|
// For other types:
|
|
template<typename... Args>
|
|
auto serialize(Args&&...);
|
|
|
|
// After all members:
|
|
auto done() noexcept;
|
|
};
|
|
```
|
|
|
|
`structure` is a compound type that stores possibly multiple member objects
|
|
one after another. All members are always present and their IMR type is fixed.
|
|
|
|
The interface for serialising structures is based on serialisation helpers.
|
|
Methods `serialize()` and `size_when_serialized()` accept a lambda that would
|
|
be given a serialisation helper object. That object expects its member function
|
|
`serialize()` (there are slightly different ones for optional and variant
|
|
structure members) to be invoked with an arguments that would normally be passed
|
|
to methods `structure()` and `size_when_serialized()` of the type of the member
|
|
that is being serialised (in the order they appear in the structure definition).
|
|
After the last member has been serialised `done()` is to be called and its
|
|
return value returned from the lambda.
|
|
|
|
#### Context requirements
|
|
|
|
```c++
|
|
template<typename AlternativeTag>
|
|
auto context_for() const noexcept;
|
|
```
|
|
|
|
## Methods
|
|
|
|
IMR types can define destructor and mover methods. Their goal is to provide
|
|
similar functionality to C++ destructors and move constructors and thus enable
|
|
owning references inside IMR objects. Movers are also essential in ensuring
|
|
that all references are properly updated when LSA moves objects around.
|
|
|
|
The default mover and destructor for fundamental types does nothing. The
|
|
default methods for compound types and containers invokes that method for
|
|
all present elements in an unspecified order.
|
|
|
|
User can define mover and destructor methods for any IMR type which would
|
|
override any default methods that this type might have.
|
|
|
|
Neither movers nor destructors are allowed to fail.
|
|
|
|
```c++
|
|
template<>
|
|
struct imr::mover<IMRType> {
|
|
static void run(uint8_t* ptr, const Context& ctx) noexcept {
|
|
// user-defined mover action
|
|
}
|
|
};
|
|
|
|
template<>
|
|
struct imr::destructor<IMRType> {
|
|
static void run(const uint8_t* ptr, const Context& ctx) noexcept {
|
|
// user-defined destructor action
|
|
}
|
|
};
|
|
|
|
```
|
|
|
|
### Mover
|
|
|
|
When an IMR object is being moved its serialised form is copied to the new
|
|
place in memory and then a mover method is called with the address of the new
|
|
location passed as an argument.
|
|
|
|
*Note that, unlike C++, IMR does not call the destructor of the object moved
|
|
from. In other words, while moving in C++ involves creating a new object that
|
|
steals internal state from another, IMR just copies the object to the new
|
|
location and invokes a mover to give it a chance to do necessary updates, it is
|
|
however still considered to be the same object as the original one.*
|
|
|
|
### Destructor
|
|
|
|
The destructor of an IMR object is called before the storage it occupies is
|
|
freed. The address of the object passed to the destructor is either a location
|
|
at which the object was originally created or the address that was given to the
|
|
latest invocation of the object mover method.
|
|
|
|
## Memory allocations
|
|
|
|
IMR objects are allowed to own memory and it is therefore expected that during
|
|
serialisation there will be a need to perform allocation. A convenience class
|
|
`imr::alloc::object_allocator` exists to aid this process.
|
|
|
|
The object allocator provides serialisation helpers for both sizing and writing
|
|
phases, which act similarily to the structure serialisation helpers and return
|
|
a pointer to the owned object. After the sizing phase member function
|
|
`allocate_all()` needs to be called which allocates all the reserved buffers.
|
|
|
|
*Note that in case there is a chain of owned objects (e.g. a list) placeholders
|
|
can be used for pointers to avoid recursion.*
|
|
|
|
```c++
|
|
using imr_type1 = ...;
|
|
using imr_type2 = ...;
|
|
|
|
imr::alloc::object_allocator allocator;
|
|
|
|
auto writer = [&] (auto serializer, auto& allocator) noexcept {
|
|
imr::placeholder<imr::pod<void*>> pointer;
|
|
auto ret = serializer
|
|
.serialize(pointer)
|
|
.done();
|
|
|
|
auto pointer_to_owned_object = allocator.allocate_nested<imr_type2>()
|
|
.serialize(...)
|
|
.serialize(...)
|
|
.done();
|
|
|
|
pointer.serialize(pointer_to_owned_object);
|
|
|
|
return ret;
|
|
};
|
|
|
|
auto size = imr_type1::size_when_serialized(writer , allocator.get_size());
|
|
|
|
auto obj = std::make_unique<uint8_t[]>(size.total_storage());
|
|
allocator.allocate_all();
|
|
|
|
imr_type1::serialize(obj.get(), writer, allocator.get_serializer());
|
|
```
|
|
|
|
### Log-Structured Allocator support
|
|
|
|
LSA relies on migrator objects to provide logic necessary for moving objects
|
|
and obtaining their size. To do these action the information about the IMR type
|
|
of the object needs to be preserved as well as at least minimal external
|
|
context sufficient for calling the mover.
|
|
|
|
A helper class template is provided which takes the IMR type and (context
|
|
factory)[#context_factories] type as template parameters as well as actual
|
|
context factory in constructor and implements `migrate_fn_type` interface.
|
|
|
|
```c++
|
|
imr::alloc::lsa_migrate_fn<IMRType, ContextFactory> migrator(context_factory);
|
|
auto ptr = current_allocator().alloc(migrator, size, 1);
|
|
```
|