we do not need to hold the read locks at the higher
layer instead before reading the body, instead hold
the read locks properly at the time of renamePart()
for protection from racy part overwrites to compete
with concurrent completeMultipart().
CheckParts call can take time to verify 10k parts of a single object in a single drive.
To avoid an internal dealine of one minute in the single handler RPC, this commit will
switch to streaming RPC instead.
This API had missing permissions checking, allowing a user to change
their policy mapping by:
1. Craft iam-info.zip file: Update own user permission in
user_mappings.json
2. Upload it via `mc admin cluster iam import nobody iam-info.zip`
Here `nobody` can be a user with pretty much any kind of permission (but
not anonymous) and this ends up working.
Some more detailed steps - start from a fresh setup:
```
./minio server /tmp/d{1...4} &
mc alias set myminio http://localhost:9000 minioadmin minioadmin
mc admin user add myminio nobody nobody123
mc admin policy attach myminio readwrite nobody nobody123
mc alias set nobody http://localhost:9000 nobody nobody123
mc admin cluster iam export myminio
mkdir /tmp/x && mv myminio-iam-info.zip /tmp/x
cd /tmp/x
unzip myminio-iam-info.zip
echo '{"nobody":{"version":1,"policy":"consoleAdmin","updatedAt":"2024-08-13T19:47:10.1Z"}}' > \
iam-assets/user_mappings.json
zip -r myminio-iam-info-updated.zip iam-assets/
mc admin cluster iam import nobody ./myminio-iam-info-updated.zip
mc admin service restart nobody
```
`mc admin heal ALIAS/bucket/object` does not have any flag to heal
object noncurrent versions, this commit will make healing of the object
noncurrent versions implicitly asked.
This also fixes the 'mc admin heal ALIAS/bucket/object' that does not work
correctly when the bucket is versioned. This has been broken since Apr 2023.
Golang http.Server will call SetReadDeadline overwriting the previous
deadline configuration set after a new connection Accept in the custom
listener code. Therefore, --idle-timeout was not correctly respected.
Make http.Server read/write timeout similar to --idle-timeout.
The code assigns corrupted state to a drive for any unexpected error,
which is confusing for users. This change will make sure to assign
corrupted state only for corrupted parts or xl.meta. Use unknown state
with a explanation for any unexpected error, like canceled, deadline
errors, drive timeout, ...
Also make sure to return the bucket/object name when the object is not
found or marked not found by the heal dangling code.
* Add the policy name to the audit log tags when doing policy-based API calls
* Audit log the retention settings requested in the API call
* Audit log of retention on PutObjectRetention API path too
The RemoveUser API only removes internal users, and it reports success
when it didnt find the internal user account for deletion. When provided
with a service account, it should not report success as that is misleading.
Update github.com/cosnicolaou/pbzip2 to latest version for
significant performance improvements. This update brings a 45%
reduction in processing time.
Previously, not setting http.Config.HTTPTimeout for logger webhook
was resulting in a timeout of 0, and causing "deadline exceeded"
errors in log webhook.
This change introduces a new env variable for configuring log webhook
timeout and more importantly sets it when the config is initialised.
Manual heal can return XMinioHealInvalidClientToken if the manual
healing is started in the first node, and the next mc call to get the
heal status is landed on another node. The reason is that redirection
based on the token ID is not able to redirect requests to the first node
due to a typo.
This also affects the batch cancel command if the batch is being done in
the first node, the user will never be able to cancel it due to the same
bug.
HTTP likes to slap an infinite read deadline on a connection and
do a blocking read while the response is being written.
This effectively means that a reading deadline becomes the
request-response deadline.
Instead of enforcing our timeout, we pass it through and keep
"infinite deadline" is sticky on connections.
However, we still "record" when reads are aborted, so we never overwrite that.
The HTTP server should have `ReadTimeout` and `IdleTimeout` set for the deadline to be effective.
Use --idle-timeout for incoming connections.
Since DeadlineConn would send deadline updates directly upstream,
it would race with Read/Write operations. The stdlib will perform a read,
but do an async SetReadDeadLine(unix(1)) to cancel the Read in
`abortPendingRead`. In this case, the Read may override the
deadline intended to cancel the read.
Stop updating deadlines if a deadline in the past is seen and when Close is called.
A mutex now protects all upstream deadline calls to avoid races.
This should fix the short-term buildup of...
```
365 @ 0x44112e 0x4756b9 0x475699 0x483525 0x732286 0x737407 0x73816b 0x479601
# 0x475698 sync.runtime_notifyListWait+0x138 runtime/sema.go:569
# 0x483524 sync.(*Cond).Wait+0x84 sync/cond.go:70
# 0x732285 net/http.(*connReader).abortPendingRead+0xa5 net/http/server.go:729
# 0x737406 net/http.(*response).finishRequest+0x86 net/http/server.go:1676
# 0x73816a net/http.(*conn).serve+0x62a net/http/server.go:2050
```
AFAICT Only affects internode calls that create a connection (non-grid).
These are needed checks for the functions to be un-crashable with any input
given to `msgUnPath` (tested with fuzzing).
Both conditions would result in a crash, which prevents that. Some
additional upstream checks are needed.
Fixes#20610
This PR fixes a regression introduced in https://github.com/minio/minio/pull/19797
by restoring the healing ability of transitioned objects
Bonus: support for transitioned objects to carry original
The object name is for future reverse lookups if necessary.
Also fix parity calculation for tiered objects to n/2 for n/2 == (parity)
Existing implementation runs IAM purge routines for expired LDAP and
OIDC accounts with a probability of 0.25 after every IAM refresh. This
change ensures that they are run once in each hour.
xlStorage.Healing() returns nil if there is an error reading
.healing.bin or if this latter is empty. healing.bin update()
call returns early if .healing.bin is empty; hence, no further update
of .healing.bin is possible.
A .healing.bin can be empty if os.Open() with O_TRUNC is successful
but the next Write returns an error.
To avoid this weird situation, avoid making healingTracker.update()
to return early if .healing.bin is empty, so write again.
This commit also fixes wrong error log printing when an object is
healed in another drive in the same erasure set but not in the drive
that is actively healing by fresh drive healing code. Currently, it prints
<nil> instead of a factual error.
* heal: Scan .minio.sys metadata only during site-wide heal (#137)
mc admin heal always invoke .minio.sys heal, but sometimes, this latter
contains a lot of data, many service accounts, STS accounts etc, which
makes mc admin heal command very slow.
Only invoke .minio.sys healing when no bucket was specified in `mc admin
heal` command.
Healing a large object with a normal scan mode where no parts read
is involved can still fail after 30 seconds if an object has
There are too many parts when hard disks are being used mainly.
The reason is there is a general deadline that checks for all parts we
do a deadline per part.
When listing objects with metadata, avoid returning an "expires" time
metadata value when its value is the zero time as this means that no
expires value is set on the object.
The condition were incorrect as we were comparing the filter
value against the modification time object.
For example if created after filter date is after modification
time of object, that means object was created before the filter
time and should be skipped while replication because as per the
filter we need only the objects created after the filter date.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
When Keycloak vendor is set, the code will start to clean up service
accounts that parents do not exist anymore. However, the code will also
look for the parent user of site-replicator-0, MINIO_ROOT_USER, which
obviously does not exist in Keycloak. Therefore, the site-replicator-0
will be removed automatically.
This commit will avoid cleaning up service accounts generated from
the root user.
The verify file handler response format was changed from gob to msgp
since two months but we forgot updating the verify handler client.
VerifyFile is only called during a heal deep scan (bitrot check).
HealObject() will fail in that case and will mark all disks corrupted and
will return early (as unrecoverable object but it will also not be
removed)
It is a bit rare for HealObject to be called with a deep scan flag. It
is called when a HealObject with a normal scan (e.g. new drive healing)
detects a bitrot corruption, therefore healing objects with a detected
bitrot corruption will fail.
in cases where we cannot possibly know a way to read and
construct the object, it is impossible to achieve any form of
quorum via xl.meta while we have sufficient responses from
all the drives, we should return object not found.
- PutObjectMetadata()
- PutObjectTags()
- DeleteObjectTags()
- TransitionObject()
- RestoreTransitionObject()
Also improve the behavior of multipart code across
pool locks, hold locks only once per upload ID for
- CompleteMultipartUpload()
- AbortMultipartUpload()
- ListObjectParts() (read-lock)
- GetMultipartInfo() (read-lock)
- PutObjectPart() (read-lock)
This avoids lock attempts across pools for no
reason, this increases O(n) when there are n-pools.
multi-object deletion may or may not compete with locks
granted for other callers, causing concurrent operations
to succeed on each other.
A continuation of the PR https://github.com/minio/minio/pull/20356
When custom authorization via plugin is enabled, the console will now
render the UI as if all actions are allowed. Since server cannot
determine the exact policy allowed for a user via the plugin, this is
acceptable to do. If a particular action is actually not allowed by the
plugin the call will result in an error.
Previously the server was evaluating a policy when custom authZ is
enabled - this is fixed now.
The "unlimited" value on PPC wasn't exactly the same as amd64.
Instead compare against an "unreasonably big value".
Would cause OOM in anything using the concurrent request limit.
Add TTFB for all requests in metrics-v3 in addition to the existing
GetObject. Also for the requests that do not return a body in the
response, calculate TTFB as the HTTP status code and the headers are
sent.
A batch job will fail if the retry attempt is not provided. The reason
is that the code mistakenly gets the retry attempts from the job status
rather than the job yaml file.
This will also set a default empty prefix for batch expiration.
Also this will avoid trimming the prefix since the yaml decoder already
does that if no quotes were provided, and we should not trim if quotes
were provided and the user provided a leading or a trailing space.
Tests if imported service accounts have
required access to buckets and objects.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
- PutObject() for multi-pooled was holding large
region locks, which was not necessary. This affects
almost all slowpoke clients and lengthy uploads.
- Re-arrange locks for CompleteMultipart, PutObject
to be close to rename()
Currently, it is not possible to remove a tier if it is not accessible
or contains some data, add a force flag to make the removal successful
in that case.
When the encryption and compression are both enabled, the
the server will avoid compressing the data for no apparent reason
This commit will enable it and update unit tests.
Download static cURL into release Docker image for all supported architectures.
Currently, the static cURL executable is only downloaded for the `amd64` architecture. However, `arm64` and `ppc64le` variants are also [available](https://github.com/moparisthebest/static-curl/releases/tag/v8.7.1).
postUpload() incorrectly saves actual size as '-1'
we should save correct size when its possible.
Bonus: fix the PutObjectPart() write locker, instead
of holding a lock before we read the client stream.
We should hold it only when we need to commit the parts.
this cache will be honored only when `prefix=""` while
performing ListMultipartUploads() operation.
This is mainly to satisfy applications like alluxio
for their underfs implementation and tests.
replaces https://github.com/minio/minio/pull/20181
AFAICT we send a canceled context to unlock (and thereby releaseAll). This will cause network calls to fail.
Instead use background and add 30s timeout.
Current implementation retries forever until our
log buffer is full, and we start dropping events.
This PR allows you to set a value until we give
up on existing audit/logger batches to proceed to
process the new ones.
Bonus:
- do not blow up buffers beyond batchSize value
- do not leak the ticker if the worker returns
The items will be saved per target batch and will
be committed to the queue store when the batch is full
Also, periodically commit the batched items to the queue store
based on configured commit_timeout; default is 30s;
Bonus: compress queue store multi writes
rebalance metadata is good to have only,
if it cannot be loaded when starting MinIO
for some reason we can possibly ignore it
and move on and let user start rebalance
again if needed.
readParts requires that both part.N and part.N.meta files be present.
This change addresses an issue with how an error to return to the upper
layers was picked from most drives where a UploadPart operation
had failed.
By default, even if MINIO_BROWSER=off set code tries to get free
port available for the console.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
S3 spec does not accept an ILM XML document containing both <Filter>
and <Prefix> XML tags, even if both are empty. That is why we added
a 'set' field in some lifecycle structures to decide when and when not to
show a tag. However, we forgot to disallow marshaling of Filter when
'set' is set to false.
This will fix ILM document replication in a site replication
configuration in some cases.
When the prefix field is not provided in the remote source of a yaml
replication job, the code fails to do listing and makes replication
successful. This commit fixes it.
This change adds a consistent nonce to ensure
that multipart uploads are deterministic on a
per-part basis.
Thanks to @klauspost for the work here minio/sio@3cd3734
locks handed by different pools would become non-compete for
multi-object delete request, this is wrong for obvious
reasons.
New locking implementation and revamp will rewrite multi-object
lock anyway, this is a workaround for now.
Currently, the bucket events and replication targets are only reloaded
with buckets that failed to load during the first cluster startup,
which is wrong because if one bucket change was done in one node but
that node was not able to notify other nodes; the other nodes will
reload the bucket metadata config but fails to set the events and bucket
targets in the memory.
when a hung drive is hot-unplugged, the server might go
into a loop where the previous `format.json` is somehow
still accessible to the process, we try to re-init() drives,
but that seems to cause a previous goroutine to hang around
since it is not canceled away when the drive is closed.
Bonus: add deadline for immediate purge routine, to unblock
it if the drive is blocking mutations.
if a user policy is found, avoid reading from the drives
for missing group mappings, group mappings are not mandatory
and conditional.
This PR restores the older behavior while making sure that
if a direct user policy is not found, we would still attempt
to load from the group from the drives.
This commit simplifies and optimizes the decryption of large (multipart)
objects. This PR does two things:
- Re-write the init logic for the decryption reader
- Reduce the number of OEK decryptions
Before, the init logic copied some SSE HTTP request headers to
parse them later. This is simplified to parsing them right away. This
removes some fields from the decryption reader struct.
Further, the decryption reader decrypted the OEK using the client-provided
key (SSE-C) or the KMS (SSE-S3 / SSE-KMS) for each part. This is redundant
since the OEK is the same for all parts. In particular, a KMS call might be a
network request. Now, the OEK is decrypted once for the entire multipart object.
This should improve latency when reading encrypted multipart objects
and reduce requests to the KMS.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
Use Walk(), which is a recursive listing with versioning, to check if
the bucket has some objects before being removed. This is beneficial
because the bucket can contain multiple dangling objects in multiple
drives.
Also, this will prevent a bug where a bucket is deleted in a deployment
that has many erasure sets but the bucket contains one or few objects
not spread to enough erasure sets.
Currently, retry healing of a new drive healing does not reset
HealedBuckets means that the next healing retry will skip those
buckets. The commit will fix this behavior.
Also, the skipped objects counter will include objects uploaded
that are uploaded after the healing is started.
sftp sends local requests to the S3 port while passing the session token
header when the account corresponds to a service account. However, this
is not permitted and will throw an error: "The security token included in the
request is invalid"
This commit will avoid passing the session token to the upper layer that
initializes MinIO client to avoid this error.
Sometimes, we need historical information in .healing.bin, such as the
number of expired objects that the healing avoids to heal and that can
create drive usage disparency in the same erasure set. For that reason,
this commit will not remove .healing.bin anymore and it will have a new
field called Finished so we know healing is finished in that drive.
Services are unfrozen before `initBackgroundReplication` is finished. This means that
the globalReplicationStats write is racy. Switch to an atomic pointer.
Provide the `ReplicationPool` with the stats, so it doesn't have to be grabbed
from the atomic pointer on every use.
All other loads and checks are nil, and calls return empty values when stats
still haven't been initialized.
* Allow a maximum of 10 seconds to start profiling operations.
* Download up to 16 profiles concurrently, but only allow 10 seconds for
each (does not include write time).
* Add cluster info as the first operation.
* Ignore remote download errors.
* Stop remote profiles if the request is terminated.
If the site replication is enabled and the code tries to extract jwt
claims while the site replication service account credentials are still
not loaded yet, the code will enter an infinite loop, causing in a
high CPU usage.
Another possibility of the infinite loop is having some service accounts
created by an old deployment version where the service account JWT was
signed by the root credentials, but not anymore.
This commit will remove the possibility of the infinite loop in the code
and add root credential fallback to extract claims from old service
accounts.
move away from map[string]interface{} to map[string]string
to simplify the audit, and also provide concise information.
avoids large allocations under load(), reduces the amount
of audit information generated, as the current implementation
was a bit free-form. instead all datastructures must be
flattened.
Previously, we checked if we had a quorum on the DataDir value.
We are removing this check, which allows reading objects with different
DataDir values in a few drives (due to a rebalance-stop race bug)
provided their eTags or ModTimes match.
Since a lot of operations load from storage, do remote calls, add a 10 second timeout to each operation.
This should make `mc admin info` return values even under extreme conditions.
- optimize writing part.N.meta by writing both part.N
and its meta in sequence without network component.
- remove part.N.meta, part.N which were partially success
ful, in quorum loss situations during renamePart()
- allow for strict read quorum check arbitrated via ETag
for the given part number, this makes it double safer
upon final commit.
- return an appropriate error when read quorum is missing,
instead of returning InvalidPart{}, which is non-retryable
error. This kind of situation can happen when many
nodes are going offline in rotation, an example of such
a restart() behavior is statefulset updates in k8s.
fixes#20091
during rebalance stop, it can possibly happen that
Put() would race by overwriting the same object again.
This may very well if done "successfully" it can
potentially proceed to delete the object from the pool,
causing data loss.
This PR enhances #20233 to handle more scenarios such
as these.
Rebalance-stop can race with ongoing rebalance operations. This change
prevents these operations from overwriting objects by checking the source
and destination pool indices are different.
This commit replaces the LDAP client TLS config and
adds a custom list of TLS cipher suites which support
RSA key exchange (RSA kex).
Some LDAP server connections experience a significant slowdown
when these cipher suites are not available. The Go TLS stack
disables them by default. (Can be enabled via GODEBUG=tlsrsakex=1).
fixes https://github.com/minio/minio/issues/20214
With a custom list of TLS ciphers, Go can pick the TLS RSA key-exchange
cipher. Ref:
```
if c.CipherSuites != nil {
return c.CipherSuites
}
if tlsrsakex.Value() == "1" {
return defaultCipherSuitesWithRSAKex
}
```
Ref: https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/crypto/tls/common.go;l=1017
Signed-off-by: Andreas Auernhammer <github@aead.dev>
this allows for de-duplicating the callers when called
concurrently, allowing for bucketmetadata reads to be
single call. All concurrent callers will get the same data
as the first one.
Fix a regression in #19733 where TTFB metrics for all APIs except
GetObject were removed in v2 and v3 metrics. This causes breakage for
existing v2 metrics users. Instead we continue to send TTFB for all APIs
in V2 but only send for GetObject in V3.
allow multipart uploads expiration to be dyamic
It would seem like the new values will take effect
only after a restart for changes in multipart_expiration.
This PR fixes this by making it dynamic as it should have
been.
deadlines per moveToTrash() allows for a more granular timeout
approach for syscalls, instead of an aggregate timeout.
This PR also enhances multipart state cleanup to be optimal by
removing 100's of multipart network rename() calls into single
network call.
context deadline was introduced to avoid a slow transfer from blocking
replication queue(s) shared by other buckets that may not be under throttling.
This PR removes this context deadline for larger objects since they are
anyway restricted to a limited set of workers. Otherwise, objects would
get dequeued when the throttle limit is exceeded and cannot proceed
within the deadline.
epoll contention on TCP causes latency build-up when
we have high volume ingress. This PR is an attempt to
relieve this pressure.
upstream issue https://github.com/golang/go/issues/65064
It seems to be a deeper problem; haven't yet tried the fix
provide in this issue, but however this change without
changing the compiler helps.
Of course, this is a workaround for now, hoping for a
more comprehensive fix from Go runtime.
- ReadVersion
- ReadFile
- ReadXL
Further changes include to
- Compact internode resource RPC paths
- Compact internode query params
To optimize on parsing by gorilla/mux as the
length of this string increases latency in
gorilla/mux - reduce to a meaningful string.
the main reason is to let Go net/http perform necessary
book keeping properly, and in essential from consistency
point of view its GETs all the way.
Deprecate sendFile() as its buggy inside Go runtime.
For a non-tiered object, MinIO requires that EcM (# of data blocks) of
xl.meta agree, corresponding to the number of data blocks needed to
read this object.
OTOH, tiered objects have metadata in the hot tier and data in the
warm tier. The data and its integrity are offloaded to the warm tier. This
allows us to reduce the read quorum from EcM (typically > N/2, where N -
erasure stripe width) to N/2 + 1. The simple majority of metadata
ensures consensus on what the object is and where it is
located.
When a drive is in a failed state when a single node multiple drives
deployment is started, a replacement of a fresh disk will not be
properly healed unless the user restarts the node.
Fix this by always adding the new fresh disk to globalLocalDrivesMap. Also
remove globalLocalDrives for simplification, a map to store local node
drives can still be used since the order of local drives of a node is
not defined.
kms: Expose API available when bucket federation is enabled
When bucket federation feature is enabled, KMS API will not work, such
as `mc admin kms key list`
The commit will fix the issue by disabling bucket forwarding when this
is a KMS request.
Tracing syscalls, opening and reading an `xl.meta` looks like this:
```
openat(AT_FDCWD, "/mnt/drive1/ss8-old/testbucket/ObjSize4MiBThreads72/(554O51H/peTb(0iztdbTKw59.csv/xl.meta", O_RDONLY|O_NOATIME|O_CLOEXEC) = 34 <0.000>
fcntl(34, F_GETFL) = 0x48000 (flags O_RDONLY|O_LARGEFILE|O_NOATIME) <0.000>
fcntl(34, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_NOATIME) = 0 <0.000>
epoll_ctl(4, EPOLL_CTL_ADD, 34, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3172471557, u64=8145488475984499461}}) = -1 EPERM (Operation not permitted) <0.000>
fcntl(34, F_GETFL) = 0x48800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_NOATIME) <0.000>
fcntl(34, F_SETFL, O_RDONLY|O_LARGEFILE|O_NOATIME) = 0 <0.000>
fstat(34, {st_mode=S_IFREG|0644, st_size=354, ...}) = 0 <0.000>
read(34, "XL2 \1\0\3\0\306\0\0\1P\2\2\1\304$\225\304\20\0\0\0\0\0\0\0\0\0\0\0"..., 354) = 354 <0.000>
close(34) = 0 <0.000>
```
Everything until `fstat` is the `os.Open` call.
Looking at the code: https://github.com/golang/go/blob/master/src/os/file_unix.go#L212-L243
It seems for every file it "tries" to see if it is pollable. This causes `syscall.SetNonblock(fd, true)` to be called. This is the first `F_SETFL`.
It then calls `f.pfd.Init("file", true)`. This will attempt to set it as pollable using `epoll_ctl`. This will always fail for files. It therefore calls `syscall.SetNonblock(fd, false)` resulting in the second `F_SETFL`.
If we set the `O_NONBLOCK` call on the initial open, we should avoid the 4 `fcntl` syscalls per file.
I don't see any way to avoid the `epoll_ctl` call, since kind is either `kindOpenFile` or `kindNonBlock`, so "pollable" will always be true. However avoiding 4 of 6 syscalls still seems worth it.
This should not have any effect, since files will end up with "nonblock" anyway.
allow non-inlined on disk to be inlined via
an unversioned ReadVersion() call, we only
need ReadXL() to resolve objects with multiple
versions only.
The choice of this block makes it to be dynamic
and chosen by the user via `mc admin config set`
Other bonus things
- Start measuring internode TTFB performance.
- Set TCP_NODELAY, TCP_CORK for low latency
Use `runtime.Gosched()` if we have less than maxMergeMessages and the
queue is empty. Up maxMergeMessages to 50 to merge more messages into
a single write.
Add length check for an early bailout on readAllInto when we know packet length.
This commit enforces FIPS-compliant TLS ciphers in FIPS mode
by importing the `fipsonly` module.
Otherwise, MinIO still accepts non-FIPS compliant TLS connections.
removes contentious usage of mutexes in LRU, which
were never really reused in any manner; we do not
need it.
To trust hosts, the correct way is TLS certs; this PR completely
removes this dependency, which has never been useful.
```
0 0% 100% 25.83s 26.76% github.com/hashicorp/golang-lru/v2/expirable.(*LRU[...])
0 0% 100% 28.03s 29.04% github.com/hashicorp/golang-lru/v2/expirable.(*LRU[...])
```
Bonus: use `x-minio-time` as a nanosecond to avoid unnecessary
parsing logic of time strings instead of using a more
straightforward mechanism.
- Also, fix failure reporting at the end.
- Also, avoid parsing report objects when listing or resuming jobs, this
does not cause any bugs, it is only printing, not useful errors.
Split the read and write sides of handleMessages into two separate functions
Cosmetic. The only non-copy-and-paste change is that `cancel(ErrDisconnected)` is moved
into the defer on `readStream`.
add unit test for v3 metrics for all its exposed endpoints
Bonus:
- support OpenMetrics encoding
- adds boot time for prometheus
- continueOnError is better to serve as
much metrics as possible.
```
curl http://localhost:9000/minio/metrics/v3/cluster/usage/buckets
```
Did not work as documented, due to the fact that there was a typo
in the bucket usage metrics registration group. This endpoint is
a cluster endpoint and does not require any `buckets` argument.
When creating the async listing, if the first request does not return within 3
minutes, it is stopped, since it isn't being kept alive.
Keep updating `lastHandout` while we are waiting for the initial request to be fulfilled.
When a fresh drive healing is finished, add more checks for the drive listing
errors. If any, re-list and heal again. Although this is an infrequent use
case to have listPathRaw() returning nil when minDisks is set to 1, we
still need to handle all possible use cases to avoid missing healing
any object.
Also, check for HealObject result to decide of an object is healed in the
fresh disk since HealObject returns nil if an object is healed in any
disk, and not in the new fresh drive.
In regular listing, this commit will avoid showing an object when its
latest version has a pending or failed deletion. In replicated setup.
It will also prevent showing older versions in the same case.
Also, sort the error map for multiple sites in ascending order
of deployment IDs, so that the error message generated is always
definitive order and same.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Add `ConnDialer` to abstract connection creation.
- `IncomingConn(ctx context.Context, conn net.Conn)` is provided as an entry point for
incoming custom connections.
- `ConnectWS` is provided to create web socket connections.
If `SkipReader` is called with a small initial buffer it may be doing a huge number if Reads to skip the requested number of bytes. If a small buffer is provided grab a 32K buffer and use that.
Fixes slow execution of `testAPIGetObjectWithMPHandler`.
Bonuses:
* Use `-short` with `-race` test.
* Do all suite test types with `-short`.
* Enable compressed+encrypted in `testAPIGetObjectWithMPHandler`.
* Disable big file tests in `testAPIGetObjectWithMPHandler` when using `-short`.
The code was advertenly passing max openfds to debug.SetMemoryLimit(),
fixing this accelerate go test in my machine.
This is only a testing bug, since the server context has always a valid
MaxMem, so the buggy code was never called in users environments.
Currently the status of a completed or failed batch is held in the
memory, a simple restart will lose the status and the user will not
have any visibility of the job that was long running.
In addition to the metrics, add a new API that reads the batch status
from the drives. A batch job will be cleaned up three days after
completion.
Also add the batch type in the batch id, the reason is that the batch
job request is removed immediately when the job is finished, then we
do not know the type of batch job anymore, hence a difficulty to locate
the job report
Hot load a policy document when during account authorization evaluation
to avoid returning 403 during server startup, when not all policies are
already loaded.
Add this support for group policies as well.
you can attempt a rebalance first i.e, start with 2 pools.
```
mc admin rebalance start alias/
```
and after that you can add a new pool, this would
potentially crash.
```
Jun 27 09:22:19 xxx minio[7828]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 27 09:22:19 xxx minio[7828]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x22cc225]
Jun 27 09:22:19 xxx minio[7828]: goroutine 1 [running]:
Jun 27 09:22:19 xxx minio[7828]: github.com/minio/minio/cmd.(*erasureServerPools).findIndex(...)
```
without atomic load() it is possible that for
a slow receiver we would get into a hot-loop, when
logCh is full and there are many incoming callers.
to avoid this as a workaround enable BATCH_SIZE
greater than 100 to ensure that your slow receiver
receives data in bulk to avoid being throttled in
some manner.
this PR however fixes the unprotected access to
the current workers value.
specifying customer HTTP client makes the gcs SDK
ignore the passed credentials, instead let the GCS
SDK manage the transport.
this PR fixes#19922 a regression from #19565
Currently, bucket metadata is being loaded serially inside ListBuckets
Objet API. Fix that by loading the bucket metadata as the number of
erasure sets * 10, which is a good approximation.
When a reconnection happens, `handleMessages` must be able to complete and exit.
This can be prevented in a full queue.
Deadlock chain (May 10th release)
```
1 @ 0x44110e 0x453125 0x109f88c 0x109f7d5 0x10a472c 0x10a3f72 0x10a34ed 0x4795e1
# 0x109f88b github.com/minio/minio/internal/grid.(*Connection).send+0x3eb github.com/minio/minio/internal/grid/connection.go:548
# 0x109f7d4 github.com/minio/minio/internal/grid.(*Connection).queueMsg+0x334 github.com/minio/minio/internal/grid/connection.go:586
# 0x10a472b github.com/minio/minio/internal/grid.(*Connection).handleAckMux+0xab github.com/minio/minio/internal/grid/connection.go:1284
# 0x10a3f71 github.com/minio/minio/internal/grid.(*Connection).handleMsg+0x231 github.com/minio/minio/internal/grid/connection.go:1211
# 0x10a34ec github.com/minio/minio/internal/grid.(*Connection).handleMessages.func1+0x6cc github.com/minio/minio/internal/grid/connection.go:1019
---> blocks ---> via (Connection).handleMsgWg
1 @ 0x44110e 0x454165 0x454134 0x475325 0x486b08 0x10a161a 0x10a1465 0x2470e67 0x7395a9 0x20e61af 0x20e5f1f 0x7395a9 0x22f781c 0x7395a9 0x22f89a5 0x7395a9 0x22f6e82 0x7395a9 0x22f49a2 0x7395a9 0x2206e45 0x7395a9 0x22f4d9c 0x7395a9 0x210ba06 0x7395a9 0x23089c2 0x7395a9 0x22f86e9 0x7395a9 0xd42582 0x2106c04
# 0x475324 sync.runtime_Semacquire+0x24 runtime/sema.go:62
# 0x486b07 sync.(*WaitGroup).Wait+0x47 sync/waitgroup.go:116
# 0x10a1619 github.com/minio/minio/internal/grid.(*Connection).reconnected+0xb9 github.com/minio/minio/internal/grid/connection.go:857
# 0x10a1464 github.com/minio/minio/internal/grid.(*Connection).handleIncoming+0x384 github.com/minio/minio/internal/grid/connection.go:825
```
Add a queue cleaner in reconnected that will pop old messages so `handleMessages` can
send messages without blocking and exit appropriately for the connection to be re-established.
Messages are likely dropped by the remote, but we may have some that can succeed,
so we only drop when running out of space.
Add the inlined data as base64 encoded field and try to add a string version if feasible.
Example:
```
λ xl-meta -data xl.meta
{
"8e03504e-1123-4957-b272-7bc53eda0d55": {
"bitrot_valid": true,
"bytes": 58,
"data_base64": "Z29sYW5nLm9yZy94L3N5cyB2MC4xNS4wIC8=",
"data_string": "golang.org/x/sys v0.15.0 /"
}
```
The string will have quotes, newlines escaped to produce valid JSON.
If content isn't valid utf8 or the encoding otherwise fails, only the base64 data will be added.
`-export` can still be used separately to extract the data as files (including bitrot).
* Prevents blocking when losing quorum (standard on cluster restarts).
* Time out to prevent endless buildup. Timed-out remote locks will be canceled because they miss the refresh anyway.
* Reduces latency for all calls since the wall time for the roundtrip to remotes no longer adds to the requests.
since the connection is active, the
response recorder body can grow endlessly
causing leak, as this bytes buffer is
never given back to GC due to an goroutine.
skip healing properly in scanner when drive is hotplugged
due to how the state is passed around the SkipHealing
might not be the true state() of the system always, causing
a situation where we might healing from the scanner on the
same drive which is being. Due to this competing heals get
triggered that slow each other down.
due to a historic bug in CopyObject() where
an inlined object loses its metadata, the
check causes an incorrect fallback verifying
data-dir.
CopyObject() bug was fixed in ffa91f9794 however
the occurrence of this problem is historic, so
the aforementioned check is stretching too much.
Bonus: simplify fileInfoRaw() to read xl.json as well,
also recreate buckets properly.
* Multipart SSEC checksums were not transferred.
* Remove key mismatch logging. This key is user-controlled with SSEC.
* If the source is SSEC and the destination reports ErrSSEEncryptedObject,
assume replication is good.
avoid concurrent callers for LoadUser() to even initiate
object read() requests, if an on-going operation is in progress.
this avoids many callers hitting the drives causing I/O
spikes, also allows for loading credentials faster.
This commit fixes an issue in the KES client configuration
that can cause the following error when connecting to KES:
```
ERROR Failed to connect to KMS: failed to generate data key with KMS key: tls: client certificate is required
```
The Go TLS stack seems to not send a client certificate if it
thinks the client certificate cannot be validated by the peer.
In case of an API key, we don't care about this since we use
public key pinning and the X.509 certificate is just a transport
encoding.
The `GetClientCertificate` seems to be honored always such that
this error does not occur.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
the reason for this is to avoid STS mappings to be
purged without a successful load of other policies,
and all the credentials only loaded successfully
are properly handled.
This also avoids unnecessary cache store which was
implemented earlier for optimization.
Directory objects are used by applications that simulate the folder
structure of an on-disk filesystem. These are zero-byte objects with names
ending with '/'. They are only used to check whether a 'folder' exists in
the namespace.
StartSize starts with the raw free space of all disks in the given pool,
however during the status, CurrentSize is not showing the current free
raw space, as expected at least by `mc admin decom status` since it was
written.
Go's net/http is notoriously difficult to have a streaming
deadlines per READ/WRITE on the net.Conn if we add them they
interfere with the Go's internal requirements for a HTTP
connection.
Remove this support for now
fixes#19853
Add partial shard reconstruction
* Add partial shard reconstruction
* Fix padding causing the last shard to be rejected
* Add md5 checks on single parts
* Move md5 verified to `verified/filename.ext`
* Move complete (without md5) to `complete/filename.ext.partno`
It's not pretty, but at least now the md5 gives some confidence it works correctly.
In the very rare case when all drives in a erasure set need to be healed,
remove .healing.bin from all drives, otherwise it will be stuck in a
loop
Also, fix a unit test that fails sometimes due to wrong test.
since #19688 there was a regression introduced during drive
lookups for single node multi-drive setups, drive replacement
would not work correctly without this PR.
This does not fix any current issue, but merging https://github.com/minio/madmin-go/pull/282
can lose the validation of the service account expiration time.
Add more defensive code for now. In the future, we should avoid doing
validation in another library.
precondition check was being honored before, validating
if anonymous access is allowed on the metadata of an
object, leading to metadata disclosure of the following
headers.
```
Last-Modified
Etag
x-amz-version-id
Expires:
Cache-Control:
```
although the information presented is minimal in nature,
and of opaque nature. It still simply discloses that an
object by a specific name exists or not without even having
enough permissions.
fix: authenticate LDAP via actual DN instead of normalized DN
Normalized DN is only for internal representation, not for
external communication, any communication to LDAP must be
based on actual user DN. LDAP servers do not understand
normalized DN.
fixes#19757
This change uses the updated ldap library in minio/pkg (bumped
up to v3). A new config parameter is added for LDAP configuration to
specify extra user attributes to load from the LDAP server and to store
them as additional claims for the user.
A test is added in sts_handlers.go that shows how to access the LDAP
attributes as a claim.
This is in preparation for adding SSH pubkey authentication to MinIO's SFTP
integration.
Add combination of multiple parts.
Parts will be reconstructed and saved separately and can manually be combined to the complete object.
Parts will be named `(version_id)-(filename).(partnum).(in)complete`.
Do not log errors on oneway streams when sending ping fails. Instead, cancel the stream.
This also makes sure pings are sent when blocked on sending responses.
This commit will fix one rare case of a multipart object that
can be read in theory but GetObject API returned an error.
It turned out that a six years old code was marking a drive offline
when the bitrot streaming fails to read a part in a disk with any error.
This can affect reading a subsequent part, though having enough shards,
but unable to construct because one drive was marked offline earlier.
This commit will remove the drive marking offline code. It will also
close the bitrotstreaming reader before marking it as nil.
Adds `-xver` which can be used with `-export` and `-combine` to attempt to combine files across versions if data is suspected to be the same. Overlapping data is compared.
Bonus: Make `inspect` accept wildcards.
Currently, on enabling callhome (or restarting the server), the callhome
job gets scheduled. This means that one has to wait for 24hrs (the
default frequency duration) to see it in action and to figure out if it
is working as expected.
It will be a better user experience to perform the first callhome
execution immediately after enabling it (or on server start if already
enabled).
Also, generate audit event on callhome execution, setting the error
field in case the execution has failed.
* Store ModTime in the upload ID; return it when listing instead of the current time.
* Use this ModTime to expire and skip reading the file info.
* Consistent upload sorting in listing (since it now has the ModTime).
* Exclude healing disks to avoid returning an empty list.
```
==================
WARNING: DATA RACE
Read at 0x0000082be990 by goroutine 205:
github.com/minio/minio/cmd.setCommonHeaders()
Previous write at 0x0000082be990 by main goroutine:
github.com/minio/minio/cmd.lookupConfigs()
```
Recent Veeam is very picky about storage class names. Add `_MINIO_VEEAM_FORCE_SC` env var.
It will override the storage class returned by the storage backend if it is non-standard
and we detect a Veeam client by checking the User Agent.
Applies to HeadObject/GetObject/ListObject*
add deadlines that can be dynamically changed via
the drive max timeout values.
Bonus: optimize "file not found" case and hung drives/network - circuit break the check and return right
away instead of waiting.
Do not log errors on oneway streams when sending ping fails. Instead cancel the stream.
This also makes sure pings are sent when blocked on sending responses.
I will do a separate PR that includes this and adds pings to two-way streams as well as tests for pings.
as that is the only API where the TTFB metric is beneficial, and
capturing this for all APIs exponentially increases the response size in
large clusters.
Replace the `io.Pipe` from streamingBitrotWriter -> CreateFile with a fixed size ring buffer.
This will add an output buffer for encoded shards to be written to disk - potentially via RPC.
This will remove blocking when `(*streamingBitrotWriter).Write` is called, and it writes hashes and data.
With current settings, the write looks like this:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ Parr. │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Pipe │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (unbuffered) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
We write a Hash (32 bytes). Since the pipe is unbuffered, it will block until the 32 bytes have
been delivered to the TCP buffer, and the next Read hits the Pipe.
Then we write the shard data. This will typically be bigger than 64KB, so it will block until two blocks
have been read from the pipe.
When we insert a ring buffer:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Ring Buffer │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (2MB) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
The hash+shard will fit within the ring buffer, so writes will not block - but will complete after a
memcopy. Reads can fill the 64KB buffer if there is data for it.
If the network is congested, the ring buffer will become filled, and all syscalls will be on full buffers.
Only when the ring buffer is filled will erasure coding start blocking.
Since there is always "space" to write output data, we remove the parallel writing since we are
always writing to memory now, and the goroutine synchronization overhead probably not worth taking.
If the output were blocked in the existing, we would still wait for it to unblock in parallel write, so it would
make no difference there - except now the ring buffer smoothes out the load.
There are some micro-optimizations we could look at later. The biggest is that, in most cases,
we could encode directly to the ring buffer - if we are not at a boundary. Also, "force filling" the
Read requests (i.e., blocking until a full read can be completed) could be investigated and maybe
allow concurrent memory on read and write.
Metrics being added:
- read_tolerance: No of drive failures that can be tolerated without
disrupting read operations
- write_tolerance: No of drive failures that can be tolerated without
disrupting write operations
- read_health: Health of the erasure set in a pool for read operations
(1=healthy, 0=unhealthy)
- write_health: Health of the erasure set in a pool for write operations
(1=healthy, 0=unhealthy)
Adds regression test for #19699
Failures are a bit luck based, since it requires objects to be placed on different sets.
However this generates a failure prior to #19699
* Revert "Revert "Fix incorrect merging of slash-suffixed objects (#19699)""
This reverts commit f30417d9a8.
* Don't override when suffix doesn't match. Instead rely on quorum for each.
Instead of having "online" and "healing" as two metrics, replace with a
single metric "health" which can have following values:
0 = offline
1 = healthy
2 = healing
If two objects share everything but one object has a slash prefix, those would be merged in listings,
with secondary properties used for a tiebreak.
Example: An object with the key `prefix/obj` would be merged with an object named `prefix/obj/`.
While this violates the [no object can be a prefix of another](https://min.io/docs/minio/linux/operations/concepts/thresholds.html#conflicting-objects), let's resolve these.
If we have an object with 'name' and a directory named 'name/' discard the directory only - but allow objects
of 'name' and 'name/' (xldir) to be uniquely returned.
Regression from #15772
canceled callers might linger around longer,
can potentially overwhelm the system. Instead
provider a caller context and canceled callers
don't hold on to them.
Bonus: we have no reason to cache errors, we should
never cache errors otherwise we can potentially have
quorum errors creeping in unexpectedly. We should
let the cache when invalidating hit the actual resources
instead.
LastPong is saved as nanoseconds after a connection or reconnection but
saved as seconds when receiving a pong message. The code deciding if
a pong is too old can be skewed since it assumes LastPong is only in
seconds.
Accept multipart uploads where the combined checksum provides the expected part count.
It seems this was added by AWS to make the API more consistent, even if the
data is entirely superfluous on multiple levels.
Improves AWS S3 compatibility.
This commit adds support for MinKMS. Now, there are three KMS
implementations in `internal/kms`: Builtin, MinIO KES and MinIO KMS.
Adding another KMS integration required some cleanup. In particular:
- Various KMS APIs that haven't been and are not used have been
removed. A lot of the code was broken anyway.
- Metrics are now monitored by the `kms.KMS` itself. For basic
metrics this is simpler than collecting metrics for external
servers. In particular, each KES server returns its own metrics
and no cluster-level view.
- The builtin KMS now uses the same en/decryption implemented by
MinKMS and KES. It still supports decryption of the previous
ciphertext format. It's backwards compatible.
- Data encryption keys now include a master key version since MinKMS
supports multiple versions (~4 billion in total and 10000 concurrent)
per key name.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
If used, 'opts.Marker` will cause many missed entries since results are returned
unsorted, and pools are serialized.
Switch to fully concurrent listing and merging across pools to return sorted entries.
It is expected that whoever is using the credentials which has
the proper set of permissions must be able to run.
`mc support perf object`
While the root login is disabled.
fixes#19648
AWS S3 returns the actual object size as part of XML
response for InvalidRange error, this is used apparently
by SDKs to retry the request without the range.
'opts.Marker` is causing many missed entries if used since results are returned unsorted. Also since pools are serialized.
Switch to do fully concurrent listing and merging across pools to return sorted entries.
Returning errors on listings is impossible with the current API, so document that.
Return an error at once if no drives are found instead of just returning an empty listing and no error.
This is to support deployments migrating from a multi-pooled
wider stripe to lower stripe. MINIO_STORAGE_CLASS_STANDARD
is still expected to be same for all pools. So you can satisfy
adding custom drive count based pools by adjusting the storage
class value.
```
version: v2
address: ':9000'
rootUser: 'minioadmin'
rootPassword: 'minioadmin'
console-address: ':9001'
pools: # Specify the nodes and drives with pools
-
args:
- 'node{11...14}.example.net/data{1...4}'
-
args:
- 'node{15...18}.example.net/data{1...4}'
-
args:
- 'node{19...22}.example.net/data{1...4}'
-
args:
- 'node{23...34}.example.net/data{1...10}'
set-drive-count: 6
```
ILM actions due to ExpiredObjectDeleteAllVersions and
DelMarkerExpiration are ignored when object locking is enabled on a
bucket.
Note: This applies to object versions which may not have retention
configured on them. This applies to all object versions in this bucket,
including those created before the retention config was applied.
Per-bucket metrics endpoints always start with /bucket and the bucket
name is appended to the path. e.g. if the collector path is /bucket/api,
the endpoint for the bucket "mybucket" would be
/minio/metrics/v3/bucket/api/mybucket
Change the existing bucket api endpoint accordingly from /api/bucket to
/bucket/api
The `Token` parameter is a sensitive value that should not be output in the Audit log for STS AssumeRoleWithCustomToken API.
Bonus: Add a simple tool that echoes audit logs to the console.
When listing, with drives returning `errFileNotFound,` `errVolumeNotFound`, or `errUnformattedDisk,`,
we could get below `minDisks` drives being left.
This would result in a quorum never being reachable for any object. Therefore, the listing
would continue, but no results would ever be produced.
Include `fnf` in the mindisk check since it is incremented on these errors. This will stop
listing when minDisks are left.
Allow `opts.minDisks` to not return errVolumeNotFound or errFileNotFound and return that.
That will allow for good results even if disks return something else.
We switch `errUnformattedDisk` to a regular error. If we have enough of those, we should just fail.
Typically not all drives are connected, so we delay 3 minutes before resuming.
This greatly reduces risk of starting to list unconnected drives, or drives we risk being disconnected soon.
This delay is not applied when starting with an admin call.
ConsoleUI like applications rely on combination of
ListServiceAccounts() and InfoServiceAccount() to populate
UI elements, however individually these calls can be slow
causing the entire UI to load sluggishly.
i.e., this rule element doesn't apply to DEL markers.
This is a breaking change to how ExpiredObejctDeleteAllVersions
functions today. This is necessary to avoid the following highly probable
footgun scenario in the future.
Scenario:
The user uses tags-based filtering to select an object's time to live(TTL).
The application sometimes deletes objects, too, making its latest
version a DEL marker. The previous implementation skipped tag-based filters
if the newest version was DEL marker, voiding the tag-based TTL. The user is
surprised to find objects that have expired sooner than expected.
* Add DelMarkerExpiration action
This ILM action removes all versions of an object if its
the latest version is a DEL marker.
```xml
<DelMarkerObjectExpiration>
<Days> 10 </Days>
</DelMarkerObjectExpiration>
```
1. Applies only to objects whose,
• The latest version is a DEL marker.
• satisfies the number of days criteria
2. Deletes all versions of this object
3. Associated rule can't have tag-based filtering
Includes,
- New bucket event type for deletion due to DelMarkerExpiration
calling a remote target remove with a perfectly
well constructed ARN can lead to a crash for a bucket
with no replication configured.
This PR fixes, and adds a crash check for ImportMetadata
as well.
Algorithms are comma separated.
Note that valid values does not in all cases represent default values.
`--sftp=pub-key-algos=...` specifies the supported client public key
authentication algorithms. Note that this doesn't include certificate types
since those use the underlying algorithm. This list is sent to the client if
it supports the server-sig-algs extension. Order is irrelevant.
Valid values
```
ssh-ed25519
sk-ssh-ed25519@openssh.comsk-ecdsa-sha2-nistp256@openssh.com
ecdsa-sha2-nistp256
ecdsa-sha2-nistp384
ecdsa-sha2-nistp521
rsa-sha2-256
rsa-sha2-512
ssh-rsa
ssh-dss
```
`--sftp=kex-algos=...` specifies the supported key-exchange algorithms in preference order.
Valid values:
```
curve25519-sha256
curve25519-sha256@libssh.org
ecdh-sha2-nistp256
ecdh-sha2-nistp384
ecdh-sha2-nistp521
diffie-hellman-group14-sha256
diffie-hellman-group16-sha512
diffie-hellman-group14-sha1
diffie-hellman-group1-sha1
```
`--sftp=cipher-algos=...` specifies the allowed cipher algorithms.
If unspecified then a sensible default is used.
Valid values:
```
aes128-ctr
aes192-ctr
aes256-ctr
aes128-gcm@openssh.comaes256-gcm@openssh.comchacha20-poly1305@openssh.com
arcfour256
arcfour128
arcfour
aes128-cbc
3des-cbc
```
`--sftp=mac-algos=...` specifies a default set of MAC algorithms in preference order.
This is based on RFC 4253, section 6.4, but with hmac-md5 variants removed because they have
reached the end of their useful life.
Valid values:
```
hmac-sha2-256-etm@openssh.comhmac-sha2-512-etm@openssh.com
hmac-sha2-256
hmac-sha2-512
hmac-sha1
hmac-sha1-96
```
This would reduce the size of data in response of metrics
listing. While graphing we can default these metrics with
a zero value if not found.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Unfreeze as soon as the incoming connection is terminated and don't wait for everything to complete.
We don't want to keep the services frozen if something becomes stuck.
- handle errFileCorrupt properly
- micro-optimization of sending done() response quicker
to close the goroutine.
- fix logger.Event() usage in a couple of places
- handle the rest of the client to return a different error other than
lastErr() when the client is closed.
listPathRaw() counts errDiskNotFound as a valid error to indicate a
listing stream end. However, storage.WalkDir() is allowed to return
errDiskNotFound anytime since grid.ErrDisconnected is converted to
errDiskNotFound.
This affects fresh disk healing and should affect S3 listing as well.
endpoint: /minio/metrics/v3/system/process
metrics:
- locks_read_total
- locks_write_total
- cpu_total_seconds
- go_routine_total
- io_rchar_bytes
- io_read_bytes
- io_wchar_bytes
- io_write_bytes
- start_time_seconds
- uptime_seconds
- file_descriptor_limit_total
- file_descriptor_open_total
- syscall_read_total
- syscall_write_total
- resident_memory_bytes
- virtual_memory_bytes
- virtual_memory_max_bytes
Since the standard process collector implements only a subset of these
metrics, remove it and implement our own custom process collector that
captures all the process metrics we need.
Since the object is being permanently deleted, the lack of read quorum should not
matter as long as sufficient disks are online to complete the deletion with parity
requirements.
If several pools have the same object with insufficient read quorum, attempt to
delete object from all the pools where it exists
At server startup, LDAP configuration is validated against the LDAP
server. If the LDAP server is down at that point, we need to cleanly
disable LDAP configuration. Previously, LDAP would remain configured but
error out in strange ways because initialization did not complete
without errors.
When importing access keys (i.e. service accounts) for LDAP accounts,
we are requiring groups to exist under one of the configured group base
DNs. This is not correct. This change fixes this by only checking for
existence and storing the normalized form of the group DN - we do not
return an error if the group is not under a base DN.
Test is updated to illustrate an import failure that would happen
without this change.
Existing IAM import logic for LDAP creates new mappings when the
normalized form of the mapping key differs from the existing mapping key
in storage. This change effectively replaces the existing mapping key by
first deleting it and then recreating with the normalized form of the
mapping key.
For e.g. if an older deployment had a policy mapped to a user DN -
`UID=alice1,OU=people,OU=hwengg,DC=min,DC=io`
instead of adding a mapping for the normalized form -
`uid=alice1,ou=people,ou=hwengg,dc=min,dc=io`
we should replace the existing mapping.
This ensures that duplicates mappings won't remain after the import.
Some additional cleanup cases are also covered. If there are multiple
mappings for the name normalized key such as:
`UID=alice1,OU=people,OU=hwengg,DC=min,DC=io`
`uid=alice1,ou=people,ou=hwengg,DC=min,DC=io`
`uid=alice1,ou=people,ou=hwengg,dc=min,dc=io`
we check if the list of policies mapped to all these keys are exactly
the same, and if so remove all of them and create a single mapping with
the normalized key. However, if the policies mapped to such keys differ,
the import operation returns an error as the server cannot automatically
pick the "right" list of policies to map.
When inspecting files like `.minio.sys/pool.bin` that may be present on multiple sets, use signature to separate them.
Also fixes null versions to actually be useful with `-export -combine`.
`minio_cluster_webhook_queue_length` was wrongly defined as `counter`
where-as it should be `gauge`
Following were wrongly defined as `gauge` when they should actually be
`counter`:
- minio_bucket_replication_sent_bytes
- minio_bucket_replication_received_bytes
- minio_bucket_replication_total_failed_bytes
- minio_bucket_replication_total_failed_count
When LDAP is enabled, previously we were:
- rejecting creation of users and groups via the IAM import functionality
- throwing a `not a valid DN` error when non-LDAP group mappings are present
This change allows for these cases as we need to support situations
where the MinIO server contains users, groups and policy mappings
created before LDAP was enabled.
instead upon any error in renameData(), we still
preserve the existing dataDir in some form for
recoverability in strange situations such as out
of disk space type errors.
Bonus: avoid running list and heal() instead allow
versions disparity to return the actual versions,
uuid to heal. Currently limit this to 100 versions
and lesser disparate objects.
an undo now reverts back the xl.meta from xl.meta.bkp
during overwrites on such flaky setups.
Bonus: Save N depth syscalls via skipping the parents
upon overwrites and versioned updates.
Flaky setup examples are stretch clusters with regular
packet drops etc, we need to add some defensive code
around to avoid dangling objects.
RenameData could start operating on inline data after timing out
and the call returned due to WithDeadline.
This could cause a buffer to write to the inline data being written.
Since no writes are in `RenameData` and the call is canceled,
this doesn't present a corruption issue. But a race is a race and
should be fixed.
Copy inline data to a fresh buffer.
This PR makes a feasible approach to handle all the scenarios
that we must face to avoid returning "panic."
Instead, we must return "errServerNotInitialized" when a
bucketMetadataSys.Get() is called, allowing the caller to
retry their operation and wait.
Bonus fix the way data-usage-cache stores the object.
Instead of storing usage-cache.bin with the bucket as
`.minio.sys/buckets`, the `buckets` must be relative
to the bucket `.minio.sys` as part of the object name.
Otherwise, there is no way to decommission entries at
`.minio.sys/buckets` and their final erasure set positions.
A bucket must never have a `/` in it. Adds code to read()
from existing data-usage.bin upon upgrade.
This PR fixes a few things
- FIPS support for missing for remote transports, causing
MinIO could end up using non-FIPS Ciphers in FIPS mode
- Avoids too many transports, they all do the same thing
to make connection pooling work properly re-use them.
- globalTCPOptions must be set before setting transport
to make sure the client conn deadlines are honored properly.
- GCS warm tier must re-use our transport
- Re-enable trailing headers support.
This reverts commit 928c0181bf.
This change was not correct, reverting.
We track 3 states with the ProxyRequest header - if replication process wants
to know if object is already replicated with a HEAD, it shouldn't proxy back
- Poorna
AWS S3 trailing header support was recently enabled on the warm tier
client connection to MinIO type remote tiers. With this enabled, we are
seeing the following error message at http transport layer.
> Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 400 Bad Request\r\nContent-Type: text/plain; charset=utf-8\r\nConnection: close\r\n\r\n400 Bad Request"; err=<nil>
This is an interim fix until we identify the root cause for this behaviour in the
minio-go client package.
Keep the EC in header, so it can be retrieved easily for dynamic quorum calculations.
To not force a full metadata decode on every read the value will be 0/0 for data written in previous versions.
Size is expected to increase by 2 bytes per version, since all valid values can be represented with 1 byte each.
Example:
```
λ xl-meta xl.meta
{
"Versions": [
{
"Header": {
"EcM": 4,
"EcN": 8,
"Flags": 6,
"ModTime": "2024-04-17T11:46:25.325613+02:00",
"Signature": "0a409875",
"Type": 1,
"VersionID": "8e03504e11234957b2727bc53eda0d55"
},
...
```
Not used for operations yet.
Follow up for #19528
If there are multiple existing DN mappings for the same normalized DN,
if they all have the same policy mapping value, we pick one of them of
them instead of returning an import error.
This is a change to IAM export/import functionality. For LDAP enabled
setups, it performs additional validations:
- for policy mappings on LDAP users and groups, it ensures that the
corresponding user or group DN exists and if so uses a normalized form
of these DNs for storage
- for access keys (service accounts), it updates (i.e. validates
existence and normalizes) the internally stored parent user DN and group
DNs.
This allows for a migration path for setups in which LDAP mappings have
been stored in previous versions of the server, where the name of the
mapping file stored on drives is not in a normalized form.
An administrator needs to execute:
`mc admin iam export ALIAS`
followed by
`mc admin iam import ALIAS /path/to/export/file`
The validations are more strict and returns errors when multiple
mappings are found for the same user/group DN. This is to ensure the
mappings stored by the server are unambiguous and to reduce the
potential for confusion.
Bonus **bug fix**: IAM export of access keys (service accounts) did not
export key name, description and expiration. This is fixed in this
change too.
Reading the list metacache is not protected by a lock; the code retries when it fails
to read the metacache object, however, it forgot to re-read the metacache object
from the drives, which is necessary, especially if the metacache object is inlined.
This commit will ensure that we always re-read the metacache object from the drives
when it is retrying.
When resuming a versioned listing where `version-id-marker=null`, the `null` object would
always be returned, causing duplicate entries to be returned.
Add check against empty version
unlinking() at two different locations on a disk when there
are lots to purge, this can lead to huge IOwaits, instead
rely on rename() to .trash to avoid running multiple unlinks()
in parallel.
since mid 2018 we do not have any deployments
without deployment-id, it is time to put this
code to rest, this PR removes this old code as
its no longer valuable.
on setups with 1000's of drives these are all
quite expensive operations.
The rest of the peer clients were not consistent across nodes. So, meta cache requests
would not go to the same server if a continuation happens on a different node.
As node metrics should be scraped per node basis, use a sample
configuartion using all the nodes in targets.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
When no results match or another error occurs, add an error to the stream. Keep the "inspect-input.txt" as the only thing in the zip for reference.
Example:
```
λ mc support inspect --airgap myminio/testbucket/fjghfjh/**
mc: Using public key from C:\Users\klaus\mc\support_public.pem
File data successfully downloaded as inspect-data.enc
λ inspect inspect-data.enc
Using private key from support_private.pem
output written to inspect-data.zip
2024/04/11 14:10:51 next stream: GetRawData: No files matched the given pattern
λ unzip -l inspect-data.zip
Archive: inspect-data.zip
Length Date Time Name
--------- ---------- ----- ----
222 2024-04-11 14:10 inspect-input.txt
--------- -------
222 1 file
λ
```
Modifies inspect to read until end of stream to report the error.
Bonus: Add legacy commandline params
Add following metrics:
- used_inodes
- total_inodes
- healing
- online
- reads_per_sec
- reads_kb_per_sec
- reads_await
- writes_per_sec
- writes_kb_per_sec
- writes_await
- perc_util
To be able to calculate the `per_sec` values, we capture the IOStats-related
data in the beginning (along with the time at which they were captured),
and compare them against the current values subsequently. This is because
dividing by "time since server uptime." doesn't work in k8s environments.
the disk location never changes in the lifetime of a
MinIO cluster, even if it did validate this close to the
disk instead at the higher layer.
Return appropriate errors indicating an invalid drive, so
that the drive is not recognized as part of a valid
drive.
we have had numerous reports on some config
values not having default values, causing
features misbehaving and not having default
values set properly.
This PR tries to address all these concerns
once and for all.
Each new sub-system that gets added
- must check for invalid keys
- must have default values set
- must not "return err" when being saved into
a global state() instead collate as part of
other subsystem errors allow other sub-systems
to independently initialize.
* Allow specifying the local server, with env variable _MINIO_SERVER_LOCAL, in systems where the hostname cannot be resolved to local IP
* Limit scope of the _MINIO_SERVER_LOCAL solution to only containerized implementations
Return an error when the user specifies endpoints for both source
and target. This can generate many type of errors as the code considers
a deployment remote if its endpoint is specified.
HealObject() does not return an error in some cases, for example, when
an object is successfully reconstructed in one disk but fails with other
disks, another case is when a disk does not have the object is temporarily
disconnected
Add the After heal drives result in the audit output for better
analysis.
Set object's modTime when being restored
restored here refers to making a temporary local copy in the hot tier
for a tiered object using the RestoreObject API
we have been using an LRU caching for internode
auth tokens, migrate to using a typed implementation
and also do not cache auth tokens when its an error.
This fixes a regression from #19358 which prevents policy mappings
created in the latest release from being displayed in policy entity
listing APIs.
This is due to the possibility that the base DNs in the LDAP config are
not in a normalized form and #19358 introduced normalized of mapping
keys (user DNs and group DNs). When listing, we check if the policy
mappings are on entities that parse as valid DNs that are descendants of
the base DNs in the config.
Test added that demonstrates a failure without this fix.
Create new code paths for multiple subsystems in the code. This will
make maintaing this easier later.
Also introduce bugLogIf() for errors that should not happen in the first
place.
Using oidc.redirectUri in the values.yaml only works for the deployment.
When using the statefulset the environment variable
MINIO_IDENTITY_OPENID_REDIRECT_URI is not set. This leads to errors with
oicd providers. For example keycloak throws the error 'invalid
redirect_uri'.
This pull request fixes that.
This commit replaces the `KMS.Stat` API call with a
`KMS.GenerateKey` call. This approach is more reliable
since data key generation also works when the KMS backend
is unavailable (temp. offline), but KES has cached the
key. Ref: KES offline caching.
With this change, it is less likely that MinIO readiness
checks fail in cases where the KMS backend is offline.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
Make sure to pass a nil pointer as a Transport to minio-go when the API config
is not initialized, this will make sure that we do not pass an interface
with a known type but a nil value.
This will also fix the update of the API remote_transport_deadline
configuration without requiring the cluster restart.
Use `ODirectPoolSmall` buffers for inline data in PutObject.
Add a separate call for inline data that will fetch a buffer for the inline data before unmarshal.
This fixes a bug where STS Accounts map accumulates accounts in memory
and never removes expired accounts and the STS Policy mappings were not
being refreshed.
The STS purge routine now runs with every IAM credentials load instead
of every 4th time.
The listing of IAM files is now cached on every IAM load operation to
prevent re-listing for STS accounts purging/reload.
Additionally this change makes each server pick a time for IAM loading
that is randomly distributed from a 10 minute interval - this is to
prevent server from thundering while performing the IAM load.
On average, IAM loading will happen between every 5-15min after the
previous IAM load operation completes.
Fix issue [minio#19314], resolve the absence of the sed command in ubi-micro by replacing it with echo.
Signed-off-by: Andreas Bräu <ab@andi95.de>
Co-authored-by: jiuker <2818723467@qq.com>
If site replication enabled across sites, replicate the SSE-C
objects as well. These objects could be read from target sites
using the same client encryption keys.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
User doesn't need to remember and enter the server values,
rather they can select from the pre populated list.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Instead of relying on user input values, we use the DN value returned by
the LDAP server.
This handles cases like when a mapping is set on a DN value
`uid=svc.algorithm,OU=swengg,DC=min,DC=io` with a user input value (with
unicode variation) of `uid=svc﹒algorithm,OU=swengg,DC=min,DC=io`. The
LDAP server on lookup of this DN returns the normalized value where the
unicode dot character `SMALL FULL STOP` (in the user input), gets
replaced with regular full stop.
Bonus: remove persistent md5sum calculation, turn-off
sha256 as well. Instead we always enable crc32c which
is enough for payload verification also support for
trailing headers checksum.
As total drives count, online vs offline are per node basis, its
corect to select node for which graphs need to be rendered.
Set prometheus scrape jobs to fetch metrics from all nodes. A sample
scrape job for node metrics could be as below
```
- job_name: minio-job-node
bearer_token: <token>
metrics_path: /minio/v2/metrics/node
scheme: https
tls_config:
insecure_skip_verify: true
static_configs:
- targets: [tenant1-ss-0-0.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-1.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-2.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-3.tenant1-hl.tenant-ns.svc.cluster.local:9000]
```
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Fix races in IAM cache
Fixes#19344
On the top level we only grab a read lock, but we write to the cache if we manage to fetch it.
a03dac41eb/cmd/iam-store.go (L446) is also flipped to what it should be AFAICT.
Change the internal cache structure to a concurrency safe implementation.
Bonus: Also switch grid implementation.
we must attempt to convert all errors at storage-rest-client
into StorageErr() regardless of what functionality is being
called in, this PR fixes this for multiple callers including
some internally used functions.
- old version was unable to retain messages during config reload
- old version could not go from memory to disk during reload
- new version can batch disk queue entries to single for to reduce I/O load
- error logging has been improved, previous version would miss certain errors.
- logic for spawning/despawning additional workers has been adjusted to trigger when half capacity is reached, instead of when the log queue becomes full.
- old version would json marshall x2 and unmarshal 1x for every log item. Now we only do marshal x1 and then we GetRaw from the store and send it without having to re-marshal.
panic seen due to premature closing of slow channel while listing is still sending or
list has already closed on the sender's side:
```
panic: close of closed channel
goroutine 13666 [running]:
github.com/minio/minio/internal/ioutil.SafeClose[...](0x101ff51e4?)
/Users/kp/code/src/github.com/minio/minio/internal/ioutil/ioutil.go:425 +0x24
github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1()
/Users/kp/code/src/github.com/minio/minio/cmd/erasure-server-pool.go:2142 +0x170
created by github.com/minio/minio/cmd.(*erasureServerPools).Walk in goroutine 1189
/Users/kp/code/src/github.com/minio/minio/cmd/erasure-server-pool.go:1985 +0x228
```
Object names of directory objects qualified for ExpiredObjectAllVersions
must be encoded appropriately before calling on deletePrefix on their
erasure set.
e.g., a directory object and regular objects with overlapping prefixes
could lead to the expiration of regular objects, which is not the
intention of ILM.
```
bucket/dir/ ---> directory object
bucket/dir/obj-1
```
When `bucket/dir/` qualifies for expiration, the current implementation would
remove regular objects under the prefix `bucket/dir/`, in this case,
`bucket/dir/obj-1`.
In handlers related to health diagnostics e.g. CPU, Network, Partitions,
etc, globalMinioHost was being passed as the addr, resulting in empty
value for the same in the health report.
Using globalLocalNodeName instead fixes the issue.
IAM loading is a lazy operation, allow these
fallbacks to be in place when we cannot find
in-memory state().
this allows us to honor the request even if pay
a small price for lookup and populating the data.
When objects have more versions than their ILM policy expects to retain
via NewerNoncurrentVersions, but they don't qualify for expiry due to
NoncurrentDays are configured in that rule.
In this case, applyNewerNoncurrentVersionsLimit method was enqueuing empty
tasks, which lead to a panic (panic: runtime error: index out of range [0] with
length 0) in newerNoncurrentTask.OpHash method, which assumes the task
to contain at least one version to expire.
When returning the status of a decommissioned pool, a pool with zero
time StartedTime will be considered an active pool, which is unexpected.
This commit will always ensure that a pool's canceled/failed/completed
status is returned.
This commit changes how MinIO generates the object encryption key (OEK)
when encrypting an object using server-side encryption.
This change is fully backwards compatible. Now, MinIO generates
the OEK as following:
```
Nonce = RANDOM(32) // generate 256 bit random value
OEK = HMAC-SHA256(EK, Context || Nonce)
```
Before, the OEK was computed as following:
```
Nonce = RANDOM(32) // generate 256 bit random value
OEK = SHA256(EK || Nonce)
```
The new scheme does not technically fix a security issue but
uses a more familiar scheme. The only requirement for the
OEK generation function is that it produces a (pseudo)random value
for every pair (`EK`,`Nonce`) as long as no `EK`-`Nonce` combination
is repeated. This prevents a faulty PRNG from repeating or generating
a "bad" key.
The previous scheme guarantees that the `OEK` is a (pseudo)random
value given that no pair (`EK`,`Nonce`) repeats under the assumption
that SHA256 is indistinguable from a random oracle.
The new scheme guarantees that the `OEK` is a (pseudo)random value
given that no pair (`EK`, `Nonce`) repeats under the assumption that
SHA256's underlying compression function is a PRF/PRP.
While the later is a weaker assumption, and therefore, less likely
to be false, both are considered true. SHA256 is believed to be
indistinguable from a random oracle AND its compression function
is assumed to be a PRF/PRP.
As far as the OEK generating is concerned, the OS random number
generator is not required to be pseudo-random but just non-repeating.
Apart from being more compatible to standard definitions and
descriptions for how to generate crypto. keys, this change does not
have any impact of the actual security of the OEK key generation.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
avoids error during upgrades such as
```
API: SYSTEM()
Time: 19:19:22 UTC 03/18/2024
DeploymentID: 24e4b574-b28d-4e94-9bfa-03c363a600c2
Error: Invalid api configuration: found invalid keys (expiry_workers=100 ) for 'api' sub-system, use 'mc admin config reset myminio api' to fix invalid keys (*fmt.wrapError)
11: internal/logger/logger.go:260:logger.LogIf()
...
```
we were prematurely not writing 4k pages while we
could have due to the fact that most buffers would
be multiples of 4k upto some number and there shall
be some remainder.
We only need to write the remainder without O_DIRECT.
at scale customers might start with failed drives,
causing skew in the overall usage ratio per EC set.
make this configurable such that customers can turn
this off as needed depending on how comfortable they
are.
Cosmetic change, but breaks up a big code block and will make a goroutine
dumps of streams are more readable, so it is clearer what each goroutine is doing.
Currently, the code relies on object parity to decide whether it is a
delete marker or a regular object. In the case of a delete marker, the
return quorum is half of the disks in the erasure set. However, this
calculation must be corrected with objects with EC = 0, mainly
because EC is not a one-time fixed configuration.
Though all data are correct, the manifested symptom is a 503 with an
EC=0 object. This bug was manifested after we introduced the
fast Get Object feature that does not read all data from all disks in
case of inlined objects
Metrics v3 is mainly a reorganization of metrics into smaller groups of
metrics and the removal of internal aggregation of metrics received from
peer nodes in a MinIO cluster.
This change adds the endpoint `/minio/metrics/v3` as the top-level metrics
endpoint and under this, various sub-endpoints are implemented. These
are currently documented in `docs/metrics/v3.md`
The handler will serve metrics at any path
`/minio/metrics/v3/PATH`, as follows:
when PATH is a sub-endpoint listed above => serves the group of
metrics under that path; or when PATH is a (non-empty) parent
directory of the sub-endpoints listed above => serves metrics
from each child sub-endpoint of PATH. otherwise, returns a no
resource found error
All available metrics are listed in the `docs/metrics/v3.md`. More will
be added subsequently.
When an object qualifies for both tiering and expiration rules and is
past its expiration date, it should be expired without requiring to tier
it, even when tiering event occurs before expiration.
Merging same-object - multiple versions from different pools would not always result in correct ordering.
When merging keep inputs separate.
```
λ mc ls --versions local/testbucket
------ before ------
[2024-03-05 20:17:19 CET] 228B STANDARD 1f163718-9bc5-4b01-bff7-5d8cf09caf10 v3 PUT hosts
[2024-03-05 20:19:56 CET] 19KiB STANDARD null v2 PUT hosts
[2024-03-05 20:17:15 CET] 228B STANDARD 73c9f651-f023-4566-b012-cc537fdb7ce2 v1 PUT hosts
------ after ------
λ mc ls --versions local/testbucket
[2024-03-05 20:19:56 CET] 19KiB STANDARD null v3 PUT hosts
[2024-03-05 20:17:19 CET] 228B STANDARD 1f163718-9bc5-4b01-bff7-5d8cf09caf10 v2 PUT hosts
[2024-03-05 20:17:15 CET] 228B STANDARD 73c9f651-f023-4566-b012-cc537fdb7ce2 v1 PUT hosts
```
configure batch size to send audit/logger events
in batches instead of sending one event per connection.
this is mainly to optimize the number of requests
we make to webhook endpoint.
Currently, the progress of the batch job is saved in inside the job
request object, which is normally not supported by MinIO. Though there
is no apparent bug, it is better to fix this now.
Batch progress is saved in .minio.sys/batch-jobs/reports/
Co-authored-by: Anis Eleuch <anis@min.io>
our PoolNumber calculation was costly,
while we already had this information per
endpoint, we needed to deduce it appropriately.
This PR addresses this by assigning PoolNumbers
field that carries all the pool numbers that
belong to a server.
properties.PoolNumber still carries a valid value
only when len(properties.PoolNumbers) == 1, otherwise
properties.PoolNumber is set to math.MaxInt (indicating
that this value is undefined) and then one must rely
on properties.PoolNumbers for server participation
in multiple pools.
addresses the issue originating from #11327
simplify audit webhook worker model
fixes couple of bugs like
- ping(ctx) was creating a logger without updating
number of workers leading to incorrect nWorkers
scaling, causing an additional worker that is not
tracked properly.
- h.logCh <- entry could potentially hang for when
the queue is full on heavily loaded systems.
there can be a sudden spike in tiny allocations,
due to too much auditing being done, also don't hang
on the
```
h.logCh <- entry
```
after initializing workers if you do not have a way to
dequeue for some reason.
This commits adds support for using the `--endpoint` arg when creating a
tier of type `azure`. This is needed to connect to Azure's Gov Cloud
instance. For example,
```
mc ilm tier add azure TARGET TIER_NAME \
--account-name ACCOUNT \
--account-key KEY \
--bucket CONTAINER \
--endpoint https://ACCOUNT.blob.core.usgovcloudapi.net
--prefix PREFIX \
--storage-class STORAGE_CLASS
```
Prior to this, the endpoint was hardcoded to `https://ACCOUNT.blob.core.windows.net`.
The docs were even explicit about this, stating that `--endpoint` is:
"Required for `s3` or `minio` tier types. This option has no effect for any
other value of `TIER_TYPE`."
Now, if the endpoint arg is present it will be used. If not, it will
fall back to the same default behavior of `ACCOUNT.blob.core.windows.net`.
Remove api.expiration_workers config setting which was inadvertently left behind. Per review comment
https://github.com/minio/minio/pull/18926, expiration_workers can be configured via ilm.expiration_workers.
ext4, xfs support this behavior however
btrfs, nfs may not support it properly.
in-case when we see Nlink < 2 then we know
that we need to fallback on readdir()
fixes a regression from #19100fixes#19181
The middleware sets up tracing, throttling, gzipped responses and
collecting API stats.
Additionally, this change updates the names of handler functions in
metric labels to be the same as the name derived from Go lang reflection
on the handler name.
The metric api labels are now stored in memory the same as the handler
name - they will be camelcased, e.g. `GetObject` instead of `getobject`.
For compatibility, we lowercase the metric api label values when emitting the metrics.
- Use a shared worker pool for all ILM expiry tasks
- Free version cleanup executes in a separate goroutine
- Add a free version only if removing the remote object fails
- Add ILM expiry metrics to the node namespace
- Move tier journal tasks to expiryState
- Remove unused on-disk journal for tiered objects pending deletion
- Distribute expiry tasks across workers such that the expiry of versions of
the same object serialized
- Ability to resize worker pool without server restart
- Make scaling down of expiryState workers' concurrency safe; Thanks
@klauspost
- Add error logs when expiryState and transition state are not
initialized (yet)
* metrics: Add missed tier journal entry tasks
* Initialize the ILM worker pool after the object layer
With this commit, MinIO generates root credentials automatically
and deterministically if:
- No root credentials have been set.
- A KMS (KES) is configured.
- API access for the root credentials is disabled (lockdown mode).
Before, MinIO defaults to `minioadmin` for both the access and
secret keys. Now, MinIO generates unique root credentials
automatically on startup using the KMS.
Therefore, it uses the KMS HMAC function to generate pseudo-random
values. These values never change as long as the KMS key remains
the same, and the KMS key must continue to exist since all IAM data
is encrypted with it.
Backward compatibility:
This commit should not cause existing deployments to break. It only
changes the root credentials of deployments that have a KMS configured
(KES, not a static key) but have not set any admin credentials. Such
implementations should be rare or not exist at all.
Even if the worst case would be updating root credentials in mc
or other clients used to administer the cluster. Root credentials
are anyway not intended for regular S3 operations.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
just like client-conn-read-deadline, added a new flag that does
client-conn-write-deadline as well.
Both are not configured by default, since we do not yet know
what is the right value. Allow this to be configurable if needed.
we should do this to ensure that we focus on
data healing as primary focus, fixing metadata
as part of healing must be done but making
data available is the main focus.
the main reason is metadata inconsistencies can
cause data availability issues, which must be
avoided at all cost.
will be bringing in an additional healing mechanism
that involves "metadata-only" heal, for now we do
not expect to have these checks.
continuation of #19154
Bonus: add a pro-active healthcheck to perform a connection
This change makes the label names consistent with the handler names.
This is in preparation to use reflection based API handler function
names for the api labels so they will be the same as tracing, auditing
and logging names for these API calls.
Moved different dashboards to their specific directories. Also
mentioned that these dashbards are examples of how to create
graphs using MinIO provided and metrics and customers should
change / add graphs on their specific need basis.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Streams can return errors if the cancelation is picked up before the response
stream close is picked up. Under extreme load, this could lead to missing
responses.
Send server mux ack async so a blocked send cannot block newMuxStream
call. Stream will not progress until mux has been acked.
in k8s things really do come online very asynchronously,
we need to use implementation that allows this randomness.
To facilitate this move WriteAll() as part of the
websocket layer instead.
Bonus: avoid instances of dnscache usage on k8s
New disk healing code skips/expires objects that ILM supposed to expire.
Add more visibility to the user about this activity by calculating those
objects and print it at the end of healing activity.
This PR fixes a bug that perhaps has been long introduced,
with no visible workarounds. In any deployment, if an entire
erasure set is deleted, there is no way the cluster recovers.
Currently, if one object tag matches with one lifecycle tag filter, ILM
will select it, however, this is wrong. All the Tag filters in the
lifecycle document should be satisfied.
This change is to decouple need for root credentials to match between
site replication deployments.
Also ensuring site replication config initialization is re-tried until
it succeeds, this deoendency is critical to STS flow in site replication
scenario.
Currently, we read from `/proc/diskstats` which is found to be
un-reliable in k8s environments. We can read from `sysfs` instead.
Also, cache the latest drive io stats to find the diff and update
the metrics.
* Remove lock for cached operations.
* Rename "Relax" to `ReturnLastGood`.
* Add `CacheError` to allow caching values even on errors.
* Add NoWait that will return current value with async fetching if within 2xTTL.
* Make benchmark somewhat representative.
```
Before: BenchmarkCache-12 16408370 63.12 ns/op 0 B/op
After: BenchmarkCache-12 428282187 2.789 ns/op 0 B/op
```
* Remove `storageRESTClient.scanning`. Nonsensical - RPC clients will not have any idea about scanning.
* Always fetch remote diskinfo metrics and cache them. Seems most calls are requesting metrics.
* Do async fetching of usage caches.
It also fixes a long-standing bug in expiring transitioned objects.
The expiration action was deleting the current version in the case'
of tiered objects instead of adding a delete marker.
only enable md5sum if explicitly asked by the client, otherwise
its not necessary to compute md5sum when SSE-KMS/SSE-C is enabled.
this is continuation of #17958
If network conditions have filled the output queue before a reconnect happens blocked sends could stop reconnects from happening. In short `respMu` would be held for a mux client while sending - if the queue is full this will never get released and closing the mux client will hang.
A) Use the mux client context instead of connection context for sends, so sends are unblocked when the mux client is canceled.
B) Use a `TryLock` on "close" and cancel the request if we cannot get the lock at once. This will unblock any attempts to send.
data-dir not being present is okay, however we can still
rely on the `rename()` atomic call instead of relying on
write xl.meta write which may truncate the io.EOF.
Add a new function logger.Event() to send the log to Console and
http/kafka log webhooks. This will include some internal events such as
disk healing and rebalance/decommissioning
the PR in #16541 was incorrect and hand wrong assumptions
about the overall setup, revert this since this expectation
to have offline servers is wrong and we can end up with a
bigger chicken and egg problem.
This reverts commit 5996c8c4d5.
Bonus:
- preserve disk in globalLocalDrives properly upon connectDisks()
- do not return 'nil' from newXLStorage(), getting it ready for
the next set of changes for 'format.json' loading.
The previous logic of calculating per second values for disk io stats
divides the stats by the host uptime. This doesn't work in k8s
environment as the uptime is of the pod, but the stats (from
/proc/diskstats) are from the host.
Fix this by storing the initial values of uptime and the stats at the
timme of server startup, and using the difference between current and
initial values when calculating the per second values.
globalLocalDrives seem to be not updated during the
HealFormat() leads to a requirement where the server
needs to be restarted for the healing to continue.
a/prefix
a/prefix/1.txt
where `a/prefix` is an object which does not have `/` at the end,
we do not have to aggressively recursively delete all the sub-folders
as well. Instead convert the call into self contained to deleting
'xl.meta' and then subsequently attempting to Remove the parent.
Bonus: enable audit alerts for object versions
beyond the configured value, default is '100'
versions per object beyond which scanner will
alert for each such objects.
when we expand via pools, there is no reason to stick
with the same distributionAlgo as the rest. Since the
algo only makes sense with-in a pool not across pools.
This allows for newer pools to use newer codepaths to
avoid legacy file lookups when they have a pre-existing
deployment from 2019, they can expand their new pool
to be of a newer distribution format, allowing the
pool to be more performant.
Fix reported races that are actually synchronized by network calls.
But this should add some extra safety for untimely disconnects.
Race reported:
```
WARNING: DATA RACE
Read at 0x00c00171c9c0 by goroutine 214:
github.com/minio/minio/internal/grid.(*muxClient).addResponse()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:519 +0x111
github.com/minio/minio/internal/grid.(*muxClient).error()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:470 +0x21d
github.com/minio/minio/internal/grid.(*Connection).handleDisconnectClientMux()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:1391 +0x15b
github.com/minio/minio/internal/grid.(*Connection).handleMsg()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:1190 +0x1ab
github.com/minio/minio/internal/grid.(*Connection).handleMessages.func1()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:981 +0x610
Previous write at 0x00c00171c9c0 by goroutine 1081:
github.com/minio/minio/internal/grid.(*muxClient).roundtrip()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:94 +0x324
github.com/minio/minio/internal/grid.(*muxClient).traceRoundtrip()
e:/gopath/src/github.com/minio/minio/internal/grid/trace.go:74 +0x10e4
github.com/minio/minio/internal/grid.(*Subroute).Request()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:366 +0x230
github.com/minio/minio/internal/grid.(*SingleHandler[go.shape.*github.com/minio/minio/cmd.DiskInfoOptions,go.shape.*github.com/minio/minio/cmd.DiskInfo]).Call()
e:/gopath/src/github.com/minio/minio/internal/grid/handlers.go:554 +0x3fd
github.com/minio/minio/cmd.(*storageRESTClient).DiskInfo()
e:/gopath/src/github.com/minio/minio/cmd/storage-rest-client.go:314 +0x270
github.com/minio/minio/cmd.erasureObjects.getOnlineDisksWithHealingAndInfo.func1()
e:/gopath/src/github.com/minio/minio/cmd/erasure.go:293 +0x171
```
This read will always happen after the write, since there is a network call in between.
However a disconnect could come in while we are setting up the call, so we protect against that with extra checks.
- bucket metadata does not need to look for legacy things
anymore if b.Created is non-zero
- stagger bucket metadata loads across lots of nodes to
avoid the current thundering herd problem.
- Remove deadlines for RenameData, RenameFile - these
calls should not ever be timed out and should wait
until completion or wait for client timeout. Do not
choose timeouts for applications during the WRITE phase.
- increase R/W buffer size, increase maxMergeMessages to 30
We have observed cases where a blocked stream will block for cancellations.
This happens when response channel is blocked and we want to push an error.
This will have the response mutex locked, which will prevent all other operations until upstream is unblocked.
Make this behavior non-blocking and if blocked spawn a goroutine that will send the response and close the output.
Still a lot of "dancing". Added a test for this and reviewed.
Depending on when the context cancelation is picked up the handler may return and close the channel before `SubscribeJSON` returns, causing:
```
Feb 05 17:12:00 s3-us-node11 minio[3973657]: panic: send on closed channel
Feb 05 17:12:00 s3-us-node11 minio[3973657]: goroutine 378007076 [running]:
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub.(*PubSub[...]).SubscribeJSON.func1()
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub/pubsub.go:139 +0x12d
Feb 05 17:12:00 s3-us-node11 minio[3973657]: created by github.com/minio/minio/internal/pubsub.(*PubSub[...]).SubscribeJSON in goroutine 378010884
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub/pubsub.go:124 +0x352
```
Wait explicitly for the goroutine to exit.
Bonus: Listen for doneCh when sending to not risk getting blocked there is channel isn't being emptied.
this fixes rare bugs we have seen but never really found a
reproducer for
- PutObjectRetention() returning 503s
- PutObjectTags() returning 503s
- PutObjectMetadata() updates during replication returning 503s
These calls return errors, and this perpetuates with
no apparent fix.
This PR fixes with correct quorum requirement.
To force limit the duration of STS accounts, the user can create a new
policy, like the following:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["sts:AssumeRoleWithWebIdentity"],
"Condition": {"NumericLessThanEquals": {"sts:DurationSeconds": "300"}}
}]
}
And force binding the policy to all OpenID users, whether using a claim name or role
ARN.
disk tokens usage is not necessary anymore with the implementation
of deadlines for storage calls and active monitoring of the drive
for I/O timeouts.
Functionality kicking off a bad drive is still supported, it's just that
we do not have to serialize I/O in the manner tokens would do.
Recycle would always be called on the dummy value `any(newRT())` instead of the actual value given to the recycle function.
Caught by race tests, but mostly harmless, except for reduced perf.
Other minor cleanups. Introduced in #18940 (unreleased)
Interpret `null` inline policy for access keys as inheriting parent
policy. Since MinIO Console currently sends this value, we need to honor it
for now. A larger fix in Console and in the server are required.
Fixes#18939.
Allow internal types to support a `Recycler` interface, which will allow for sharing of common types across handlers.
This means that all `grid.MSS` (and similar) objects are shared across in a common pool instead of a per-handler pool.
Add internal request reuse of internal types. Add for safe (pointerless) types explicitly.
Only log params for internal types. Doing Sprint(obj) is just a bit too messy.
With this change, only a user with `UpdateServiceAccountAdminAction`
permission is able to edit access keys.
We would like to let a user edit their own access keys, however the
feature needs to be re-designed for better security and integration with
external systems like AD/LDAP and OpenID.
This change prevents privilege escalation via service accounts.
for actionable, inspections we have `mc support inspect`
we do not need double logging, healing will report relevant
errors if any, in terms of quorum lost etc.
Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.
We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
- healing must not set the write xattr
because that is the job of active healing
to update. what we need to preserve is
permanent deletes.
- remove older env for drive monitoring and
enable it accordingly, as a global value.
local disk metrics were polluting cluster metrics
Please remove them instead of adding relevant ones.
- batch job metrics were incorrectly kept at bucket
metrics endpoint, move it to cluster metrics.
- add tier metrics to cluster peer metrics from the node.
- fix missing set level cluster health metrics
Do not rely on `connChange` to do reconnects.
Instead, you can block while the connection is running and reconnect
when handleMessages returns.
Add fully async monitoring instead of monitoring on the main goroutine
and keep this to avoid full network lockup.
it is entirely possible that a rebalance process which was running
when it was asked to "stop" it failed to write its last statistics
to the disk.
After this a pool expansion can cause disruption and all S3 API
calls would fail at IsPoolRebalancing() function.
This PRs makes sure that we update rebalance.bin under such
conditions to avoid any runtime crashes.
add new update v2 that updates per node, allows idempotent behavior
new API ensures that
- binary is correct and can be downloaded checksummed verified
- committed to actual path
- restart returns back the relevant waiting drives
do not need to be defensive in our approach,
we should simply override anything everything
in import process, do not care about what
currently exists on the disk - backup is the
source of truth.
Right now the format.json is excluded if anything within `.minio.sys` is requested.
I assume the check was meant to exclude only if it was actually requesting it.
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
to make a NoOp DiskInfo call.
Add separate reconnection mutex
Give more safety around reconnects and make sure a state change isn't missed.
Tested with several runs of `λ go test -race -v -count=500`
Adds separate mutex and doesn't mix in the testing mutex.
AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion,
which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream.
Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value.
Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT.
Fixes races like:
```
WARNING: DATA RACE
Read at 0x00c130b29d10 by goroutine 470237:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611
github.com/minio/minio/cmd.readFileInfo()
github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1()
github.com/minio/minio/cmd/erasure-object.go:809 +0x27a
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2()
github.com/minio/minio/cmd/erasure-object.go:828 +0x61
Previous write at 0x00c130b29d10 by goroutine 470298:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
WARNING: DATA RACE
Write at 0x00c0ba6e6c00 by goroutine 94507:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
Previous read at 0x00c0ba6e6c00 by goroutine 94463:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e
github.com/minio/minio/cmd.getBucketInfoLocal.func1()
github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122
github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1()
```
Probably back from #17701
protection was in place. However, it covered only some
areas, so we re-arranged the code to ensure we could hold
locks properly.
Along with this, remove the DataShardFix code altogether,
in deployments with many drive replacements, this can affect
and lead to quorum loss.
Also limit the amount of concurrency when sending
binary updates to peers, avoid high network over
TX that can cause disconnection events for the
node sending updates.
Race checks would occasionally show race on handleMsgWg WaitGroup by debug messages (used in test only).
Use the `connMu` mutex to protect this against concurrent Wait/Add.
Fixes#18827
If site replication is enabled, we should still show the size and
version distribution histogram metrics at bucket level.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
New API now verifies any hung disks before restart/stop,
provides a 'per node' break down of the restart/stop results.
Provides also how many blocked syscalls are present on the
drives and what users must do about them.
Adds options to do pre-flight checks to provide information
to the user regarding any hung disks. Provides 'force' option
to forcibly attempt a restart() even with waiting syscalls
on the drives.
When rejecting incoming grid requests fill out the rejection reason and log it once.
This will give more context when startup is failing. Already logged after a retry on caller.
On a policy detach operation, if there are no policies remaining
attached to the user/group, remove the policy mapping file, instead of
leaving a file containing an empty list of policies.
Healing dangling buckets is conservative, and it is a typical use case to
fail to remove a dangling bucket because it contains some data because
healing danging bucket code is not allowed to remove data: only healing
the dangling object is allowed to do so.
reference format is constant for any lifetime of
a minio cluster, we do not have to ever replace
it during HealFormat() as it will never change.
additionally we should simply reject reference
formats that we do not understand early on.
GetActualSize() was heavily relying on o.Parts()
to be non-empty to figure out if the object is multipart or not,
However, we have many indicators of whether an object is multipart
or not.
Blindly assuming that o.Parts == nil is not a multipart, is an
incorrect expectation instead, multipart must be obtained via
- Stored metadata value indicating this is a multipart encrypted object.
- Rely on <meta>-actual-size metadata to get the object's actual size.
This value is preserved for additional reasons such as these.
- ETag != 32 length
support proxying of tagging requests in active-active replication
Note: even if proxying is successful, PutObjectTagging/DeleteObjectTagging
will continue to report a 404 since the object is not present locally.
New intervals:
[1024B, 64KiB)
[64KiB, 256KiB)
[256KiB, 512KiB)
[512KiB, 1MiB)
The new intervals helps us see object size distribution with higher
resolution for the interval [1024B, 1MiB).
- HealFormat() was leaking healthcheck goroutines for
disks, we are only interested in enabling healthcheck
for the newly formatted disk, not for existing disks.
- When disk is a root-disk a random disk monitor was
leaking while we ignored the drive.
- When loading the disk for each erasure set, we were
leaking goroutines for the prepare-storage.go disks
which were replaced via the globalLocalDrives slice
- avoid disk monitoring utilizing health tokens that
would cause exhaustion in the tokens, prematurely
which were meant for incoming I/O. This is ensured
by avoiding writing O_DIRECT aligned buffer instead
write 2048 worth of content only as O_DSYNC, which is
sufficient.
Add a hidden configuration under the scanner sub section to configure if
the scanner should sleep between two objects scan. The configuration has
only effect when there is no drive activity related to s3 requests or
healing.
By default, the code will keep the current behavior which is doing
sleep between objects.
To forcefully enable the full scan speed in idle mode, you can do this:
`mc admin config set myminio scanner idle_speed=full`
fixes#18724
A regression was introduced in #18547, that attempted
to file adding a missing `null` marker however we
should not skip returning based on versionID instead
it must be based on if we are being asked to create
a DEL marker or not.
The PR also has a side-affect for replicating `null`
marker permanent delete, as it may end up adding a
`null` marker while removing one.
This PR should address both scenarios.
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups.
To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap.
Once this has been enabled, you need to also set `list_quorum`.
```
~ mc admin config set alias/ api list_quorum=auto`
```
A new debugging tool is available to check for any missing counters.
Following policies if present
```
"Condition": {
"IpAddress": {
"aws:SourceIp": [
"54.240.143.0/24",
"2001:DB8:1234:5678::/64"
]
}
}
```
And client is making a request to MinIO via IPv6 can
potentially crash the server.
Workarounds are turn-off IPv6 and use only IPv4
This PR also increases per node bpool memory from 1024 entries
to 2048 entries; along with that, it also moves the byte pool
centrally instead of being per pool.
minio_node_tier_ttlb_seconds - Distribution of time to last byte for streaming objects from warm tier
minio_node_tier_requests_success - Number of requests to download object from warm tier that were successful
minio_node_tier_requests_failure - Number of requests to download object from warm tier that failed
SUBNET now has a v2 of license that is returned in the new key
`license_v2`. mc will start reading and storing the same. (The old key
`license` is deprecated but is still available in SUBNET response to
ensure that the current released version of minio doesn't break)
`(*xlStorageDiskIDCheck).CreateFile` wraps the incoming reader in `xioutil.NewDeadlineReader`.
The wrapped reader is handed to `(*xlStorage).CreateFile`. This performs a Read call via `writeAllDirect`,
which reads into an `ODirectPool` buffer.
`(*DeadlineReader).Read` spawns an async read into the buffer. If a timeout is hit while reading,
the read operation returns to `writeAllDirect`. The operation returns an error and the buffer is reused.
However, if the async `Read` call unblocks, it will write to the now recycled buffer.
Fix: Remove the `DeadlineReader` - it is inherently unsafe. Instead, rely on the network timeouts.
This is not a disk timeout, anyway.
Regression in https://github.com/minio/minio/pull/17745
This patch adds the targetID to the existing notification target metrics
and deprecates the current target metrics which points to the overall
event notification subsystem
historically, we have always kept storage-rest-server
and a local storage API separate without much trouble,
since they both can independently operate due to no
special state() between them.
however, over some time, we have added state()
such as
- drive monitoring threads now there will be "2" of
them per drive instead of just 1.
- concurrent tokens available per drive are now twice
instead of just single shared, allowing unexpectedly
high amount of I/O to go through.
- applying serialization by using walkMutexes can now
be adequately honored for both remote callers and local
callers.
Regression from #18285. CopyObject options were inheriting source MTime
for metadata timestamps if unspecified, removing this prevented metadata
updates from being applied on target.
The metrics `minio_bucket_replication_received_bytes` and
`minio_bucket_replication_sent_bytes` are additive in nature
and rendering the value as is looks fine.
Also added sort order for few graphs for better reading of tool
tips as keeping ones with highest value at top helps.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
By default the cpu load is the cumulative of all cores. Capture the
percentage load (load * 100 / cpu-count)
Also capture the percentage memory used (used * 100 / total)
use memory for async events when necessary and dequeue them as
needed, for all synchronous events customers must enable
```
MINIO_API_SYNC_EVENTS=on
```
Async events can be lost but is upto to the admin to
decide what they want, we will not create run-away number
of goroutines per event instead we will queue them properly.
Currently the max async workers is set to runtime.GOMAXPROCS(0)
which is more than sufficient in general, but it can be made
configurable in future but may not be needed.
there is potential for danglingWrites when quorum failed, where
only some drives took a successful write, generally this is left
to the healing routine to pick it up. However it is better that
we delete it right away to avoid potential for quorum issues on
version signature when there are many versions of an object.
it is okay if the warm-tier cannot keep up, we should continue
to take I/O at hot-tier, only fail hot-tier or block it when
we are disk full.
Bonus: add metrics counter for these missed tasks, we will
know for sure if one of the node is lagging behind or is
losing too many tasks during transitioning.
A disk that is not able to initialize when an instance is started
will never have a handler registered, which means a user will
need to restart the node after fixing the disk;
This will also prevent showing the wrong 'upgrade is needed.'
error message in that case.
When the disk is still failing, print an error every 30 minutes;
Disk reconnection will be retried every 30 seconds.
Co-authored-by: Anis Elleuch <anis@min.io>
`OpMuxConnectError` was not handled correctly.
Remove local checks for single request handlers so they can
run before being registered locally.
Bonus: Only log IAM bootstrap on startup.
```
using deb packager...
created package: minio-release/linux-amd64/minio_20231120224007.0.0.hotfix.e96ac7272_amd64.deb
using rpm packager...
created package: minio-release/linux-amd64/minio-20231120224007.0.0.hotfix.e96ac7272-1.x86_64.rpm
```
While healing the latest changes of expiry rules across sites
if target had pre existing transition rules, they were getting
overwritten as cloned latest expiry rules from remote site were
getting written as is. Fixed the same and added test cases as
well.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
moveToTrash() function moves a folder to .trash, for example, when
doing some object deletions: a data dir that has many parts will be
renamed to the trash folder; However, ENOSPC is a valid error from
rename(), and it can cripple a user trying to free some space in an
entire disk situation.
Therefore, this commit will try to do a recursive delete in that case.
This allows batch replication to basically do not
attempt to copy objects that do not have read quorum.
This PR also allows walk() to provide custom
values for quorum under batch replication, and
key rotation.
this PR allows following policy
```
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Deny a presigned URL request if the signature is more than 10 min old",
"Effect": "Deny",
"Action": "s3:*",
"Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET1/*",
"Condition": {
"NumericGreaterThan": {
"s3:signatureAge": 600000
}
}
}
]
}
```
This is to basically disable all pre-signed URLs that are older than 10 minutes.
AWS S3 closes keep-alive connections frequently
leading to frivolous logs filling up the MinIO
logs when the transition tier is an AWS S3 bucket.
Ignore such transient errors, let MinIO retry
it when it can.
When minio runs with MINIO_CI_CD=on, it is expected to communicate
with the locally running SUBNET. This is happening in the case of MinIO
via call home functionality. However, the subnet-related functionality inside the
console continues to talk to the SUBNET production URL. Because of this,
the console cannot be tested with a locally running SUBNET.
Set the env variable CONSOLE_SUBNET_URL correctly in such cases.
(The console already has code to use the value of this variable
as the subnet URL)
Optionally allows customers to enable
- Enable an external cache to catch GET/HEAD responses
- Enable skipping disks that are slow to respond in GET/HEAD
when we have already achieved a quorum
Bonus: allow replication to attempt Deletes/Puts when
the remote returns quorum errors of some kind, this is
to ensure that MinIO can rewrite the namespace with the
latest version that exists on the source.
This PR adds a WebSocket grid feature that allows servers to communicate via
a single two-way connection.
There are two request types:
* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
roundtrips with small payloads.
* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
which allows for different combinations of full two-way streams with an initial payload.
Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.
Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.
If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte
slices or use a higher-level generics abstraction.
There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.
The request path can be changed to a new one for any protocol changes.
First, all servers create a "Manager." The manager must know its address
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using.
```
func (m *Manager) Connection(host string) *Connection
```
All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight
requests and responses must also be given for streaming requests.
The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.
* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
performs a single request and returns the result. Any deadline provided on the request is
forwarded to the server, and canceling the context will make the function return at once.
* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
will initiate a remote call and send the initial payload.
```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
//The appropriate error will be returned.
type Stream struct {
// Responses from the remote server.
// Channel will be closed after an error or when the remote closes.
// All responses *must* be read by the caller until either an error is returned or the channel is closed.
// Canceling the context will cause the context cancellation error to be returned.
Responses <-chan Response
// Requests sent to the server.
// If the handler is defined with 0 incoming capacity this will be nil.
// Channel *must* be closed to signal the end of the stream.
// If the request context is canceled, the stream will no longer process requests.
Requests chan<- []byte
}
type Response struct {
Msg []byte
Err error
}
```
There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
With an odd number of drives per erasure set setup, the write/quorum is
the half + 1; however the decommissioning listing will still list those
objects and does not consider those as stale.
Fix it by using (N+1)/2 formula.
Co-authored-by: Anis Elleuch <anis@min.io>
Immediate transition use case and is mostly used to fill warm
backend with a lot of data when a new deployment is created
Currently, if the transition queue is complete, the transition will be
deferred to the scanner; change this behavior by blocking the PUT request
until the transition queue has a new place for a transition task.
Currently, once the audit becomes offline, there is no code that tries
to reconnect to the audit, at the same time Send() quickly returns with
an error without really trying to send a message the audit endpoint; so
the audit endpoint will never be online again.
Fixing this behavior; the current downside is that we miss printing some
logs when the audit becomes offline; however this information is
available in prometheus
Later, we can refactor internal/logger so the http endpoint can send errors to
console target.
Currently if the object does not exist in quorum disks of an erasure
set, the dangling code is never called because the returned error will
be errFileNotFound or errFileVersionNotFound;
With this commit, when errFileNotFound or errFileVersionNotFound is
returning when trying to calculate the quorum of a given object, the
code checks if a disk returned nil, which means a stale object exists in
that disk, that will trigger deleteIfDangling() function
This commit splits the liveness and readiness
handler into two separate handlers. In K8S, a
liveness probe is used to determine whether the
pod is in "live" state and functioning at all.
In contrast, the readiness probe is used to
determine whether the pod is ready to serve
requests.
A failing liveness probe causes pod restarts while
a failing readiness probe causes k8s to stop routing
traffic to the pod. Hence, a liveness probe should
be as robust as possible while a readiness probe
should be used to load balancing.
Ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Signed-off-by: Andreas Auernhammer <github@aead.dev>
This patch takes care of loading the bucket configs of failed buckets
during the periodic refresh. This makes sure the event notifiers and
remote bucket targets are properly initialized.
users might use MinIO on NFS, GPFS that provide dynamic
inodes and may not even have a concept of free inodes.
to allow users to use MinIO on top of GPFS relax the
free inode check.
* creating a byte buffer for SFTP file segments
* Adding an error condition for when there are
remaining segments in the queue
* Simplification of the queue using a map
it is possible that ILM or Deletes got triggered on batch
of objects that we are attempting to batch replicate, ignore
this scenario as valid behavior.
sendfile implementation to perform DMA on all platforms
Go stdlib already supports sendfile/splice implementations
for
- Linux
- Windows
- *BSD
- Solaris
Along with this change however O_DIRECT for reads() must be
removed as well since we need to use sendfile() implementation
The main reason to add O_DIRECT for reads was to reduce the
chances of page-cache causing OOMs for MinIO, however it would
seem that avoiding buffer copies from user-space to kernel space
this issue is not a problem anymore.
There is no Go based memory allocation required, and neither
the page-cache is referenced back to MinIO. This page-
cache reference is fully owned by kernel at this point, this
essentially should solve the problem of page-cache build up.
With this now we also support SG - when NIC supports Scatter/Gather
https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)
`monitorAndConnectEndpoints` will continue to attempt to reconnect offline disks.
Since disks were never closed, a `MarkOffline` would continue to try to check these disks forever.
Close previous disks.
replace io.Discard usage to fix NUMA copy() latencies
On NUMA systems copying from 8K buffer allocated via
io.Discard leads to large latency build-up for every
```
copy(new8kbuf, largebuf)
```
can in-cur upto 1ms worth of latencies on NUMA systems
due to memory sharding across NUMA nodes.
Fix various regressions from #18029
* If context is canceled the token is never returned. This will lead to scanner being unable to save and deadlocking.
* Fix backup not being able to get any data (hr empty)
* Reduce backup timeout.
Tiering statistics have been broken for some time now, a regression
was introduced in 6f2406b0b6
Bonus fixes an issue where the objects are not assumed to be
of the 'STANDARD' storage-class for the objects that have
not yet tiered, this should be conditional based on the object's
metadata not a default assumption.
This PR also does some cleanup in terms of implementation,
fixes#18070
https://github.com/minio/minio/pull/18307 partially removed the duplicate upload id check.
While I can't really see how ListDir can return duplicate entries, let's re-add it, since it is a cheap sanity check.
This commit changes the container base image
from ubi-minimal to ubi-micro.
The docker build process happens now in two stages.
The build stage:
- downloads the latest CA certificate bundle
- downloads MinIO binary (for requested version/os/arch)
- downloads MinIO binary signature and verifies it
using minisign
Then it creates an image based on ubi-micro with just
the minio binary was downloaded and verified during the
build stage.
The build stage is simplified to just verifying the
minisign signature.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
There can be rare situations where errors seen in bucket metadata
load on startup or subsequent metadata updates can result in missing
replication remotes.
Attempt a refresh of remote targets backed by a good replication config
lazily in 5 minute intervals if there ever occurs a situation where
remote targets go AWOL.
resync status may not be upto-date by
the time the resync is over due to how
the timer is triggered.
diff is sufficient to know if replication
happened or not.
`GetParityForSC` has a value receiver, so Config is copied before the lock is obtained.
Make it pointer receiver.
Fixes:
```
WARNING: DATA RACE
Read at 0x0000079cdd10 by goroutine 190:
github.com/minio/minio/cmd.(*erasureServerPools).BackendInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:579 +0x6f
github.com/minio/minio/cmd.(*erasureServerPools).LocalStorageInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:614 +0x3c6
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler()
github.com/minio/minio/cmd/peer-rest-server.go:347 +0x4ea
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler-fm()
...
WARNING: DATA RACE
Read at 0x0000079cdd10 by goroutine 190:
github.com/minio/minio/cmd.(*erasureServerPools).BackendInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:579 +0x6f
github.com/minio/minio/cmd.(*erasureServerPools).LocalStorageInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:614 +0x3c6
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler()
github.com/minio/minio/cmd/peer-rest-server.go:347 +0x4ea
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler-fm()
```
FROM registry.access.redhat.com/ubi8/ubi-micro:latest
ARG RELEASE
LABELname="MinIO"\
vendor="MinIO Inc <dev@min.io>"\
maintainer="MinIO Inc <dev@min.io>"\
version="${RELEASE}"\
release="${RELEASE}"\
summary="MinIO is a High Performance Object Storage, API compatible with Amazon S3 cloud storage service."\
description="MinIO object storage is fundamentally different. Designed for performance and the S3 API, it is 100% open-source. MinIO is ideal for large, private cloud environments with stringent security requirements and delivers mission-critical availability across a diverse range of workloads."
test-replication:install-racetest-replication-2sitetest-replication-3sitetest-delete-replicationtest-sio-errortest-delete-marker-proxying## verify multi site replication
@echo "Running tests for replicating three sites"
test-site-replication-ldap:install-race## verify automatic site replication
MinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads.
MinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads. To learn more about what MinIO is doing for AI storage, go to [AI storage documentation](https://min.io/solutions/object-storage-for-ai).
This README provides quickstart instructions on running MinIO on bare metal hardware, including container-based installations. For Kubernetes environments, use the [MinIO Kubernetes Operator](https://github.com/minio/operator/blob/master/README.md).
@@ -93,7 +93,6 @@ The following table lists supported architectures. Replace the `wget` URL with t
| 64-bit ARM | <https://dl.min.io/server/minio/release/linux-arm64/minio> |
| 64-bit PowerPC LE (ppc64le) | <https://dl.min.io/server/minio/release/linux-ppc64le/minio> |
| IBM Z-Series (S390X) | <https://dl.min.io/server/minio/release/linux-s390x/minio> |
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
@@ -123,7 +122,7 @@ You can also connect using any S3-compatible tool, such as the MinIO Client `mc`
## Install from Source
Use the following commands to compile and run a standalone MinIO server from source. Source installation is only intended for developers and advanced users. If you do not have a working Golang environment, please follow [How to install Golang](https://golang.org/doc/install). Minimum version required is [go1.19](https://golang.org/dl/#stable)
Use the following commands to compile and run a standalone MinIO server from source. Source installation is only intended for developers and advanced users. If you do not have a working Golang environment, please follow [How to install Golang](https://golang.org/doc/install). Minimum version required is [go1.21](https://golang.org/dl/#stable)
```sh
go install github.com/minio/minio@latest
@@ -210,10 +209,6 @@ For deployments behind a load balancer, proxy, or ingress rule where the MinIO h
For example, consider a MinIO deployment behind a proxy `https://minio.example.net`, `https://console.minio.example.net` with rules for forwarding traffic on port :9000 and :9001 to MinIO and the MinIO Console respectively on the internal network. Set `MINIO_BROWSER_REDIRECT_URL` to `https://console.minio.example.net` to ensure the browser receives a valid reachable URL.
Similarly, if your TLS certificates do not have the IP SAN for the MinIO server host, the MinIO Console may fail to validate the connection to the server. Use the `MINIO_SERVER_URL` environment variable and specify the proxy-accessible hostname of the MinIO server to allow the Console to use the MinIO server API using the TLS certificate.
For example: `export MINIO_SERVER_URL="https://minio.example.net"`
"${MINIO[@]}" server "${WORK_DIR}/fs-disk" >"$WORK_DIR/fs-minio.log" 2>&1&
sleep 10
"${WORK_DIR}/mc" ready verify
}
function start_minio_erasure(){
"${MINIO[@]}" server "${WORK_DIR}/erasure-disk1""${WORK_DIR}/erasure-disk2""${WORK_DIR}/erasure-disk3""${WORK_DIR}/erasure-disk4" >"$WORK_DIR/erasure-minio.log" 2>&1&
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.