the reason for this is to avoid STS mappings to be
purged without a successful load of other policies,
and all the credentials only loaded successfully
are properly handled.
This also avoids unnecessary cache store which was
implemented earlier for optimization.
Directory objects are used by applications that simulate the folder
structure of an on-disk filesystem. These are zero-byte objects with names
ending with '/'. They are only used to check whether a 'folder' exists in
the namespace.
StartSize starts with the raw free space of all disks in the given pool,
however during the status, CurrentSize is not showing the current free
raw space, as expected at least by `mc admin decom status` since it was
written.
Go's net/http is notoriously difficult to have a streaming
deadlines per READ/WRITE on the net.Conn if we add them they
interfere with the Go's internal requirements for a HTTP
connection.
Remove this support for now
fixes#19853
Add partial shard reconstruction
* Add partial shard reconstruction
* Fix padding causing the last shard to be rejected
* Add md5 checks on single parts
* Move md5 verified to `verified/filename.ext`
* Move complete (without md5) to `complete/filename.ext.partno`
It's not pretty, but at least now the md5 gives some confidence it works correctly.
In the very rare case when all drives in a erasure set need to be healed,
remove .healing.bin from all drives, otherwise it will be stuck in a
loop
Also, fix a unit test that fails sometimes due to wrong test.
since #19688 there was a regression introduced during drive
lookups for single node multi-drive setups, drive replacement
would not work correctly without this PR.
This does not fix any current issue, but merging https://github.com/minio/madmin-go/pull/282
can lose the validation of the service account expiration time.
Add more defensive code for now. In the future, we should avoid doing
validation in another library.
precondition check was being honored before, validating
if anonymous access is allowed on the metadata of an
object, leading to metadata disclosure of the following
headers.
```
Last-Modified
Etag
x-amz-version-id
Expires:
Cache-Control:
```
although the information presented is minimal in nature,
and of opaque nature. It still simply discloses that an
object by a specific name exists or not without even having
enough permissions.
fix: authenticate LDAP via actual DN instead of normalized DN
Normalized DN is only for internal representation, not for
external communication, any communication to LDAP must be
based on actual user DN. LDAP servers do not understand
normalized DN.
fixes#19757
This change uses the updated ldap library in minio/pkg (bumped
up to v3). A new config parameter is added for LDAP configuration to
specify extra user attributes to load from the LDAP server and to store
them as additional claims for the user.
A test is added in sts_handlers.go that shows how to access the LDAP
attributes as a claim.
This is in preparation for adding SSH pubkey authentication to MinIO's SFTP
integration.
Add combination of multiple parts.
Parts will be reconstructed and saved separately and can manually be combined to the complete object.
Parts will be named `(version_id)-(filename).(partnum).(in)complete`.
Do not log errors on oneway streams when sending ping fails. Instead, cancel the stream.
This also makes sure pings are sent when blocked on sending responses.
This commit will fix one rare case of a multipart object that
can be read in theory but GetObject API returned an error.
It turned out that a six years old code was marking a drive offline
when the bitrot streaming fails to read a part in a disk with any error.
This can affect reading a subsequent part, though having enough shards,
but unable to construct because one drive was marked offline earlier.
This commit will remove the drive marking offline code. It will also
close the bitrotstreaming reader before marking it as nil.
Adds `-xver` which can be used with `-export` and `-combine` to attempt to combine files across versions if data is suspected to be the same. Overlapping data is compared.
Bonus: Make `inspect` accept wildcards.
Currently, on enabling callhome (or restarting the server), the callhome
job gets scheduled. This means that one has to wait for 24hrs (the
default frequency duration) to see it in action and to figure out if it
is working as expected.
It will be a better user experience to perform the first callhome
execution immediately after enabling it (or on server start if already
enabled).
Also, generate audit event on callhome execution, setting the error
field in case the execution has failed.
* Store ModTime in the upload ID; return it when listing instead of the current time.
* Use this ModTime to expire and skip reading the file info.
* Consistent upload sorting in listing (since it now has the ModTime).
* Exclude healing disks to avoid returning an empty list.
```
==================
WARNING: DATA RACE
Read at 0x0000082be990 by goroutine 205:
github.com/minio/minio/cmd.setCommonHeaders()
Previous write at 0x0000082be990 by main goroutine:
github.com/minio/minio/cmd.lookupConfigs()
```
Recent Veeam is very picky about storage class names. Add `_MINIO_VEEAM_FORCE_SC` env var.
It will override the storage class returned by the storage backend if it is non-standard
and we detect a Veeam client by checking the User Agent.
Applies to HeadObject/GetObject/ListObject*
add deadlines that can be dynamically changed via
the drive max timeout values.
Bonus: optimize "file not found" case and hung drives/network - circuit break the check and return right
away instead of waiting.
Do not log errors on oneway streams when sending ping fails. Instead cancel the stream.
This also makes sure pings are sent when blocked on sending responses.
I will do a separate PR that includes this and adds pings to two-way streams as well as tests for pings.
as that is the only API where the TTFB metric is beneficial, and
capturing this for all APIs exponentially increases the response size in
large clusters.
Replace the `io.Pipe` from streamingBitrotWriter -> CreateFile with a fixed size ring buffer.
This will add an output buffer for encoded shards to be written to disk - potentially via RPC.
This will remove blocking when `(*streamingBitrotWriter).Write` is called, and it writes hashes and data.
With current settings, the write looks like this:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ Parr. │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Pipe │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (unbuffered) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
We write a Hash (32 bytes). Since the pipe is unbuffered, it will block until the 32 bytes have
been delivered to the TCP buffer, and the next Read hits the Pipe.
Then we write the shard data. This will typically be bigger than 64KB, so it will block until two blocks
have been read from the pipe.
When we insert a ring buffer:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Ring Buffer │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (2MB) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
The hash+shard will fit within the ring buffer, so writes will not block - but will complete after a
memcopy. Reads can fill the 64KB buffer if there is data for it.
If the network is congested, the ring buffer will become filled, and all syscalls will be on full buffers.
Only when the ring buffer is filled will erasure coding start blocking.
Since there is always "space" to write output data, we remove the parallel writing since we are
always writing to memory now, and the goroutine synchronization overhead probably not worth taking.
If the output were blocked in the existing, we would still wait for it to unblock in parallel write, so it would
make no difference there - except now the ring buffer smoothes out the load.
There are some micro-optimizations we could look at later. The biggest is that, in most cases,
we could encode directly to the ring buffer - if we are not at a boundary. Also, "force filling" the
Read requests (i.e., blocking until a full read can be completed) could be investigated and maybe
allow concurrent memory on read and write.
Metrics being added:
- read_tolerance: No of drive failures that can be tolerated without
disrupting read operations
- write_tolerance: No of drive failures that can be tolerated without
disrupting write operations
- read_health: Health of the erasure set in a pool for read operations
(1=healthy, 0=unhealthy)
- write_health: Health of the erasure set in a pool for write operations
(1=healthy, 0=unhealthy)
Adds regression test for #19699
Failures are a bit luck based, since it requires objects to be placed on different sets.
However this generates a failure prior to #19699
* Revert "Revert "Fix incorrect merging of slash-suffixed objects (#19699)""
This reverts commit f30417d9a8.
* Don't override when suffix doesn't match. Instead rely on quorum for each.
Instead of having "online" and "healing" as two metrics, replace with a
single metric "health" which can have following values:
0 = offline
1 = healthy
2 = healing
If two objects share everything but one object has a slash prefix, those would be merged in listings,
with secondary properties used for a tiebreak.
Example: An object with the key `prefix/obj` would be merged with an object named `prefix/obj/`.
While this violates the [no object can be a prefix of another](https://min.io/docs/minio/linux/operations/concepts/thresholds.html#conflicting-objects), let's resolve these.
If we have an object with 'name' and a directory named 'name/' discard the directory only - but allow objects
of 'name' and 'name/' (xldir) to be uniquely returned.
Regression from #15772
canceled callers might linger around longer,
can potentially overwhelm the system. Instead
provider a caller context and canceled callers
don't hold on to them.
Bonus: we have no reason to cache errors, we should
never cache errors otherwise we can potentially have
quorum errors creeping in unexpectedly. We should
let the cache when invalidating hit the actual resources
instead.
LastPong is saved as nanoseconds after a connection or reconnection but
saved as seconds when receiving a pong message. The code deciding if
a pong is too old can be skewed since it assumes LastPong is only in
seconds.
Accept multipart uploads where the combined checksum provides the expected part count.
It seems this was added by AWS to make the API more consistent, even if the
data is entirely superfluous on multiple levels.
Improves AWS S3 compatibility.
This commit adds support for MinKMS. Now, there are three KMS
implementations in `internal/kms`: Builtin, MinIO KES and MinIO KMS.
Adding another KMS integration required some cleanup. In particular:
- Various KMS APIs that haven't been and are not used have been
removed. A lot of the code was broken anyway.
- Metrics are now monitored by the `kms.KMS` itself. For basic
metrics this is simpler than collecting metrics for external
servers. In particular, each KES server returns its own metrics
and no cluster-level view.
- The builtin KMS now uses the same en/decryption implemented by
MinKMS and KES. It still supports decryption of the previous
ciphertext format. It's backwards compatible.
- Data encryption keys now include a master key version since MinKMS
supports multiple versions (~4 billion in total and 10000 concurrent)
per key name.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
If used, 'opts.Marker` will cause many missed entries since results are returned
unsorted, and pools are serialized.
Switch to fully concurrent listing and merging across pools to return sorted entries.
It is expected that whoever is using the credentials which has
the proper set of permissions must be able to run.
`mc support perf object`
While the root login is disabled.
fixes#19648
AWS S3 returns the actual object size as part of XML
response for InvalidRange error, this is used apparently
by SDKs to retry the request without the range.
'opts.Marker` is causing many missed entries if used since results are returned unsorted. Also since pools are serialized.
Switch to do fully concurrent listing and merging across pools to return sorted entries.
Returning errors on listings is impossible with the current API, so document that.
Return an error at once if no drives are found instead of just returning an empty listing and no error.
This is to support deployments migrating from a multi-pooled
wider stripe to lower stripe. MINIO_STORAGE_CLASS_STANDARD
is still expected to be same for all pools. So you can satisfy
adding custom drive count based pools by adjusting the storage
class value.
```
version: v2
address: ':9000'
rootUser: 'minioadmin'
rootPassword: 'minioadmin'
console-address: ':9001'
pools: # Specify the nodes and drives with pools
-
args:
- 'node{11...14}.example.net/data{1...4}'
-
args:
- 'node{15...18}.example.net/data{1...4}'
-
args:
- 'node{19...22}.example.net/data{1...4}'
-
args:
- 'node{23...34}.example.net/data{1...10}'
set-drive-count: 6
```
ILM actions due to ExpiredObjectDeleteAllVersions and
DelMarkerExpiration are ignored when object locking is enabled on a
bucket.
Note: This applies to object versions which may not have retention
configured on them. This applies to all object versions in this bucket,
including those created before the retention config was applied.
Per-bucket metrics endpoints always start with /bucket and the bucket
name is appended to the path. e.g. if the collector path is /bucket/api,
the endpoint for the bucket "mybucket" would be
/minio/metrics/v3/bucket/api/mybucket
Change the existing bucket api endpoint accordingly from /api/bucket to
/bucket/api
The `Token` parameter is a sensitive value that should not be output in the Audit log for STS AssumeRoleWithCustomToken API.
Bonus: Add a simple tool that echoes audit logs to the console.
When listing, with drives returning `errFileNotFound,` `errVolumeNotFound`, or `errUnformattedDisk,`,
we could get below `minDisks` drives being left.
This would result in a quorum never being reachable for any object. Therefore, the listing
would continue, but no results would ever be produced.
Include `fnf` in the mindisk check since it is incremented on these errors. This will stop
listing when minDisks are left.
Allow `opts.minDisks` to not return errVolumeNotFound or errFileNotFound and return that.
That will allow for good results even if disks return something else.
We switch `errUnformattedDisk` to a regular error. If we have enough of those, we should just fail.
Typically not all drives are connected, so we delay 3 minutes before resuming.
This greatly reduces risk of starting to list unconnected drives, or drives we risk being disconnected soon.
This delay is not applied when starting with an admin call.
ConsoleUI like applications rely on combination of
ListServiceAccounts() and InfoServiceAccount() to populate
UI elements, however individually these calls can be slow
causing the entire UI to load sluggishly.
i.e., this rule element doesn't apply to DEL markers.
This is a breaking change to how ExpiredObejctDeleteAllVersions
functions today. This is necessary to avoid the following highly probable
footgun scenario in the future.
Scenario:
The user uses tags-based filtering to select an object's time to live(TTL).
The application sometimes deletes objects, too, making its latest
version a DEL marker. The previous implementation skipped tag-based filters
if the newest version was DEL marker, voiding the tag-based TTL. The user is
surprised to find objects that have expired sooner than expected.
* Add DelMarkerExpiration action
This ILM action removes all versions of an object if its
the latest version is a DEL marker.
```xml
<DelMarkerObjectExpiration>
<Days> 10 </Days>
</DelMarkerObjectExpiration>
```
1. Applies only to objects whose,
• The latest version is a DEL marker.
• satisfies the number of days criteria
2. Deletes all versions of this object
3. Associated rule can't have tag-based filtering
Includes,
- New bucket event type for deletion due to DelMarkerExpiration
calling a remote target remove with a perfectly
well constructed ARN can lead to a crash for a bucket
with no replication configured.
This PR fixes, and adds a crash check for ImportMetadata
as well.
Algorithms are comma separated.
Note that valid values does not in all cases represent default values.
`--sftp=pub-key-algos=...` specifies the supported client public key
authentication algorithms. Note that this doesn't include certificate types
since those use the underlying algorithm. This list is sent to the client if
it supports the server-sig-algs extension. Order is irrelevant.
Valid values
```
ssh-ed25519
sk-ssh-ed25519@openssh.comsk-ecdsa-sha2-nistp256@openssh.com
ecdsa-sha2-nistp256
ecdsa-sha2-nistp384
ecdsa-sha2-nistp521
rsa-sha2-256
rsa-sha2-512
ssh-rsa
ssh-dss
```
`--sftp=kex-algos=...` specifies the supported key-exchange algorithms in preference order.
Valid values:
```
curve25519-sha256
curve25519-sha256@libssh.org
ecdh-sha2-nistp256
ecdh-sha2-nistp384
ecdh-sha2-nistp521
diffie-hellman-group14-sha256
diffie-hellman-group16-sha512
diffie-hellman-group14-sha1
diffie-hellman-group1-sha1
```
`--sftp=cipher-algos=...` specifies the allowed cipher algorithms.
If unspecified then a sensible default is used.
Valid values:
```
aes128-ctr
aes192-ctr
aes256-ctr
aes128-gcm@openssh.comaes256-gcm@openssh.comchacha20-poly1305@openssh.com
arcfour256
arcfour128
arcfour
aes128-cbc
3des-cbc
```
`--sftp=mac-algos=...` specifies a default set of MAC algorithms in preference order.
This is based on RFC 4253, section 6.4, but with hmac-md5 variants removed because they have
reached the end of their useful life.
Valid values:
```
hmac-sha2-256-etm@openssh.comhmac-sha2-512-etm@openssh.com
hmac-sha2-256
hmac-sha2-512
hmac-sha1
hmac-sha1-96
```
This would reduce the size of data in response of metrics
listing. While graphing we can default these metrics with
a zero value if not found.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Unfreeze as soon as the incoming connection is terminated and don't wait for everything to complete.
We don't want to keep the services frozen if something becomes stuck.
- handle errFileCorrupt properly
- micro-optimization of sending done() response quicker
to close the goroutine.
- fix logger.Event() usage in a couple of places
- handle the rest of the client to return a different error other than
lastErr() when the client is closed.
listPathRaw() counts errDiskNotFound as a valid error to indicate a
listing stream end. However, storage.WalkDir() is allowed to return
errDiskNotFound anytime since grid.ErrDisconnected is converted to
errDiskNotFound.
This affects fresh disk healing and should affect S3 listing as well.
endpoint: /minio/metrics/v3/system/process
metrics:
- locks_read_total
- locks_write_total
- cpu_total_seconds
- go_routine_total
- io_rchar_bytes
- io_read_bytes
- io_wchar_bytes
- io_write_bytes
- start_time_seconds
- uptime_seconds
- file_descriptor_limit_total
- file_descriptor_open_total
- syscall_read_total
- syscall_write_total
- resident_memory_bytes
- virtual_memory_bytes
- virtual_memory_max_bytes
Since the standard process collector implements only a subset of these
metrics, remove it and implement our own custom process collector that
captures all the process metrics we need.
Since the object is being permanently deleted, the lack of read quorum should not
matter as long as sufficient disks are online to complete the deletion with parity
requirements.
If several pools have the same object with insufficient read quorum, attempt to
delete object from all the pools where it exists
At server startup, LDAP configuration is validated against the LDAP
server. If the LDAP server is down at that point, we need to cleanly
disable LDAP configuration. Previously, LDAP would remain configured but
error out in strange ways because initialization did not complete
without errors.
When importing access keys (i.e. service accounts) for LDAP accounts,
we are requiring groups to exist under one of the configured group base
DNs. This is not correct. This change fixes this by only checking for
existence and storing the normalized form of the group DN - we do not
return an error if the group is not under a base DN.
Test is updated to illustrate an import failure that would happen
without this change.
Existing IAM import logic for LDAP creates new mappings when the
normalized form of the mapping key differs from the existing mapping key
in storage. This change effectively replaces the existing mapping key by
first deleting it and then recreating with the normalized form of the
mapping key.
For e.g. if an older deployment had a policy mapped to a user DN -
`UID=alice1,OU=people,OU=hwengg,DC=min,DC=io`
instead of adding a mapping for the normalized form -
`uid=alice1,ou=people,ou=hwengg,dc=min,dc=io`
we should replace the existing mapping.
This ensures that duplicates mappings won't remain after the import.
Some additional cleanup cases are also covered. If there are multiple
mappings for the name normalized key such as:
`UID=alice1,OU=people,OU=hwengg,DC=min,DC=io`
`uid=alice1,ou=people,ou=hwengg,DC=min,DC=io`
`uid=alice1,ou=people,ou=hwengg,dc=min,dc=io`
we check if the list of policies mapped to all these keys are exactly
the same, and if so remove all of them and create a single mapping with
the normalized key. However, if the policies mapped to such keys differ,
the import operation returns an error as the server cannot automatically
pick the "right" list of policies to map.
When inspecting files like `.minio.sys/pool.bin` that may be present on multiple sets, use signature to separate them.
Also fixes null versions to actually be useful with `-export -combine`.
`minio_cluster_webhook_queue_length` was wrongly defined as `counter`
where-as it should be `gauge`
Following were wrongly defined as `gauge` when they should actually be
`counter`:
- minio_bucket_replication_sent_bytes
- minio_bucket_replication_received_bytes
- minio_bucket_replication_total_failed_bytes
- minio_bucket_replication_total_failed_count
When LDAP is enabled, previously we were:
- rejecting creation of users and groups via the IAM import functionality
- throwing a `not a valid DN` error when non-LDAP group mappings are present
This change allows for these cases as we need to support situations
where the MinIO server contains users, groups and policy mappings
created before LDAP was enabled.
instead upon any error in renameData(), we still
preserve the existing dataDir in some form for
recoverability in strange situations such as out
of disk space type errors.
Bonus: avoid running list and heal() instead allow
versions disparity to return the actual versions,
uuid to heal. Currently limit this to 100 versions
and lesser disparate objects.
an undo now reverts back the xl.meta from xl.meta.bkp
during overwrites on such flaky setups.
Bonus: Save N depth syscalls via skipping the parents
upon overwrites and versioned updates.
Flaky setup examples are stretch clusters with regular
packet drops etc, we need to add some defensive code
around to avoid dangling objects.
RenameData could start operating on inline data after timing out
and the call returned due to WithDeadline.
This could cause a buffer to write to the inline data being written.
Since no writes are in `RenameData` and the call is canceled,
this doesn't present a corruption issue. But a race is a race and
should be fixed.
Copy inline data to a fresh buffer.
This PR makes a feasible approach to handle all the scenarios
that we must face to avoid returning "panic."
Instead, we must return "errServerNotInitialized" when a
bucketMetadataSys.Get() is called, allowing the caller to
retry their operation and wait.
Bonus fix the way data-usage-cache stores the object.
Instead of storing usage-cache.bin with the bucket as
`.minio.sys/buckets`, the `buckets` must be relative
to the bucket `.minio.sys` as part of the object name.
Otherwise, there is no way to decommission entries at
`.minio.sys/buckets` and their final erasure set positions.
A bucket must never have a `/` in it. Adds code to read()
from existing data-usage.bin upon upgrade.
This PR fixes a few things
- FIPS support for missing for remote transports, causing
MinIO could end up using non-FIPS Ciphers in FIPS mode
- Avoids too many transports, they all do the same thing
to make connection pooling work properly re-use them.
- globalTCPOptions must be set before setting transport
to make sure the client conn deadlines are honored properly.
- GCS warm tier must re-use our transport
- Re-enable trailing headers support.
This reverts commit 928c0181bf.
This change was not correct, reverting.
We track 3 states with the ProxyRequest header - if replication process wants
to know if object is already replicated with a HEAD, it shouldn't proxy back
- Poorna
AWS S3 trailing header support was recently enabled on the warm tier
client connection to MinIO type remote tiers. With this enabled, we are
seeing the following error message at http transport layer.
> Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 400 Bad Request\r\nContent-Type: text/plain; charset=utf-8\r\nConnection: close\r\n\r\n400 Bad Request"; err=<nil>
This is an interim fix until we identify the root cause for this behaviour in the
minio-go client package.
Keep the EC in header, so it can be retrieved easily for dynamic quorum calculations.
To not force a full metadata decode on every read the value will be 0/0 for data written in previous versions.
Size is expected to increase by 2 bytes per version, since all valid values can be represented with 1 byte each.
Example:
```
λ xl-meta xl.meta
{
"Versions": [
{
"Header": {
"EcM": 4,
"EcN": 8,
"Flags": 6,
"ModTime": "2024-04-17T11:46:25.325613+02:00",
"Signature": "0a409875",
"Type": 1,
"VersionID": "8e03504e11234957b2727bc53eda0d55"
},
...
```
Not used for operations yet.
Follow up for #19528
If there are multiple existing DN mappings for the same normalized DN,
if they all have the same policy mapping value, we pick one of them of
them instead of returning an import error.
This is a change to IAM export/import functionality. For LDAP enabled
setups, it performs additional validations:
- for policy mappings on LDAP users and groups, it ensures that the
corresponding user or group DN exists and if so uses a normalized form
of these DNs for storage
- for access keys (service accounts), it updates (i.e. validates
existence and normalizes) the internally stored parent user DN and group
DNs.
This allows for a migration path for setups in which LDAP mappings have
been stored in previous versions of the server, where the name of the
mapping file stored on drives is not in a normalized form.
An administrator needs to execute:
`mc admin iam export ALIAS`
followed by
`mc admin iam import ALIAS /path/to/export/file`
The validations are more strict and returns errors when multiple
mappings are found for the same user/group DN. This is to ensure the
mappings stored by the server are unambiguous and to reduce the
potential for confusion.
Bonus **bug fix**: IAM export of access keys (service accounts) did not
export key name, description and expiration. This is fixed in this
change too.
Reading the list metacache is not protected by a lock; the code retries when it fails
to read the metacache object, however, it forgot to re-read the metacache object
from the drives, which is necessary, especially if the metacache object is inlined.
This commit will ensure that we always re-read the metacache object from the drives
when it is retrying.
When resuming a versioned listing where `version-id-marker=null`, the `null` object would
always be returned, causing duplicate entries to be returned.
Add check against empty version
unlinking() at two different locations on a disk when there
are lots to purge, this can lead to huge IOwaits, instead
rely on rename() to .trash to avoid running multiple unlinks()
in parallel.
since mid 2018 we do not have any deployments
without deployment-id, it is time to put this
code to rest, this PR removes this old code as
its no longer valuable.
on setups with 1000's of drives these are all
quite expensive operations.
The rest of the peer clients were not consistent across nodes. So, meta cache requests
would not go to the same server if a continuation happens on a different node.
As node metrics should be scraped per node basis, use a sample
configuartion using all the nodes in targets.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
When no results match or another error occurs, add an error to the stream. Keep the "inspect-input.txt" as the only thing in the zip for reference.
Example:
```
λ mc support inspect --airgap myminio/testbucket/fjghfjh/**
mc: Using public key from C:\Users\klaus\mc\support_public.pem
File data successfully downloaded as inspect-data.enc
λ inspect inspect-data.enc
Using private key from support_private.pem
output written to inspect-data.zip
2024/04/11 14:10:51 next stream: GetRawData: No files matched the given pattern
λ unzip -l inspect-data.zip
Archive: inspect-data.zip
Length Date Time Name
--------- ---------- ----- ----
222 2024-04-11 14:10 inspect-input.txt
--------- -------
222 1 file
λ
```
Modifies inspect to read until end of stream to report the error.
Bonus: Add legacy commandline params
Add following metrics:
- used_inodes
- total_inodes
- healing
- online
- reads_per_sec
- reads_kb_per_sec
- reads_await
- writes_per_sec
- writes_kb_per_sec
- writes_await
- perc_util
To be able to calculate the `per_sec` values, we capture the IOStats-related
data in the beginning (along with the time at which they were captured),
and compare them against the current values subsequently. This is because
dividing by "time since server uptime." doesn't work in k8s environments.
the disk location never changes in the lifetime of a
MinIO cluster, even if it did validate this close to the
disk instead at the higher layer.
Return appropriate errors indicating an invalid drive, so
that the drive is not recognized as part of a valid
drive.
we have had numerous reports on some config
values not having default values, causing
features misbehaving and not having default
values set properly.
This PR tries to address all these concerns
once and for all.
Each new sub-system that gets added
- must check for invalid keys
- must have default values set
- must not "return err" when being saved into
a global state() instead collate as part of
other subsystem errors allow other sub-systems
to independently initialize.
* Allow specifying the local server, with env variable _MINIO_SERVER_LOCAL, in systems where the hostname cannot be resolved to local IP
* Limit scope of the _MINIO_SERVER_LOCAL solution to only containerized implementations
Return an error when the user specifies endpoints for both source
and target. This can generate many type of errors as the code considers
a deployment remote if its endpoint is specified.
HealObject() does not return an error in some cases, for example, when
an object is successfully reconstructed in one disk but fails with other
disks, another case is when a disk does not have the object is temporarily
disconnected
Add the After heal drives result in the audit output for better
analysis.
Set object's modTime when being restored
restored here refers to making a temporary local copy in the hot tier
for a tiered object using the RestoreObject API
we have been using an LRU caching for internode
auth tokens, migrate to using a typed implementation
and also do not cache auth tokens when its an error.
This fixes a regression from #19358 which prevents policy mappings
created in the latest release from being displayed in policy entity
listing APIs.
This is due to the possibility that the base DNs in the LDAP config are
not in a normalized form and #19358 introduced normalized of mapping
keys (user DNs and group DNs). When listing, we check if the policy
mappings are on entities that parse as valid DNs that are descendants of
the base DNs in the config.
Test added that demonstrates a failure without this fix.
Create new code paths for multiple subsystems in the code. This will
make maintaing this easier later.
Also introduce bugLogIf() for errors that should not happen in the first
place.
Using oidc.redirectUri in the values.yaml only works for the deployment.
When using the statefulset the environment variable
MINIO_IDENTITY_OPENID_REDIRECT_URI is not set. This leads to errors with
oicd providers. For example keycloak throws the error 'invalid
redirect_uri'.
This pull request fixes that.
This commit replaces the `KMS.Stat` API call with a
`KMS.GenerateKey` call. This approach is more reliable
since data key generation also works when the KMS backend
is unavailable (temp. offline), but KES has cached the
key. Ref: KES offline caching.
With this change, it is less likely that MinIO readiness
checks fail in cases where the KMS backend is offline.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
Make sure to pass a nil pointer as a Transport to minio-go when the API config
is not initialized, this will make sure that we do not pass an interface
with a known type but a nil value.
This will also fix the update of the API remote_transport_deadline
configuration without requiring the cluster restart.
Use `ODirectPoolSmall` buffers for inline data in PutObject.
Add a separate call for inline data that will fetch a buffer for the inline data before unmarshal.
This fixes a bug where STS Accounts map accumulates accounts in memory
and never removes expired accounts and the STS Policy mappings were not
being refreshed.
The STS purge routine now runs with every IAM credentials load instead
of every 4th time.
The listing of IAM files is now cached on every IAM load operation to
prevent re-listing for STS accounts purging/reload.
Additionally this change makes each server pick a time for IAM loading
that is randomly distributed from a 10 minute interval - this is to
prevent server from thundering while performing the IAM load.
On average, IAM loading will happen between every 5-15min after the
previous IAM load operation completes.
Fix issue [minio#19314], resolve the absence of the sed command in ubi-micro by replacing it with echo.
Signed-off-by: Andreas Bräu <ab@andi95.de>
Co-authored-by: jiuker <2818723467@qq.com>
If site replication enabled across sites, replicate the SSE-C
objects as well. These objects could be read from target sites
using the same client encryption keys.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
User doesn't need to remember and enter the server values,
rather they can select from the pre populated list.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Instead of relying on user input values, we use the DN value returned by
the LDAP server.
This handles cases like when a mapping is set on a DN value
`uid=svc.algorithm,OU=swengg,DC=min,DC=io` with a user input value (with
unicode variation) of `uid=svc﹒algorithm,OU=swengg,DC=min,DC=io`. The
LDAP server on lookup of this DN returns the normalized value where the
unicode dot character `SMALL FULL STOP` (in the user input), gets
replaced with regular full stop.
Bonus: remove persistent md5sum calculation, turn-off
sha256 as well. Instead we always enable crc32c which
is enough for payload verification also support for
trailing headers checksum.
As total drives count, online vs offline are per node basis, its
corect to select node for which graphs need to be rendered.
Set prometheus scrape jobs to fetch metrics from all nodes. A sample
scrape job for node metrics could be as below
```
- job_name: minio-job-node
bearer_token: <token>
metrics_path: /minio/v2/metrics/node
scheme: https
tls_config:
insecure_skip_verify: true
static_configs:
- targets: [tenant1-ss-0-0.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-1.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-2.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-3.tenant1-hl.tenant-ns.svc.cluster.local:9000]
```
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Fix races in IAM cache
Fixes#19344
On the top level we only grab a read lock, but we write to the cache if we manage to fetch it.
a03dac41eb/cmd/iam-store.go (L446) is also flipped to what it should be AFAICT.
Change the internal cache structure to a concurrency safe implementation.
Bonus: Also switch grid implementation.
we must attempt to convert all errors at storage-rest-client
into StorageErr() regardless of what functionality is being
called in, this PR fixes this for multiple callers including
some internally used functions.
- old version was unable to retain messages during config reload
- old version could not go from memory to disk during reload
- new version can batch disk queue entries to single for to reduce I/O load
- error logging has been improved, previous version would miss certain errors.
- logic for spawning/despawning additional workers has been adjusted to trigger when half capacity is reached, instead of when the log queue becomes full.
- old version would json marshall x2 and unmarshal 1x for every log item. Now we only do marshal x1 and then we GetRaw from the store and send it without having to re-marshal.
panic seen due to premature closing of slow channel while listing is still sending or
list has already closed on the sender's side:
```
panic: close of closed channel
goroutine 13666 [running]:
github.com/minio/minio/internal/ioutil.SafeClose[...](0x101ff51e4?)
/Users/kp/code/src/github.com/minio/minio/internal/ioutil/ioutil.go:425 +0x24
github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1()
/Users/kp/code/src/github.com/minio/minio/cmd/erasure-server-pool.go:2142 +0x170
created by github.com/minio/minio/cmd.(*erasureServerPools).Walk in goroutine 1189
/Users/kp/code/src/github.com/minio/minio/cmd/erasure-server-pool.go:1985 +0x228
```
Object names of directory objects qualified for ExpiredObjectAllVersions
must be encoded appropriately before calling on deletePrefix on their
erasure set.
e.g., a directory object and regular objects with overlapping prefixes
could lead to the expiration of regular objects, which is not the
intention of ILM.
```
bucket/dir/ ---> directory object
bucket/dir/obj-1
```
When `bucket/dir/` qualifies for expiration, the current implementation would
remove regular objects under the prefix `bucket/dir/`, in this case,
`bucket/dir/obj-1`.
In handlers related to health diagnostics e.g. CPU, Network, Partitions,
etc, globalMinioHost was being passed as the addr, resulting in empty
value for the same in the health report.
Using globalLocalNodeName instead fixes the issue.
IAM loading is a lazy operation, allow these
fallbacks to be in place when we cannot find
in-memory state().
this allows us to honor the request even if pay
a small price for lookup and populating the data.
When objects have more versions than their ILM policy expects to retain
via NewerNoncurrentVersions, but they don't qualify for expiry due to
NoncurrentDays are configured in that rule.
In this case, applyNewerNoncurrentVersionsLimit method was enqueuing empty
tasks, which lead to a panic (panic: runtime error: index out of range [0] with
length 0) in newerNoncurrentTask.OpHash method, which assumes the task
to contain at least one version to expire.
When returning the status of a decommissioned pool, a pool with zero
time StartedTime will be considered an active pool, which is unexpected.
This commit will always ensure that a pool's canceled/failed/completed
status is returned.
This commit changes how MinIO generates the object encryption key (OEK)
when encrypting an object using server-side encryption.
This change is fully backwards compatible. Now, MinIO generates
the OEK as following:
```
Nonce = RANDOM(32) // generate 256 bit random value
OEK = HMAC-SHA256(EK, Context || Nonce)
```
Before, the OEK was computed as following:
```
Nonce = RANDOM(32) // generate 256 bit random value
OEK = SHA256(EK || Nonce)
```
The new scheme does not technically fix a security issue but
uses a more familiar scheme. The only requirement for the
OEK generation function is that it produces a (pseudo)random value
for every pair (`EK`,`Nonce`) as long as no `EK`-`Nonce` combination
is repeated. This prevents a faulty PRNG from repeating or generating
a "bad" key.
The previous scheme guarantees that the `OEK` is a (pseudo)random
value given that no pair (`EK`,`Nonce`) repeats under the assumption
that SHA256 is indistinguable from a random oracle.
The new scheme guarantees that the `OEK` is a (pseudo)random value
given that no pair (`EK`, `Nonce`) repeats under the assumption that
SHA256's underlying compression function is a PRF/PRP.
While the later is a weaker assumption, and therefore, less likely
to be false, both are considered true. SHA256 is believed to be
indistinguable from a random oracle AND its compression function
is assumed to be a PRF/PRP.
As far as the OEK generating is concerned, the OS random number
generator is not required to be pseudo-random but just non-repeating.
Apart from being more compatible to standard definitions and
descriptions for how to generate crypto. keys, this change does not
have any impact of the actual security of the OEK key generation.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
avoids error during upgrades such as
```
API: SYSTEM()
Time: 19:19:22 UTC 03/18/2024
DeploymentID: 24e4b574-b28d-4e94-9bfa-03c363a600c2
Error: Invalid api configuration: found invalid keys (expiry_workers=100 ) for 'api' sub-system, use 'mc admin config reset myminio api' to fix invalid keys (*fmt.wrapError)
11: internal/logger/logger.go:260:logger.LogIf()
...
```
we were prematurely not writing 4k pages while we
could have due to the fact that most buffers would
be multiples of 4k upto some number and there shall
be some remainder.
We only need to write the remainder without O_DIRECT.
at scale customers might start with failed drives,
causing skew in the overall usage ratio per EC set.
make this configurable such that customers can turn
this off as needed depending on how comfortable they
are.
Cosmetic change, but breaks up a big code block and will make a goroutine
dumps of streams are more readable, so it is clearer what each goroutine is doing.
Currently, the code relies on object parity to decide whether it is a
delete marker or a regular object. In the case of a delete marker, the
return quorum is half of the disks in the erasure set. However, this
calculation must be corrected with objects with EC = 0, mainly
because EC is not a one-time fixed configuration.
Though all data are correct, the manifested symptom is a 503 with an
EC=0 object. This bug was manifested after we introduced the
fast Get Object feature that does not read all data from all disks in
case of inlined objects
Metrics v3 is mainly a reorganization of metrics into smaller groups of
metrics and the removal of internal aggregation of metrics received from
peer nodes in a MinIO cluster.
This change adds the endpoint `/minio/metrics/v3` as the top-level metrics
endpoint and under this, various sub-endpoints are implemented. These
are currently documented in `docs/metrics/v3.md`
The handler will serve metrics at any path
`/minio/metrics/v3/PATH`, as follows:
when PATH is a sub-endpoint listed above => serves the group of
metrics under that path; or when PATH is a (non-empty) parent
directory of the sub-endpoints listed above => serves metrics
from each child sub-endpoint of PATH. otherwise, returns a no
resource found error
All available metrics are listed in the `docs/metrics/v3.md`. More will
be added subsequently.
When an object qualifies for both tiering and expiration rules and is
past its expiration date, it should be expired without requiring to tier
it, even when tiering event occurs before expiration.
Merging same-object - multiple versions from different pools would not always result in correct ordering.
When merging keep inputs separate.
```
λ mc ls --versions local/testbucket
------ before ------
[2024-03-05 20:17:19 CET] 228B STANDARD 1f163718-9bc5-4b01-bff7-5d8cf09caf10 v3 PUT hosts
[2024-03-05 20:19:56 CET] 19KiB STANDARD null v2 PUT hosts
[2024-03-05 20:17:15 CET] 228B STANDARD 73c9f651-f023-4566-b012-cc537fdb7ce2 v1 PUT hosts
------ after ------
λ mc ls --versions local/testbucket
[2024-03-05 20:19:56 CET] 19KiB STANDARD null v3 PUT hosts
[2024-03-05 20:17:19 CET] 228B STANDARD 1f163718-9bc5-4b01-bff7-5d8cf09caf10 v2 PUT hosts
[2024-03-05 20:17:15 CET] 228B STANDARD 73c9f651-f023-4566-b012-cc537fdb7ce2 v1 PUT hosts
```
configure batch size to send audit/logger events
in batches instead of sending one event per connection.
this is mainly to optimize the number of requests
we make to webhook endpoint.
Currently, the progress of the batch job is saved in inside the job
request object, which is normally not supported by MinIO. Though there
is no apparent bug, it is better to fix this now.
Batch progress is saved in .minio.sys/batch-jobs/reports/
Co-authored-by: Anis Eleuch <anis@min.io>
our PoolNumber calculation was costly,
while we already had this information per
endpoint, we needed to deduce it appropriately.
This PR addresses this by assigning PoolNumbers
field that carries all the pool numbers that
belong to a server.
properties.PoolNumber still carries a valid value
only when len(properties.PoolNumbers) == 1, otherwise
properties.PoolNumber is set to math.MaxInt (indicating
that this value is undefined) and then one must rely
on properties.PoolNumbers for server participation
in multiple pools.
addresses the issue originating from #11327
simplify audit webhook worker model
fixes couple of bugs like
- ping(ctx) was creating a logger without updating
number of workers leading to incorrect nWorkers
scaling, causing an additional worker that is not
tracked properly.
- h.logCh <- entry could potentially hang for when
the queue is full on heavily loaded systems.
there can be a sudden spike in tiny allocations,
due to too much auditing being done, also don't hang
on the
```
h.logCh <- entry
```
after initializing workers if you do not have a way to
dequeue for some reason.
This commits adds support for using the `--endpoint` arg when creating a
tier of type `azure`. This is needed to connect to Azure's Gov Cloud
instance. For example,
```
mc ilm tier add azure TARGET TIER_NAME \
--account-name ACCOUNT \
--account-key KEY \
--bucket CONTAINER \
--endpoint https://ACCOUNT.blob.core.usgovcloudapi.net
--prefix PREFIX \
--storage-class STORAGE_CLASS
```
Prior to this, the endpoint was hardcoded to `https://ACCOUNT.blob.core.windows.net`.
The docs were even explicit about this, stating that `--endpoint` is:
"Required for `s3` or `minio` tier types. This option has no effect for any
other value of `TIER_TYPE`."
Now, if the endpoint arg is present it will be used. If not, it will
fall back to the same default behavior of `ACCOUNT.blob.core.windows.net`.
Remove api.expiration_workers config setting which was inadvertently left behind. Per review comment
https://github.com/minio/minio/pull/18926, expiration_workers can be configured via ilm.expiration_workers.
ext4, xfs support this behavior however
btrfs, nfs may not support it properly.
in-case when we see Nlink < 2 then we know
that we need to fallback on readdir()
fixes a regression from #19100fixes#19181
The middleware sets up tracing, throttling, gzipped responses and
collecting API stats.
Additionally, this change updates the names of handler functions in
metric labels to be the same as the name derived from Go lang reflection
on the handler name.
The metric api labels are now stored in memory the same as the handler
name - they will be camelcased, e.g. `GetObject` instead of `getobject`.
For compatibility, we lowercase the metric api label values when emitting the metrics.
- Use a shared worker pool for all ILM expiry tasks
- Free version cleanup executes in a separate goroutine
- Add a free version only if removing the remote object fails
- Add ILM expiry metrics to the node namespace
- Move tier journal tasks to expiryState
- Remove unused on-disk journal for tiered objects pending deletion
- Distribute expiry tasks across workers such that the expiry of versions of
the same object serialized
- Ability to resize worker pool without server restart
- Make scaling down of expiryState workers' concurrency safe; Thanks
@klauspost
- Add error logs when expiryState and transition state are not
initialized (yet)
* metrics: Add missed tier journal entry tasks
* Initialize the ILM worker pool after the object layer
With this commit, MinIO generates root credentials automatically
and deterministically if:
- No root credentials have been set.
- A KMS (KES) is configured.
- API access for the root credentials is disabled (lockdown mode).
Before, MinIO defaults to `minioadmin` for both the access and
secret keys. Now, MinIO generates unique root credentials
automatically on startup using the KMS.
Therefore, it uses the KMS HMAC function to generate pseudo-random
values. These values never change as long as the KMS key remains
the same, and the KMS key must continue to exist since all IAM data
is encrypted with it.
Backward compatibility:
This commit should not cause existing deployments to break. It only
changes the root credentials of deployments that have a KMS configured
(KES, not a static key) but have not set any admin credentials. Such
implementations should be rare or not exist at all.
Even if the worst case would be updating root credentials in mc
or other clients used to administer the cluster. Root credentials
are anyway not intended for regular S3 operations.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
just like client-conn-read-deadline, added a new flag that does
client-conn-write-deadline as well.
Both are not configured by default, since we do not yet know
what is the right value. Allow this to be configurable if needed.
we should do this to ensure that we focus on
data healing as primary focus, fixing metadata
as part of healing must be done but making
data available is the main focus.
the main reason is metadata inconsistencies can
cause data availability issues, which must be
avoided at all cost.
will be bringing in an additional healing mechanism
that involves "metadata-only" heal, for now we do
not expect to have these checks.
continuation of #19154
Bonus: add a pro-active healthcheck to perform a connection
This change makes the label names consistent with the handler names.
This is in preparation to use reflection based API handler function
names for the api labels so they will be the same as tracing, auditing
and logging names for these API calls.
Moved different dashboards to their specific directories. Also
mentioned that these dashbards are examples of how to create
graphs using MinIO provided and metrics and customers should
change / add graphs on their specific need basis.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Streams can return errors if the cancelation is picked up before the response
stream close is picked up. Under extreme load, this could lead to missing
responses.
Send server mux ack async so a blocked send cannot block newMuxStream
call. Stream will not progress until mux has been acked.
in k8s things really do come online very asynchronously,
we need to use implementation that allows this randomness.
To facilitate this move WriteAll() as part of the
websocket layer instead.
Bonus: avoid instances of dnscache usage on k8s
New disk healing code skips/expires objects that ILM supposed to expire.
Add more visibility to the user about this activity by calculating those
objects and print it at the end of healing activity.
This PR fixes a bug that perhaps has been long introduced,
with no visible workarounds. In any deployment, if an entire
erasure set is deleted, there is no way the cluster recovers.
Currently, if one object tag matches with one lifecycle tag filter, ILM
will select it, however, this is wrong. All the Tag filters in the
lifecycle document should be satisfied.
This change is to decouple need for root credentials to match between
site replication deployments.
Also ensuring site replication config initialization is re-tried until
it succeeds, this deoendency is critical to STS flow in site replication
scenario.
Currently, we read from `/proc/diskstats` which is found to be
un-reliable in k8s environments. We can read from `sysfs` instead.
Also, cache the latest drive io stats to find the diff and update
the metrics.
* Remove lock for cached operations.
* Rename "Relax" to `ReturnLastGood`.
* Add `CacheError` to allow caching values even on errors.
* Add NoWait that will return current value with async fetching if within 2xTTL.
* Make benchmark somewhat representative.
```
Before: BenchmarkCache-12 16408370 63.12 ns/op 0 B/op
After: BenchmarkCache-12 428282187 2.789 ns/op 0 B/op
```
* Remove `storageRESTClient.scanning`. Nonsensical - RPC clients will not have any idea about scanning.
* Always fetch remote diskinfo metrics and cache them. Seems most calls are requesting metrics.
* Do async fetching of usage caches.
It also fixes a long-standing bug in expiring transitioned objects.
The expiration action was deleting the current version in the case'
of tiered objects instead of adding a delete marker.
only enable md5sum if explicitly asked by the client, otherwise
its not necessary to compute md5sum when SSE-KMS/SSE-C is enabled.
this is continuation of #17958
If network conditions have filled the output queue before a reconnect happens blocked sends could stop reconnects from happening. In short `respMu` would be held for a mux client while sending - if the queue is full this will never get released and closing the mux client will hang.
A) Use the mux client context instead of connection context for sends, so sends are unblocked when the mux client is canceled.
B) Use a `TryLock` on "close" and cancel the request if we cannot get the lock at once. This will unblock any attempts to send.
data-dir not being present is okay, however we can still
rely on the `rename()` atomic call instead of relying on
write xl.meta write which may truncate the io.EOF.
Add a new function logger.Event() to send the log to Console and
http/kafka log webhooks. This will include some internal events such as
disk healing and rebalance/decommissioning
the PR in #16541 was incorrect and hand wrong assumptions
about the overall setup, revert this since this expectation
to have offline servers is wrong and we can end up with a
bigger chicken and egg problem.
This reverts commit 5996c8c4d5.
Bonus:
- preserve disk in globalLocalDrives properly upon connectDisks()
- do not return 'nil' from newXLStorage(), getting it ready for
the next set of changes for 'format.json' loading.
The previous logic of calculating per second values for disk io stats
divides the stats by the host uptime. This doesn't work in k8s
environment as the uptime is of the pod, but the stats (from
/proc/diskstats) are from the host.
Fix this by storing the initial values of uptime and the stats at the
timme of server startup, and using the difference between current and
initial values when calculating the per second values.
globalLocalDrives seem to be not updated during the
HealFormat() leads to a requirement where the server
needs to be restarted for the healing to continue.
a/prefix
a/prefix/1.txt
where `a/prefix` is an object which does not have `/` at the end,
we do not have to aggressively recursively delete all the sub-folders
as well. Instead convert the call into self contained to deleting
'xl.meta' and then subsequently attempting to Remove the parent.
Bonus: enable audit alerts for object versions
beyond the configured value, default is '100'
versions per object beyond which scanner will
alert for each such objects.
when we expand via pools, there is no reason to stick
with the same distributionAlgo as the rest. Since the
algo only makes sense with-in a pool not across pools.
This allows for newer pools to use newer codepaths to
avoid legacy file lookups when they have a pre-existing
deployment from 2019, they can expand their new pool
to be of a newer distribution format, allowing the
pool to be more performant.
Fix reported races that are actually synchronized by network calls.
But this should add some extra safety for untimely disconnects.
Race reported:
```
WARNING: DATA RACE
Read at 0x00c00171c9c0 by goroutine 214:
github.com/minio/minio/internal/grid.(*muxClient).addResponse()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:519 +0x111
github.com/minio/minio/internal/grid.(*muxClient).error()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:470 +0x21d
github.com/minio/minio/internal/grid.(*Connection).handleDisconnectClientMux()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:1391 +0x15b
github.com/minio/minio/internal/grid.(*Connection).handleMsg()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:1190 +0x1ab
github.com/minio/minio/internal/grid.(*Connection).handleMessages.func1()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:981 +0x610
Previous write at 0x00c00171c9c0 by goroutine 1081:
github.com/minio/minio/internal/grid.(*muxClient).roundtrip()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:94 +0x324
github.com/minio/minio/internal/grid.(*muxClient).traceRoundtrip()
e:/gopath/src/github.com/minio/minio/internal/grid/trace.go:74 +0x10e4
github.com/minio/minio/internal/grid.(*Subroute).Request()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:366 +0x230
github.com/minio/minio/internal/grid.(*SingleHandler[go.shape.*github.com/minio/minio/cmd.DiskInfoOptions,go.shape.*github.com/minio/minio/cmd.DiskInfo]).Call()
e:/gopath/src/github.com/minio/minio/internal/grid/handlers.go:554 +0x3fd
github.com/minio/minio/cmd.(*storageRESTClient).DiskInfo()
e:/gopath/src/github.com/minio/minio/cmd/storage-rest-client.go:314 +0x270
github.com/minio/minio/cmd.erasureObjects.getOnlineDisksWithHealingAndInfo.func1()
e:/gopath/src/github.com/minio/minio/cmd/erasure.go:293 +0x171
```
This read will always happen after the write, since there is a network call in between.
However a disconnect could come in while we are setting up the call, so we protect against that with extra checks.
- bucket metadata does not need to look for legacy things
anymore if b.Created is non-zero
- stagger bucket metadata loads across lots of nodes to
avoid the current thundering herd problem.
- Remove deadlines for RenameData, RenameFile - these
calls should not ever be timed out and should wait
until completion or wait for client timeout. Do not
choose timeouts for applications during the WRITE phase.
- increase R/W buffer size, increase maxMergeMessages to 30
We have observed cases where a blocked stream will block for cancellations.
This happens when response channel is blocked and we want to push an error.
This will have the response mutex locked, which will prevent all other operations until upstream is unblocked.
Make this behavior non-blocking and if blocked spawn a goroutine that will send the response and close the output.
Still a lot of "dancing". Added a test for this and reviewed.
Depending on when the context cancelation is picked up the handler may return and close the channel before `SubscribeJSON` returns, causing:
```
Feb 05 17:12:00 s3-us-node11 minio[3973657]: panic: send on closed channel
Feb 05 17:12:00 s3-us-node11 minio[3973657]: goroutine 378007076 [running]:
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub.(*PubSub[...]).SubscribeJSON.func1()
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub/pubsub.go:139 +0x12d
Feb 05 17:12:00 s3-us-node11 minio[3973657]: created by github.com/minio/minio/internal/pubsub.(*PubSub[...]).SubscribeJSON in goroutine 378010884
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub/pubsub.go:124 +0x352
```
Wait explicitly for the goroutine to exit.
Bonus: Listen for doneCh when sending to not risk getting blocked there is channel isn't being emptied.
this fixes rare bugs we have seen but never really found a
reproducer for
- PutObjectRetention() returning 503s
- PutObjectTags() returning 503s
- PutObjectMetadata() updates during replication returning 503s
These calls return errors, and this perpetuates with
no apparent fix.
This PR fixes with correct quorum requirement.
To force limit the duration of STS accounts, the user can create a new
policy, like the following:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["sts:AssumeRoleWithWebIdentity"],
"Condition": {"NumericLessThanEquals": {"sts:DurationSeconds": "300"}}
}]
}
And force binding the policy to all OpenID users, whether using a claim name or role
ARN.
disk tokens usage is not necessary anymore with the implementation
of deadlines for storage calls and active monitoring of the drive
for I/O timeouts.
Functionality kicking off a bad drive is still supported, it's just that
we do not have to serialize I/O in the manner tokens would do.
Recycle would always be called on the dummy value `any(newRT())` instead of the actual value given to the recycle function.
Caught by race tests, but mostly harmless, except for reduced perf.
Other minor cleanups. Introduced in #18940 (unreleased)
Interpret `null` inline policy for access keys as inheriting parent
policy. Since MinIO Console currently sends this value, we need to honor it
for now. A larger fix in Console and in the server are required.
Fixes#18939.
Allow internal types to support a `Recycler` interface, which will allow for sharing of common types across handlers.
This means that all `grid.MSS` (and similar) objects are shared across in a common pool instead of a per-handler pool.
Add internal request reuse of internal types. Add for safe (pointerless) types explicitly.
Only log params for internal types. Doing Sprint(obj) is just a bit too messy.
With this change, only a user with `UpdateServiceAccountAdminAction`
permission is able to edit access keys.
We would like to let a user edit their own access keys, however the
feature needs to be re-designed for better security and integration with
external systems like AD/LDAP and OpenID.
This change prevents privilege escalation via service accounts.
for actionable, inspections we have `mc support inspect`
we do not need double logging, healing will report relevant
errors if any, in terms of quorum lost etc.
Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.
We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
- healing must not set the write xattr
because that is the job of active healing
to update. what we need to preserve is
permanent deletes.
- remove older env for drive monitoring and
enable it accordingly, as a global value.
local disk metrics were polluting cluster metrics
Please remove them instead of adding relevant ones.
- batch job metrics were incorrectly kept at bucket
metrics endpoint, move it to cluster metrics.
- add tier metrics to cluster peer metrics from the node.
- fix missing set level cluster health metrics
Do not rely on `connChange` to do reconnects.
Instead, you can block while the connection is running and reconnect
when handleMessages returns.
Add fully async monitoring instead of monitoring on the main goroutine
and keep this to avoid full network lockup.
it is entirely possible that a rebalance process which was running
when it was asked to "stop" it failed to write its last statistics
to the disk.
After this a pool expansion can cause disruption and all S3 API
calls would fail at IsPoolRebalancing() function.
This PRs makes sure that we update rebalance.bin under such
conditions to avoid any runtime crashes.
add new update v2 that updates per node, allows idempotent behavior
new API ensures that
- binary is correct and can be downloaded checksummed verified
- committed to actual path
- restart returns back the relevant waiting drives
do not need to be defensive in our approach,
we should simply override anything everything
in import process, do not care about what
currently exists on the disk - backup is the
source of truth.
Right now the format.json is excluded if anything within `.minio.sys` is requested.
I assume the check was meant to exclude only if it was actually requesting it.
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
to make a NoOp DiskInfo call.
Add separate reconnection mutex
Give more safety around reconnects and make sure a state change isn't missed.
Tested with several runs of `λ go test -race -v -count=500`
Adds separate mutex and doesn't mix in the testing mutex.
AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion,
which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream.
Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value.
Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT.
Fixes races like:
```
WARNING: DATA RACE
Read at 0x00c130b29d10 by goroutine 470237:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611
github.com/minio/minio/cmd.readFileInfo()
github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1()
github.com/minio/minio/cmd/erasure-object.go:809 +0x27a
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2()
github.com/minio/minio/cmd/erasure-object.go:828 +0x61
Previous write at 0x00c130b29d10 by goroutine 470298:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
WARNING: DATA RACE
Write at 0x00c0ba6e6c00 by goroutine 94507:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
Previous read at 0x00c0ba6e6c00 by goroutine 94463:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e
github.com/minio/minio/cmd.getBucketInfoLocal.func1()
github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122
github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1()
```
Probably back from #17701
protection was in place. However, it covered only some
areas, so we re-arranged the code to ensure we could hold
locks properly.
Along with this, remove the DataShardFix code altogether,
in deployments with many drive replacements, this can affect
and lead to quorum loss.
Also limit the amount of concurrency when sending
binary updates to peers, avoid high network over
TX that can cause disconnection events for the
node sending updates.
Race checks would occasionally show race on handleMsgWg WaitGroup by debug messages (used in test only).
Use the `connMu` mutex to protect this against concurrent Wait/Add.
Fixes#18827
If site replication is enabled, we should still show the size and
version distribution histogram metrics at bucket level.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
New API now verifies any hung disks before restart/stop,
provides a 'per node' break down of the restart/stop results.
Provides also how many blocked syscalls are present on the
drives and what users must do about them.
Adds options to do pre-flight checks to provide information
to the user regarding any hung disks. Provides 'force' option
to forcibly attempt a restart() even with waiting syscalls
on the drives.
When rejecting incoming grid requests fill out the rejection reason and log it once.
This will give more context when startup is failing. Already logged after a retry on caller.
On a policy detach operation, if there are no policies remaining
attached to the user/group, remove the policy mapping file, instead of
leaving a file containing an empty list of policies.
Healing dangling buckets is conservative, and it is a typical use case to
fail to remove a dangling bucket because it contains some data because
healing danging bucket code is not allowed to remove data: only healing
the dangling object is allowed to do so.
reference format is constant for any lifetime of
a minio cluster, we do not have to ever replace
it during HealFormat() as it will never change.
additionally we should simply reject reference
formats that we do not understand early on.
GetActualSize() was heavily relying on o.Parts()
to be non-empty to figure out if the object is multipart or not,
However, we have many indicators of whether an object is multipart
or not.
Blindly assuming that o.Parts == nil is not a multipart, is an
incorrect expectation instead, multipart must be obtained via
- Stored metadata value indicating this is a multipart encrypted object.
- Rely on <meta>-actual-size metadata to get the object's actual size.
This value is preserved for additional reasons such as these.
- ETag != 32 length
support proxying of tagging requests in active-active replication
Note: even if proxying is successful, PutObjectTagging/DeleteObjectTagging
will continue to report a 404 since the object is not present locally.
New intervals:
[1024B, 64KiB)
[64KiB, 256KiB)
[256KiB, 512KiB)
[512KiB, 1MiB)
The new intervals helps us see object size distribution with higher
resolution for the interval [1024B, 1MiB).
- HealFormat() was leaking healthcheck goroutines for
disks, we are only interested in enabling healthcheck
for the newly formatted disk, not for existing disks.
- When disk is a root-disk a random disk monitor was
leaking while we ignored the drive.
- When loading the disk for each erasure set, we were
leaking goroutines for the prepare-storage.go disks
which were replaced via the globalLocalDrives slice
- avoid disk monitoring utilizing health tokens that
would cause exhaustion in the tokens, prematurely
which were meant for incoming I/O. This is ensured
by avoiding writing O_DIRECT aligned buffer instead
write 2048 worth of content only as O_DSYNC, which is
sufficient.
Add a hidden configuration under the scanner sub section to configure if
the scanner should sleep between two objects scan. The configuration has
only effect when there is no drive activity related to s3 requests or
healing.
By default, the code will keep the current behavior which is doing
sleep between objects.
To forcefully enable the full scan speed in idle mode, you can do this:
`mc admin config set myminio scanner idle_speed=full`
fixes#18724
A regression was introduced in #18547, that attempted
to file adding a missing `null` marker however we
should not skip returning based on versionID instead
it must be based on if we are being asked to create
a DEL marker or not.
The PR also has a side-affect for replicating `null`
marker permanent delete, as it may end up adding a
`null` marker while removing one.
This PR should address both scenarios.
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups.
To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap.
Once this has been enabled, you need to also set `list_quorum`.
```
~ mc admin config set alias/ api list_quorum=auto`
```
A new debugging tool is available to check for any missing counters.
Following policies if present
```
"Condition": {
"IpAddress": {
"aws:SourceIp": [
"54.240.143.0/24",
"2001:DB8:1234:5678::/64"
]
}
}
```
And client is making a request to MinIO via IPv6 can
potentially crash the server.
Workarounds are turn-off IPv6 and use only IPv4
This PR also increases per node bpool memory from 1024 entries
to 2048 entries; along with that, it also moves the byte pool
centrally instead of being per pool.
minio_node_tier_ttlb_seconds - Distribution of time to last byte for streaming objects from warm tier
minio_node_tier_requests_success - Number of requests to download object from warm tier that were successful
minio_node_tier_requests_failure - Number of requests to download object from warm tier that failed
SUBNET now has a v2 of license that is returned in the new key
`license_v2`. mc will start reading and storing the same. (The old key
`license` is deprecated but is still available in SUBNET response to
ensure that the current released version of minio doesn't break)
`(*xlStorageDiskIDCheck).CreateFile` wraps the incoming reader in `xioutil.NewDeadlineReader`.
The wrapped reader is handed to `(*xlStorage).CreateFile`. This performs a Read call via `writeAllDirect`,
which reads into an `ODirectPool` buffer.
`(*DeadlineReader).Read` spawns an async read into the buffer. If a timeout is hit while reading,
the read operation returns to `writeAllDirect`. The operation returns an error and the buffer is reused.
However, if the async `Read` call unblocks, it will write to the now recycled buffer.
Fix: Remove the `DeadlineReader` - it is inherently unsafe. Instead, rely on the network timeouts.
This is not a disk timeout, anyway.
Regression in https://github.com/minio/minio/pull/17745
This patch adds the targetID to the existing notification target metrics
and deprecates the current target metrics which points to the overall
event notification subsystem
historically, we have always kept storage-rest-server
and a local storage API separate without much trouble,
since they both can independently operate due to no
special state() between them.
however, over some time, we have added state()
such as
- drive monitoring threads now there will be "2" of
them per drive instead of just 1.
- concurrent tokens available per drive are now twice
instead of just single shared, allowing unexpectedly
high amount of I/O to go through.
- applying serialization by using walkMutexes can now
be adequately honored for both remote callers and local
callers.
Regression from #18285. CopyObject options were inheriting source MTime
for metadata timestamps if unspecified, removing this prevented metadata
updates from being applied on target.
The metrics `minio_bucket_replication_received_bytes` and
`minio_bucket_replication_sent_bytes` are additive in nature
and rendering the value as is looks fine.
Also added sort order for few graphs for better reading of tool
tips as keeping ones with highest value at top helps.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
By default the cpu load is the cumulative of all cores. Capture the
percentage load (load * 100 / cpu-count)
Also capture the percentage memory used (used * 100 / total)
use memory for async events when necessary and dequeue them as
needed, for all synchronous events customers must enable
```
MINIO_API_SYNC_EVENTS=on
```
Async events can be lost but is upto to the admin to
decide what they want, we will not create run-away number
of goroutines per event instead we will queue them properly.
Currently the max async workers is set to runtime.GOMAXPROCS(0)
which is more than sufficient in general, but it can be made
configurable in future but may not be needed.
there is potential for danglingWrites when quorum failed, where
only some drives took a successful write, generally this is left
to the healing routine to pick it up. However it is better that
we delete it right away to avoid potential for quorum issues on
version signature when there are many versions of an object.
it is okay if the warm-tier cannot keep up, we should continue
to take I/O at hot-tier, only fail hot-tier or block it when
we are disk full.
Bonus: add metrics counter for these missed tasks, we will
know for sure if one of the node is lagging behind or is
losing too many tasks during transitioning.
A disk that is not able to initialize when an instance is started
will never have a handler registered, which means a user will
need to restart the node after fixing the disk;
This will also prevent showing the wrong 'upgrade is needed.'
error message in that case.
When the disk is still failing, print an error every 30 minutes;
Disk reconnection will be retried every 30 seconds.
Co-authored-by: Anis Elleuch <anis@min.io>
`OpMuxConnectError` was not handled correctly.
Remove local checks for single request handlers so they can
run before being registered locally.
Bonus: Only log IAM bootstrap on startup.
```
using deb packager...
created package: minio-release/linux-amd64/minio_20231120224007.0.0.hotfix.e96ac7272_amd64.deb
using rpm packager...
created package: minio-release/linux-amd64/minio-20231120224007.0.0.hotfix.e96ac7272-1.x86_64.rpm
```
While healing the latest changes of expiry rules across sites
if target had pre existing transition rules, they were getting
overwritten as cloned latest expiry rules from remote site were
getting written as is. Fixed the same and added test cases as
well.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
moveToTrash() function moves a folder to .trash, for example, when
doing some object deletions: a data dir that has many parts will be
renamed to the trash folder; However, ENOSPC is a valid error from
rename(), and it can cripple a user trying to free some space in an
entire disk situation.
Therefore, this commit will try to do a recursive delete in that case.
This allows batch replication to basically do not
attempt to copy objects that do not have read quorum.
This PR also allows walk() to provide custom
values for quorum under batch replication, and
key rotation.
this PR allows following policy
```
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Deny a presigned URL request if the signature is more than 10 min old",
"Effect": "Deny",
"Action": "s3:*",
"Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET1/*",
"Condition": {
"NumericGreaterThan": {
"s3:signatureAge": 600000
}
}
}
]
}
```
This is to basically disable all pre-signed URLs that are older than 10 minutes.
AWS S3 closes keep-alive connections frequently
leading to frivolous logs filling up the MinIO
logs when the transition tier is an AWS S3 bucket.
Ignore such transient errors, let MinIO retry
it when it can.
When minio runs with MINIO_CI_CD=on, it is expected to communicate
with the locally running SUBNET. This is happening in the case of MinIO
via call home functionality. However, the subnet-related functionality inside the
console continues to talk to the SUBNET production URL. Because of this,
the console cannot be tested with a locally running SUBNET.
Set the env variable CONSOLE_SUBNET_URL correctly in such cases.
(The console already has code to use the value of this variable
as the subnet URL)
Optionally allows customers to enable
- Enable an external cache to catch GET/HEAD responses
- Enable skipping disks that are slow to respond in GET/HEAD
when we have already achieved a quorum
Bonus: allow replication to attempt Deletes/Puts when
the remote returns quorum errors of some kind, this is
to ensure that MinIO can rewrite the namespace with the
latest version that exists on the source.
This PR adds a WebSocket grid feature that allows servers to communicate via
a single two-way connection.
There are two request types:
* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
roundtrips with small payloads.
* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
which allows for different combinations of full two-way streams with an initial payload.
Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.
Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.
If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte
slices or use a higher-level generics abstraction.
There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.
The request path can be changed to a new one for any protocol changes.
First, all servers create a "Manager." The manager must know its address
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using.
```
func (m *Manager) Connection(host string) *Connection
```
All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight
requests and responses must also be given for streaming requests.
The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.
* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
performs a single request and returns the result. Any deadline provided on the request is
forwarded to the server, and canceling the context will make the function return at once.
* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
will initiate a remote call and send the initial payload.
```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
//The appropriate error will be returned.
type Stream struct {
// Responses from the remote server.
// Channel will be closed after an error or when the remote closes.
// All responses *must* be read by the caller until either an error is returned or the channel is closed.
// Canceling the context will cause the context cancellation error to be returned.
Responses <-chan Response
// Requests sent to the server.
// If the handler is defined with 0 incoming capacity this will be nil.
// Channel *must* be closed to signal the end of the stream.
// If the request context is canceled, the stream will no longer process requests.
Requests chan<- []byte
}
type Response struct {
Msg []byte
Err error
}
```
There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
With an odd number of drives per erasure set setup, the write/quorum is
the half + 1; however the decommissioning listing will still list those
objects and does not consider those as stale.
Fix it by using (N+1)/2 formula.
Co-authored-by: Anis Elleuch <anis@min.io>
Immediate transition use case and is mostly used to fill warm
backend with a lot of data when a new deployment is created
Currently, if the transition queue is complete, the transition will be
deferred to the scanner; change this behavior by blocking the PUT request
until the transition queue has a new place for a transition task.
Currently, once the audit becomes offline, there is no code that tries
to reconnect to the audit, at the same time Send() quickly returns with
an error without really trying to send a message the audit endpoint; so
the audit endpoint will never be online again.
Fixing this behavior; the current downside is that we miss printing some
logs when the audit becomes offline; however this information is
available in prometheus
Later, we can refactor internal/logger so the http endpoint can send errors to
console target.
Currently if the object does not exist in quorum disks of an erasure
set, the dangling code is never called because the returned error will
be errFileNotFound or errFileVersionNotFound;
With this commit, when errFileNotFound or errFileVersionNotFound is
returning when trying to calculate the quorum of a given object, the
code checks if a disk returned nil, which means a stale object exists in
that disk, that will trigger deleteIfDangling() function
This commit splits the liveness and readiness
handler into two separate handlers. In K8S, a
liveness probe is used to determine whether the
pod is in "live" state and functioning at all.
In contrast, the readiness probe is used to
determine whether the pod is ready to serve
requests.
A failing liveness probe causes pod restarts while
a failing readiness probe causes k8s to stop routing
traffic to the pod. Hence, a liveness probe should
be as robust as possible while a readiness probe
should be used to load balancing.
Ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Signed-off-by: Andreas Auernhammer <github@aead.dev>
This patch takes care of loading the bucket configs of failed buckets
during the periodic refresh. This makes sure the event notifiers and
remote bucket targets are properly initialized.
users might use MinIO on NFS, GPFS that provide dynamic
inodes and may not even have a concept of free inodes.
to allow users to use MinIO on top of GPFS relax the
free inode check.
* creating a byte buffer for SFTP file segments
* Adding an error condition for when there are
remaining segments in the queue
* Simplification of the queue using a map
it is possible that ILM or Deletes got triggered on batch
of objects that we are attempting to batch replicate, ignore
this scenario as valid behavior.
sendfile implementation to perform DMA on all platforms
Go stdlib already supports sendfile/splice implementations
for
- Linux
- Windows
- *BSD
- Solaris
Along with this change however O_DIRECT for reads() must be
removed as well since we need to use sendfile() implementation
The main reason to add O_DIRECT for reads was to reduce the
chances of page-cache causing OOMs for MinIO, however it would
seem that avoiding buffer copies from user-space to kernel space
this issue is not a problem anymore.
There is no Go based memory allocation required, and neither
the page-cache is referenced back to MinIO. This page-
cache reference is fully owned by kernel at this point, this
essentially should solve the problem of page-cache build up.
With this now we also support SG - when NIC supports Scatter/Gather
https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)
`monitorAndConnectEndpoints` will continue to attempt to reconnect offline disks.
Since disks were never closed, a `MarkOffline` would continue to try to check these disks forever.
Close previous disks.
replace io.Discard usage to fix NUMA copy() latencies
On NUMA systems copying from 8K buffer allocated via
io.Discard leads to large latency build-up for every
```
copy(new8kbuf, largebuf)
```
can in-cur upto 1ms worth of latencies on NUMA systems
due to memory sharding across NUMA nodes.
Fix various regressions from #18029
* If context is canceled the token is never returned. This will lead to scanner being unable to save and deadlocking.
* Fix backup not being able to get any data (hr empty)
* Reduce backup timeout.
Tiering statistics have been broken for some time now, a regression
was introduced in 6f2406b0b6
Bonus fixes an issue where the objects are not assumed to be
of the 'STANDARD' storage-class for the objects that have
not yet tiered, this should be conditional based on the object's
metadata not a default assumption.
This PR also does some cleanup in terms of implementation,
fixes#18070
https://github.com/minio/minio/pull/18307 partially removed the duplicate upload id check.
While I can't really see how ListDir can return duplicate entries, let's re-add it, since it is a cheap sanity check.
This commit changes the container base image
from ubi-minimal to ubi-micro.
The docker build process happens now in two stages.
The build stage:
- downloads the latest CA certificate bundle
- downloads MinIO binary (for requested version/os/arch)
- downloads MinIO binary signature and verifies it
using minisign
Then it creates an image based on ubi-micro with just
the minio binary was downloaded and verified during the
build stage.
The build stage is simplified to just verifying the
minisign signature.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
There can be rare situations where errors seen in bucket metadata
load on startup or subsequent metadata updates can result in missing
replication remotes.
Attempt a refresh of remote targets backed by a good replication config
lazily in 5 minute intervals if there ever occurs a situation where
remote targets go AWOL.
resync status may not be upto-date by
the time the resync is over due to how
the timer is triggered.
diff is sufficient to know if replication
happened or not.
`GetParityForSC` has a value receiver, so Config is copied before the lock is obtained.
Make it pointer receiver.
Fixes:
```
WARNING: DATA RACE
Read at 0x0000079cdd10 by goroutine 190:
github.com/minio/minio/cmd.(*erasureServerPools).BackendInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:579 +0x6f
github.com/minio/minio/cmd.(*erasureServerPools).LocalStorageInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:614 +0x3c6
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler()
github.com/minio/minio/cmd/peer-rest-server.go:347 +0x4ea
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler-fm()
...
WARNING: DATA RACE
Read at 0x0000079cdd10 by goroutine 190:
github.com/minio/minio/cmd.(*erasureServerPools).BackendInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:579 +0x6f
github.com/minio/minio/cmd.(*erasureServerPools).LocalStorageInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:614 +0x3c6
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler()
github.com/minio/minio/cmd/peer-rest-server.go:347 +0x4ea
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler-fm()
```
Since relaxing quorum the error across pools
for ListBuckets(), GetBucketInfo() we hit a
situation where loading IAM could potentially
return an error for second pool that server
is not initialized.
We need to handle this, let the pool come online
and retry transparently - this PR fixes that.
x-amz-signed-headers is meant for HTTP headers only
not for query params, using that to verify things
further can lead to failure.
The generated presigned URL with custom metadata
is already kosher (tamper proof).
fixes#18281
`resourceMetricsMap` has no protection against concurrent reads and writes.
Add a mutex and don't use maps from the last iteration.
Bug introduced in #18057Fixes#18271
globalDeploymentID was being read while it was being set.
Fixes race:
```
WARNING: DATA RACE
Write at 0x0000079605a0 by main goroutine:
github.com/minio/minio/cmd.connectLoadInitFormats()
github.com/minio/minio/cmd/prepare-storage.go:269 +0x14f0
github.com/minio/minio/cmd.waitForFormatErasure()
github.com/minio/minio/cmd/prepare-storage.go:294 +0x21d
...
Previous read at 0x0000079605a0 by goroutine 105:
github.com/minio/minio/cmd.newContext()
github.com/minio/minio/cmd/utils.go:817 +0x31e
github.com/minio/minio/cmd.adminMiddleware.func1()
github.com/minio/minio/cmd/admin-router.go:110 +0x96
net/http.HandlerFunc.ServeHTTP()
net/http/server.go:2136 +0x47
github.com/minio/minio/cmd.setBucketForwardingMiddleware.func1()
github.com/minio/minio/cmd/generic-handlers.go:460 +0xb1a
net/http.HandlerFunc.ServeHTTP()
net/http/server.go:2136 +0x47
...
```
currently the default for all drives is 512, which is a lot
for HDDs the recent testing has revealed moving this to 32
for HDDs seems like a fair value.
Introducing a new version of healthinfo struct for adding this info is
not correct. It needs to be implemented differently without adding a new
version.
This reverts commit 8737025d940f80360ed4b3686b332db5156f6659.
There is a fundamental race condition in `newErasureServerPools`, where setObjectLayer is
called before the poolMeta has been loaded/populated.
We add a placeholder value to this field but disable all saving of the value, so we don't risk
overwriting the value on disk. Once the value has been loaded or created, it is replaced with
the proper value, which will also be saved.
Also fixes various accesses of `poolMeta` that were done without locks.
We make the `poolMeta.IsSuspended` return false, even if we shouldn't risk out-of-bounds
reads anymore.
If target went offline while MinIO was down, error once
while trying to send message. If target goes offline during
MinIO server running, it already comes through ping() call
and errors out if target offline.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
if erasure upgrade is needed rely on the in-memory
values, instead of performing a "DiskInfo()" call.
https://brendangregg.com/blog/2016-09-03/sudden-disk-busy.html
for HDDs these are problematic, lets avoid this because
there is no value in "being" absolutely strict here
in terms of parity. We are okay to increase parity
as we see based on the in-memory online/offline ratio.
Several callers to putObjectTar may be fighting to set sc. Move the write out of the loop.
Use static resp, and request elements.
Fixes tests with -race:
```
WARNING: DATA RACE
Read at 0x00c01cd680e0 by goroutine 691354:
github.com/minio/minio/cmd.objectAPIHandlers.PutObjectExtractHandler.func1()
e:/gopath/src/github.com/minio/minio/cmd/object-handlers.go:2130 +0x149
github.com/minio/minio/cmd.untar.func1()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:250 +0x2b6
github.com/minio/minio/cmd.untar.func8()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:261 +0xa4
Previous write at 0x00c01cd680e0 by goroutine 691352:
github.com/minio/minio/cmd.objectAPIHandlers.PutObjectExtractHandler.func1()
e:/gopath/src/github.com/minio/minio/cmd/object-handlers.go:2131 +0x15d
github.com/minio/minio/cmd.untar.func1()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:250 +0x2b6
github.com/minio/minio/cmd.untar.func8()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:261 +0xa4
```
Calling unfreezeServices twice results in panic:
```
panic: "POST /minio/peer/v32/signalservice?signal=4&sub-sys=": close of nil channel
goroutine 14703 [running]:
runtime/debug.Stack()
runtime/debug/stack.go:24 +0x65
github.com/minio/minio/cmd.setCriticalErrorHandler.func1.1()
github.com/minio/minio/cmd/generic-handlers.go:549 +0x8e
panic({0x27c3020, 0x4c9b370})
runtime/panic.go:884 +0x212
github.com/minio/minio/cmd.unfreezeServices()
github.com/minio/minio/cmd/service.go:112 +0xc7
github.com/minio/minio/cmd.(*peerRESTServer).SignalServiceHandler(0x0?, {0x4cb6af0, 0xc010b96420}, 0xc01affab00)
github.com/minio/minio/cmd/peer-rest-server.go:837 +0x13a
net/http.HandlerFunc.ServeHTTP(...)
```
If the function was called a second time `val` would not be nil, but the returned channel `ch` would be, causing the panic.
Check the channel isn't nil and also use Swap for an atomic swap instead of 2 separate operations (though we are in a mutex).
Disk level O_DIRECT support checking at xl storage initialization was
conditional on a config setting being enabled. (This never took effect
because config initialization happens after ObjectLayer is ready.) This
is not necessary as the config setting is dynamic - O_DIRECT should be
enabled via runtime config. So we need to do the disk level support
check regardless of the config setting.
- Trace needs higher buffered channels than 4000 to ensure
when we run `mc admin trace -a` it captures all information
sufficiently.
- Listen event notification needs the event channel to be
`apiRequestsMaxPerNode` * number of nodes
Currently, the retry is not fully used when there is no backup copy of
the data usage; use 5 retry attempts when we don't have any valid data,
new or backup, unless we have seen an un-recognized error.
comment in the code provides more detailed explanation
on what this PR entails and its assumptions.
this PR reduces the amount of listing() by an order
of magnitude, however there are other such calls that
still needs further optimization that shall be done
in subsequent PRs.
Add a new endpoint for "resource" metrics `/v2/metrics/resource`
This should return system metrics related to drives, network, CPU and
memory. Except for drives, other metrics should have corresponding "avg"
and "max" values also.
Reuse the real-time feature to capture the required data,
introducing CPU and memory metrics in it.
Collect the data every minute and keep updating the average and max values
accordingly, returning the latest values when the API is called.
without this the rename2() can rename the previous dataDir
causing issues for different versions of the object, only
latest version is preserved due to this bug.
Added healing code to ensure recovery of such content.
not checking w.Close() can prematurely make us
think that the w.Write() actually succeeded, apparently
Write() may or may not return an error but sometimes
only during a Close() call to the fd we may see the
error from Write() propagate.
Fdatasync(w) on the FD would return an error requiring
Close() error handling is less of a concern, however it may
happen such that fdatasync() did not return an error, where
as Close() would.
Currently, setting a new tiering target returns an error when a bucket
is versioned and the tiering credentials does not have authorization to
specify a version-id when reading or removing a specific version;
Since tiering does not require versioning anymore; avoid doing versioned
operations when performing checklist ops while adding a new tiering
configuration.
Do not error out when a provided marker is before or after the prefix, but instead just ignore it if before and return an empty list when after.
Fixes#18093
Include object and versions heal scan times when checking non-empty abandoned folders.
Furthermore don't add delay between healing versions, instead do one per object wait.
This PR changes the StatObject() to be must have for non-minio source
to being a conditional API call.
- Calls StatObject() when needed
- Calls GetObjectTagging() when needed
These calls if we do without these conditionals can cause a lot of
delays, so we avoid them if not needed in more common scenario.
all retries must not be counted as failed messages,
a failed message is a single counter not for all
retries, this PR fixes this.
Also we do not need to retry 10-times, instead we should
retry at max 3 times with some jitter to deliver the
messages.
If MinIO started with KMS enabled, MINIO_KMS_KES_KEY_NAME should
be set for server to start.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
In a perf test, one node will run speed test with all nodes. If there is
an error with a peer node, the peer node name is not included in the
error hence confusing the user.
This commit will add the peer endpoint string to the netperf error.
To ensure that policy mappings are current for service accounts
belonging to (non-derived) STS accounts (like an LDAP user's service
account) we periodically reload such mappings.
This is primarily to handle a case where a policy mapping update
notification is missed by a minio node. Such a node would continue to
have the stale mapping in memory because STS creds/mappings were never
periodically scanned from storage.
- we already have MRF for most recent failures
- we trigger healing during HEAD/GET operation
These are enough, also change the default max wait
from 5sec to 1sec for default scanner speed.
AccountInfo is quite frequently called by the Console UI
login attempts, when many users are logging in it is important
that we provide them with better responsiveness.
- ListBuckets information is cached every second
- Bucket usage info is cached for up to 10 seconds
- Prefix usage (optional) info is cached for up to 10 secs
Failure to update after cache expiration, would still
allow login which would end up providing information
previously cached.
This allows for seamless responsiveness for the Console UI
logins, and overall responsiveness on a heavily loaded
system.
From the Go specification:
"3. If the map is nil, the number of iterations is 0." [1]
Therefore, an additional nil check for before the loop is unnecessary.
[1]: https://go.dev/ref/spec#For_range
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
- remove targetClient for passing around via replicationObjectInfo{}
- remove cloing to object info unnecessarily
- remove objectInfo from replicationObjectInfo{} (only require necessary fields)
When using a chain provider all providers do not return a valid
access and secret key, an anonymous request is sent, which makes it hard
for users to figure out what is going on
In the case of S3 tiering, when AWS IAM temporary account generation returns
an error, an anonymous login will be used because of the chain provider.
Avoid this and use the AWS IAM provider directly to get a good error
message.
This helps reduce disk operations as these periodic routines would not
run concurrently any more.
Also add expired STS purging periodic operation: Since we do not scan
the on-disk STS credentials (and instead only load them on-demand) a
separate routine is needed to purge expired credentials from storage.
Currently this runs about a quarter as often as IAM refresh.
Also fix a bug where with etcd, STS accounts could get loaded into the
iamUsersMap instead of the iamSTSAccountsMap.
This allows scanner to avoid lengthy scans, skip
things appropriately and also not lose metrics in
any manner.
reduce longer deadlines for usage-cache loads/saves
to match the disk timeout which is 2minutes now per
IOP.
In situations with large number of STS credentials on disk, IAM load
time is high. To mitigate this, STS accounts will now be loaded into
memory only on demand - i.e. when the credential is used.
In each IAM cache (re)load we skip loading STS credentials and STS
policy mappings into memory. Since STS accounts only expire and cannot
be deleted, there is no risk of invalid credentials being reused,
because credential validity is checked when it is used.
Currently we have IOPs of these patterns
```
[OS] os.Mkdir play.min.io:9000 /disk1 2.718µs
[OS] os.Mkdir play.min.io:9000 /disk1/data 2.406µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys 4.068µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp 2.843µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp/d89c8ceb-f8d1-4cc6-b483-280f87c4719f 20.152µs
```
It can be seen that we can save quite Nx levels such as
if your drive is mounted at `/disk1/minio` you can simply
skip sending an `Mkdir /disk1/` and `Mkdir /disk1/minio`.
Since they are expected to exist already, this PR adds a way
for us to ignore all paths upto the mount or a directory which
ever has been provided to MinIO setup.
Previously existing objects were queued to single worker and MRF re-queues
are also handled by same worker - this does not fully use the available
bandwidth in case there is no incoming workload.
Errors such as
```
returned an error (context deadline exceeded) (*fmt.wrapError)
```
```
(msgp: too few bytes left to read object) (*fmt.wrapError)
```
configs from 2020 server throws an
error due to deprecation of the keys
however an attempt is made to parse
them, we should have chosen existing
defaults - this PR fixes that.
Fix drive rotational calculation status
If a MinIO drive path is mounted to a partition and not a real disk,
getting the rotational status would fail because Linux does not expose
that status to partition; In other words,
/sys/block/drive-partition-name/queue/rotational does not exist;
To fix the issue, the code will search for the rotational status of the
disk that hosts the partition, and this can be calculated from the
real path of /sys/class/block/<drive-partition-name>
This change enables embedding files in ZIP with custom permissions.
Also uses default creds for starting MinIO based on inspect data.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
objects with 10,000 parts and many of them can
cause a large memory spike which can potentially
lead to OOM due to lack of GC.
with previous PR reducing the memory usage significantly
in #17963, this PR reduces this further by 80% under
repeated calls.
Scanner sub-system has no use for the slice of Parts(),
it is better left empty.
```
benchmark old ns/op new ns/op delta
BenchmarkToFileInfo/ToFileInfo-8 295658 188143 -36.36%
benchmark old allocs new allocs delta
BenchmarkToFileInfo/ToFileInfo-8 61 60 -1.64%
benchmark old bytes new bytes delta
BenchmarkToFileInfo/ToFileInfo-8 1097210 227255 -79.29%
```
- this PR avoids sending a large ChecksumInfo slice
when its not needed
- also for a file with XLV2 format there is no reason
to allocate Checksum slice while reading
Keys are helpful to ensure the strict ordering of messages, however currently the
code uses a random request id for every log, hence using the request-id
as a Kafka key is not serve any purpose;
This commit removes the usage of the key, to also fix the audit issue from
internal subsystem that does not have a request ID.
This PR adds new bucket replication graphs for better and granular
monitoring of bucket replication. Also arranged all replication graphs
together.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.
This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.
Add prometheus metric to track credential errors since uptime
There is some consistent confusion between the Community Helm Chart in this repo and the MinIO Kubernetes Operator Helm Chart.
This change seeks to clarify the differences between the two charts and which ones are community maintained vs MinIO maintained.
replicationTimestamp might differ if there were retries
in replication and the retried attempt overwrote in
quorum but enough shards with newer timestamp causing
the existing timestamps on xl.meta to be invalid, we
do not rely on this value for anything external.
this is purely a hint for debugging purposes, but there
is no real value in it considering the object itself
is in-tact we do not have to spend time healing this
situation.
we may consider healing this situation in future but
that needs to be decoupled to make sure that we do not
over calculate how much we have to heal.
.metacache objects are transient in nature, and are better left to
use page-cache effectively to avoid using more IOPs on the disks.
this allows for incoming calls to be not taxed heavily due to
multiple large batch listings.
given a versionId the mtime is always the same, it
can never be different than its original value.
versionIds also do not conflict, since they are uuid's
and unique practically forever.
we expect a certain level of IOPs and latency so this is okay.
fixes other miscellaneous bugs
- such as hanging on mrfCh <- when the context is canceled
- queuing MRF heal when the context is canceled
- remove unused saveStateCh channel
In distributed setup with a load balancer, randmoly any server
would report the metrics `minio_cluster_bucket_total` and
`minio_cluster_usage_object_total` and while graphing it, we should
take max of reported values.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
This commit updates the minio/kes-go dependency
to v0.2.0 and updates the existing code to work
with the new KES APIs.
The `SetPolicy` handler got removed since it
may not get implemented by KES at all and could
not have been used in the past since stateless KES
is read-only w.r.t. policies and identities.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Bonus fixes include
- do not have to write final xl.meta (renameData) does this
already, saves some IOPs.
- make sure to purge the multipart directory properly using
a recursive delete, otherwise this can easily pile up and
rely on the stale uploads cleanup.
fixes#17863
This reverts commit bf3901342c.
This is to fix a regression caused when there are inconsistent
versions, but one version is in quorum. SuccessorModTime issue
must be fixed differently.
batch status can perpetually wait after completion
due to a race between the MetricsHandler() returning
the active metrics in intervals of 1sec and delete
of metrics after job completion.
this PR ensures that we keep the 'status' around
for a while, i.e upto 24hrs for all the batch jobs.
Two fields in lifecycles made GOB encoding consistently fail with `gob: type lifecycle.Prefix has no exported fields`.
This meant that in distributed systems listings would never be able to continue and would restart on every call.
Fix issues and be sure to log these errors at least once per bucket. We may see some connectivity errors here, but we shouldn't hide them.
When listing getObjectFileInfo can return `io.EOF` if file is being written.
When we wrap the error it will *not* retry upstream, since `io.EOF` is a valid return value.
Allow one retry before returning errors and canceling the listing.
* optimize deletePrefix, use direct set location via object name
instead of fanning out the calls for an object force delete
we can assume the set location and not do fan-out calls
* Apply suggestions from code review
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
---------
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
Bonus:
- avoid calling DiskInfo() calls when missing blocks
instead heal the object using MRF operation.
- change the max_sleep to 250ms beyond that we will
not stop healing.
As all replication metrics are moved at bucket level, all replication
graphs as well are added under minio-bucket.json. Removing the independent
replication dashboard.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
ignoring valid objects with valid replication metadata
after the Prefix was disabled must still honor the older
metadata.
this can lead to unexpected results, allow it during
READ phase always.
// UnmarshalStrict is like Unmarshal except that any fields that are found
// in the data that do not have corresponding struct members, or mapping
// keys that are duplicates, will result in
// an error.
batch replication pull must preserve versionID regardless
of destination bucket versioning configuration.
This is similar to the issue with decommissioning and rebalancing
health checks were missing for drives replaced since
- HealFormat() would replace the drives without a health check
- disconnected drives when they reconnect via connectEndpoint()
the loop also loses health checks for local disks and merges
these into a single code.
- other than this separate cleanUp, health check variables to avoid
overloading them with similar requirements.
- also ensure that we compete via context selector for disk monitoring
such that the canceled disks don't linger around longer waiting for
the ticker to trigger.
- allow disabling active monitoring.
```
minio[1032735]: panic: label value "\xc0.\xc0." is not valid UTF-8
minio[1032735]: goroutine 1781101 [running]:
minio[1032735]: github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
```
log such errors for investigation
Send() is synchronous and can affect the latency of S3 requests when the
logger buffer is full.
Avoid checking if the HTTP target is online or not and increase the
workers anyway since the buffer is already full.
Also, avoid logs flooding when the audit target is down.
Limit large uploads (> 128MiB) to a max of 10 workers, intent is to avoid
larger uploads from using all replication bandwidth, giving room for smaller
uploads to sync faster.
slower drives get knocked off because they are too slow via
active monitoring, we do not need to block calls arbitrarily.
Serializing adds latencies for already slow calls, remove
it for SSDs/NVMEs
Also, add a selection with context when writing to `out <-`
channel, to avoid any potential blocks.
Revert "don't error when asked for 0-based range on empty objects (#17708)"
This reverts commit 7e76d66184.
There is no valid way to specify offsets in a 0-byte file. Blame it on the [RFC](https://datatracker.ietf.org/doc/html/rfc7233#section-4.4)
> The 416 (Range Not Satisfiable) status code indicates that none of the ranges in the
> request's Range header field (Section 3.1) overlap the current extent of the selected resource...
A request for "bytes=0-" is a request for the first byte of a resource. If the resource is 0-length,
the range [0,0] does not overlap the resource content and the server responds with an error.
In a reverse proxying setup, a proxy in front of MinIO may attempt to
request objects in slices for enhanced cache efficiency. Since such a
a proxy cannot have prior knowledge of how large a requested resource is,
it usually sends a header of the form:
Range: 0-$slice_size
... and, depending on the size of the resource, expects either:
- an empty response, if $resource_size == 0
- a full response, if $resource_size <= $slice_size
- a partial response, if $resource_size > $slice_size
Prior to this change, MinIO would respond 416 Range Not Satisfiable if a
client tried to request a range on an empty resource. This behavior is
technically consistent with RFC9110[1] – However, it renders sliced
reverse proxying, such as implemented in Nginx, broken in the case of
empty files. Nginx itself seems to break this convention to enable
"useful" responses in these cases, and MinIO should probably do that
too.
[1]: https://www.rfc-editor.org/rfc/rfc9110#byte.ranges
sending whitespace character with CompleteMultipartUpload()
with 200 OK was an AWS S3 compatible implementation detail,
and it was expected that the client SDK must look for both
successful XML as well as error XML for 200 OK.
But this is not useful anymore on MinIO, since we do not
have any large delayed coalescing of parts anymore.
users/customers do not have a reasonable number of buckets anymore,
this is why we must avoid overpopulating cluster endpoints, instead
move the bucket monitoring to a separate endpoint.
some of it's a breaking change here for a couple of metrics, but
it is imperative that we do it to improve the responsiveness of
our Prometheus cluster endpoint.
Bonus: Added new cluster metrics for usage, objects and histograms
Using this script, post decrypt we should be able to bring up the
MinIO instance with same configuration.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Sometimes IAM fails to load certain items, which could be a user,
a service account or a policy but with not enough information for
us to debug.
This commit will create a more descriptive error to make it easier to
debug in such situations.
mc admin trace -a will be able to quickly show
401 Unauthorized header to pinpoint trivial issues
between nodes, such as wrong root
credentials and skewed time.
objects/versions that are not expired via NewerNoncurrentVersions
must be properly returned to be applied under further ILM actions.
this would cause legitimately expired objects to be missed
from expiration.
this randomness is needed to avoid scanning
the same buckets across different erasure sets,
in the same order.
allow random buckets to be scanned instead
allowing a wider spread of ILM, replication
checks.
Additionally do not loop over twice to fill
the channel, fill the channel regardless of
having bucket new or old.
A new middleware function is added for admin handlers, including options
for modifying certain behaviors. This admin middleware:
- sets the handler context via reflection in the request and sends AuditLog
- checks for object API availability (skipping it if a flag is passed)
- enables gzip compression (skipping it if a flag is passed)
- enables header tracing (adding body tracing if a flag is passed)
While the new function is a middleware, due to the flags used for
conditional behavior modification, which is used in each route registration
call.
To try to ensure that no regressions are introduced, the following
changes were done mechanically mostly with `sed` and regexp:
- Remove defer logger.AuditLog in admin handlers
- Replace newContext() calls with r.Context()
- Update admin routes registration calls
Bonus: remove unused NetSpeedtestHandler
Since the new adminMiddleware function checks for object layer presence
by default, we need to pass the `noObjLayerFlag` explicitly to admin
handlers that should work even when it is not available. The following
admin handlers do not require it:
- ServerInfoHandler
- StartProfilingHandler
- DownloadProfilingHandler
- ProfileHandler
- SiteReplicationDevNull
- SiteReplicationNetPerf
- TraceHandler
For these handlers adminMiddleware does not check for the object layer
presence (disabled by passing the `noObjLayerFlag`), and for all other
handlers, the pre-check ensures that the handler is not called when the
object layer is not available - the client would get a
ErrServerNotInitialized and can retry later.
This `noObjLayerFlag` is added based on existing behavior for these
handlers only.
Add check every 2 minutes to see if a write+read operation can complete.
If disk is unresponsive for 2 minutes or returns errFaultyDisk, take it offline.
Simplify MRF queueing and add backlog handler
- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.
- Change MRF to have each node process only its MRF entries.
- Collect MRF backlog by the node to allow for current backlog visibility
Now it would list details of all KMS instances with additional
attributes `endpoint` and `version`. In the case of k8s-based
deployment the list would consist of a single entry.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
This would better to record the correct API name so that
any verification around audit logs to figure out if required
APIs are called required no of times, would be correct.
Here in this case of policy attached, API `AttachDetachPolicyBuiltin`
would be called with `requestPath` as `/minio/admin/v3/idp/builtin/policy/attach`
and in case of detach policy the value would be `/minio/admin/v3/idp/builtin/policy/detach`
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Also shutdown poll add jitter, to verify if the shutdown
sequence can finish before 500ms, this reduces the overall
time taken during "restart" of the service.
Provides speedup for `mc admin service restart` during
active I/O, also ensures that systemd doesn't treat the
returned 'error' as a failure, certain configurations in
systemd can cause it to 'auto-restart' the process by-itself
which can interfere with `mc admin service restart`.
It can be observed how now restarting the service is
much snappier.
on unversioned buckets its possible that 0-byte objects
might lose quorum on flaky systems, allow them to be same
as DELETE markers. Since practically speak they have no
content.
Optimize DeleteObject API to avoid extra
GetObjectInfo call on the replicating side.
For receiving side, it is just a regular
DeleteObject call.
Bonus: Fix a corner case where version purged is
absent on target (either due to replication not yet
complete or target version already deleted in a
one-way replication or when replication was disabled).
In such cases, mark version purge complete.
Since `addCustomerHeaders` middleware was after the `httpTracer`
middleware, the request ID was not set in the http tracing context. By
reordering these middleware functions, the request ID header becomes
available. We also avoid setting the tracing context key again in
`newContext`.
Bonus: All middleware functions are renamed with a "Middleware" suffix
to avoid confusion with http Handler functions.
* Reduce allocations
* Add stringsHasPrefixFold which can compare string prefixes, while ignoring case and not allocating.
* Reuse all msgp.Readers
* Reuse metadata buffers when not reading data.
* Make type safe. Make buffer 4K instead of 8.
* Unslice
DNS refresh() in-case of MinIO can safely re-use
the previous values on bare-metal setups, since
bare-metal arrangements do not change DNS in any
manner commonly.
This PR simplifies that, we only ever need DNS caching
on bare-metal setups.
- On containerized setups do not enable DNS
caching at all, as it may have adverse effects on
the overall effectiveness of k8s DNS systems.
k8s DNS systems are dynamic and expect applications
to avoid managing DNS caching themselves, instead
provide a cleaner container native caching
implementations that must be used.
- update IsDocker() detection, including podman runtime
- move to minio/dnscache fork for a simpler package
Following extension allows users to specify immediate purge of
all versions as soon as the latest version of this object has
expired.
```
<LifecycleConfiguration>
<Rule>
<ID>ClassADocRule</ID>
<Filter>
<Prefix>classA/</Prefix>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Days>3650</Days>
<ExpiredObjectAllVersions>true</ExpiredObjectAllVersions>
</Expiration>
</Rule>
...
```
- look for requested encryption while compressing
not just via HTTP Headers, but also via multipart
metadata
- look for SSE-S3 etag decryption not just via HTTP
Headers, but also via multipart metadata
fixes#17519
current decommission traces were missing for
- Skipped ILM expired versions
- Skipped single DELETE marked version
- A success or failure in decommissioning DELETE marker
- allow additional info to be shared in DecomStatus() API
there is a possibility that slow drives can actually add latency
to the overall call, leading to a large spike in latency.
this can happen if there are other parallel listObjects()
calls to the same drive, in-turn causing each other to sort
of serialize.
this potentially improves performance and makes PutObject()
also non-blocking.
This change adds a `Secret` property to `HelpKV` to identify secrets
like passwords and auth tokens that should not be revealed by the server
in its configuration fetching APIs. Configuration reporting APIs now do
not return secrets.
Will combine or write partial data of each version found in the inspect data.
Example:
```
> xl-meta -export -combine inspect-data.1228fb52.zip
(... metadata json...)
}
Attempting to combine version "994f1113-da94-4be1-8551-9dbc54b204bc".
Read shard 1 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-01-of-13.data)
Read shard 2 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-02-of-13.data)
Read shard 3 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-03-of-13.data)
Read shard 4 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-04-of-13.data)
Read shard 6 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-06-of-13.data)
Read shard 7 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-07-of-13.data)
Read shard 8 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-08-of-13.data)
Read shard 9 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-09-of-13.data)
Read shard 10 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-10-of-13.data)
Read shard 11 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-11-of-13.data)
Read shard 13 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-13-of-13.data)
Attempting to reconstruct using parity sets:
* Setup: Data shards: 9 - Parity blocks: 6
Have 6 complete remapped data shards and 6 complete parity shards. Could NOT reconstruct: too few shards given
* Setup: Data shards: 8 - Parity blocks: 5
Have 5 complete remapped data shards and 5 complete parity shards. Could reconstruct completely
0 bytes missing. Truncating 0 from the end.
Wrote output to 994f1113-da94-4be1-8551-9dbc54b204bc.complete
```
So far only inline data, but no real reason that external data can't also be included with some handling of blocks.
Supports only unencrypted data.
For policy attach/detach API to work correctly the server should hold a
lock before reading existing policy mapping and until after writing the
updated policy mapping. This is fixed in this change.
A site replication bug, where LDAP policy attach/detach were not
correctly propagated is also fixed in this change.
Bonus: Additionally, the server responds with the actual (or net)
changes performed in the attach/detach API call. For e.g. if a user
already has policy A applied, and a call to attach policies A and B is
performed, the server will respond that B was attached successfully.
A continuation of PR #17479 for rebalance behavior must
also match the decommission behavior.
Fixes bug where rebalance would ignore rebalancing object
versions after one of the version returned "ObjectNotFound"
while decommissioning it can so happen that the non-current
versions are all expired but there is a DEL marker as the
latest version.
For such objects, we should not decommission them instead
calculate the remaining versions and if the remaining versions
is one and that version is a DEL marker consider such
an object not to be scheduled for decommissioning.
With the current asynchronous behaviour in sending notification events
to the targets, we can't provide guaranteed delivery as the systems
might go for restarts.
For such event-driven use-cases, we can provide an option to enable
synchronous events where the APIs wait until the event is successfully
sent or persisted.
This commit adds 'MINIO_API_SYNC_EVENTS' env which when set to 'on'
will enable sending/persisting events to targets synchronously.
A state is updated with a delete marker, which does not have parity or
data blocks defined, which can cause the integer divide by zero panics.
This commit fixes to avoid panics.
on "unversioned" buckets there are situations
when successive concurrent I/O can lead to
an inconsistent state() with mtime while the
etag might be the same for the object on disk.
in such a scenario it is possible for us to
allow reading of the object since etag matches
and if etag matches we are guaranteed that we
have enough copies the object will be readable
and same.
This PR allows fallback in such scenarios.
This PR also returns the replication status in
proxy calls and defers replication attempt if
HEAD on object version returned a error different
from NoSuchKey
A specific node should do the decommissioning task, however routing the
start decommissioning to that node was not working properly.
Co-authored-by: Anis Elleuch <anis@min.io>
fixes an issue under bucket replication could cause
ETags for replicated SSE-S3 single part PUT objects,
to fail as we would attempt a decryption while listing,
or stat() operation.
- lifecycle must return InvalidArgument for rule errors
- do not return `null` versionId in HTTP header
- reject mixed SSE uploads with correct error message
- getObjectTagging to be allowed for anonymous policies
- return correct errors for invalid retention period
- return sorted list of tags for an object
- putObjectTagging must return 200 OK not 204 OK
- return 409 ErrObjectLockConfigurationNotAllowed for existing buckets
PUT calls cannot afford to have large latency build-ups due
to contentious usage.json, or worse letting them fail with
some unexpected error, this can happen when this file is
concurrently being updated via scanner or it is being
healed during a disk replacement heal.
However, these are fairly quick in theory, stressed clusters
can quickly show visible latency this can add up leading to
invalid errors returned during PUT.
It is perhaps okay for us to relax this error return requirement
instead, make sure that we log that we are proceeding to take in
the requests while the quota is using an older value for the quota
enforcement. These things will reconcile themselves eventually,
via scanner making sure to overwrite the usage.json.
Bonus: make sure that storage-rest-client sets ExpectTimeouts to
be 'true', such that DiskInfo() call with contextTimeout does
not prematurely disconnect the servers leading to a longer
healthCheck, back-off routine. This can easily pile up while also
causing active callers to disconnect, leading to quorum loss.
DiskInfo is actively used in the PUT, Multipart call path for
upgrading parity when disks are down, it in-turn shouldn't cause
more disks to go down.
Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.
Also removes a tiny bit of CPU by each write operation.
Global leader lock was first designated to only acquired once
until the node is killed. However, currently, the code acquires
it repeatedly during the lifetime of the server, now there is a
goroutine leak.
if the certs are the same in an environment where the
cert files are symlinks (e.g Kubernetes), then we resort
to reloading certs every 15mins - we can avoid reload
of the kes client instance. Ensure that the price to pay
for contending with the lock must happen when necessary.
remote error is not required to be passed back to the
client - this is mostly because we have healing that should
eventually, catch up on this and heal the bucket.
Ensure delete marker replication success, especially since the
recent optimizations to heal on HEAD, LIST and GET can force
replication attempts on delete marker before underlying object
version could have synced.
Move to using `xl.meta` data structure to keep temporary partInfo,
this allows for a future change where we move to different parts to
different drives.
PUT shall only proceed if pre-conditions are met, the new
code uses
- x-minio-source-mtime
- x-minio-source-etag
to verify if the object indeed needs to be replicated
or not, allowing us to avoid StatObject() call.
When limiting listing do not count delete, since they may be discarded.
Extend limit, since we may be discarding the forward-to marker.
Fix directories always being sent to resolve, since they didn't return as match.
On occasion this test fails:
```
2022-09-12T17:22:44.6562737Z === RUN TestGetObjectWithOutdatedDisks
2022-09-12T17:22:44.6563751Z erasure-object_test.go:1214: Test 2: Expected data to have md5sum = `c946b71bb69c07daf25470742c967e7c`, found `7d16d23f07072af1a809707ba101ae07`
2
```
Theory: Both objects are written with the same timestamp due to lower timer resolution on Windows. This results in secondary resolution, which is deterministic, but random.
Solution: Instead of hacking in a wait we request the specific version we want. Should still keep the test relevant.
Bonus: Remote action dependency for vulncheck
If replication config could not be read from bucket metadata for some
reason, issue a panic so that unexpected replication outcomes can
be avoided for replicated buckets.
For similar reasons, adding a panic while fetching object-lock config
if it failed for reason other than non-existence of config.
minio_inter_node_traffic_errors_total currently does not track
requests body write/read errors of internode REST communications.
This commit fixes this by wrapping resp.Body.
to avoid relying on scanner-calculated replication metrics.
This will improve the accuracy of the replication stats reported.
This PR also adds on to #15556 by handing replication
traffic that could not be queued by available workers to the
MRF queue so that entries in `PENDING` status are healed faster.
500k is a reasonable limit for any single MinIO
cluster deployment, in future we may increase this
value.
However for now we are going to keep this limit.
When healing is parallelized by setting the ` _MINIO_HEAL_WORKERS`
environment variable, multiple goroutines may race while updating the disk's
healing tracker. This change serializes only these concurrent updates using a
channel. Note, the healing tracker is still not concurrency safe in other contexts.
This PR is a continuation of the previous change instead
of returning an error, instead trigger a spot heal on the
'xl.meta' and return only after the healing is complete.
This allows for future GETs on the same resource to be
consistent for any version of the object.
xl.meta gets written and never rolled back, however
we definitely need to validate the state that is
persisted on the disk, if there are inconsistencies
- more than write quorum we should return an error
to the client
- if write quorum was achieved however there are
inconsistent xl.meta's we should simply trigger
an MRF on them
The `clusterInfo` struct in admin-handlers is same as
madmin.ClusterRegistrationInfo, except for small differences in field
names.
Removing this and using madmin.ClusterRegistrationInfo in its place will
help in following ways:
- The JSON payload generated by mc in case of cluster registration will
be consistent (same keys) with cluster.info generated by minio as part
of the profile and inspect zip
- health-analyzer can parse the cluster.info using the same struct and
won't have to define it's own
Currently, there is a short time window where the code is allowed
to save the status of a replication resync. Currently, the window is
`now.Sub(st.EndTime) <= resyncTimeInterval`. Also, any failure to
write in the backend disks is not retried.
Refactor the code a little bit to rely on the last timestamp of a
successful write of the resync status of any given bucket in the
backend disks.
When replication is enabled in a particular bucket, the listing will send
objects to bucket replication, but it is also sending prefixes for non
recursive listing which is useless and shows a lot of error logs.
This commit will ignore prefixes.
under some sequence of events following code would
reach an infinite loop.
```
idx1, idx2 := 0, 1
for ; idx2 != idx1; idx2++ {
fmt.Println(idx2)
}
```
fixes#15639
A lot of warning messages are printed in CI/CD failures generated by go
test. Avoid that by requiring at least Error level for logging when
doing go test.
inlined data often is bigger than the allowed
O_DIRECT alignment, so potentially we can write
'xl.meta' without O_DSYNC instead we can rely on
O_DIRECT + fdatasync() instead.
This PR allows O_DIRECT on inlined data that
would gain the benefits of performing O_DIRECT,
eventually performing an fdatasync() at the end.
Performance boost can be observed here for small
objects < 128KiB. The performance boost is mainly
seen on HDD, and marginal on NVMe setups.
* add a new line to the end of the credentials file when creating a user
* add extra volumes and mounts option into helm chart
* add extra volumes and extra volume mounts option for job resources
When a node finds a change in the other replication cluster and applies
to itself will already notify other peers. No need for all nodes in a
given cluster to do site replication healing, only one node is
sufficient.
This PR improves the replication failure healing by persisting
most recent failures to disk and re-queuing them until the replication
is successful.
While this does not eliminate the need for healing during a full scan,
queuing MRF vastly improves the ETA to keeping replicated buckets
in sync as it does not wait for the scanner visit to detect unreplicated
object versions.
competing calls on the same object on versioned bucket
mutating calls on the same object may unexpected have
higher delays.
This can be reproduced with a replicated bucket
overwriting the same object writes, deletes repeatedly.
For longer locks like scanner keep the 1sec interval
This PR fixes possible leaks that may emanate from not
listening on context cancelation or timeouts.
```
goroutine 60957610 [chan send, 16 minutes]:
github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1.1.1(...)
github.com/minio/minio/cmd/erasure-server-pool.go:1724 +0x368
github.com/minio/minio/cmd.listPathRaw({0x4a9a740, 0xc0666dffc0},...
github.com/minio/minio/cmd/metacache-set.go:1022 +0xfc4
github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1.1()
github.com/minio/minio/cmd/erasure-server-pool.go:1764 +0x528
created by github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1
github.com/minio/minio/cmd/erasure-server-pool.go:1697 +0x1b7
```
The bottom line is delete markers are a nuisance,
most applications are not version aware and this
has simply complicated the version management.
AWS S3 gave an unnecessary complication overhead
for customers, they need to now manage these
markers by applying ILM settings and clean
them up on a regular basis.
To make matters worse all these delete markers
get replicated as well in a replicated setup,
requiring two ILM settings on each site.
This PR is an attempt to address this inferior
implementation by deviating MinIO towards an
idempotent delete marker implementation i.e
MinIO will never create any more than single
consecutive delete markers.
This significantly reduces operational overhead
by making versioning more useful for real data.
This is an S3 spec deviation for pragmatic reasons.
Queue failed/pending replication for healing during listing and GET/HEAD
API calls. This includes healing of existing objects that were never
replicated or those in the middle of a resync operation.
This PR also fixes a bug in ListObjectVersions where lifecycle filtering
should be done.
when object speedtest is running keep writing
previous speedtest result back to client until
we have a new result - this avoids sending back
blank entries in between the speedtest when it
is running in 'autotune' mode.
```
commit 7bdaf9bc50
Author: Aditya Manthramurthy <donatello@users.noreply.github.com>
Date: Wed Jul 24 17:34:23 2019 -0700
Update on-disk storage format for users system (#7949)
```
Bonus: fixes a bug when etcd keys were being re-encrypted.
Currently, the code doesn't check if the user creating a bucket with
locking feature has bucket locking and versioning permissions enabled,
adding it in accordance with S3 spec.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateBucket.html
Object Lock - If ObjectLockEnabledForBucket is set to true in your CreateBucket request,
s3:PutBucketObjectLockConfiguration and s3:PutBucketVersioning permissions are required.
Capture average, p50, p99, p999 response times
and ttfb values. These are needed for latency
measurements and overall understanding of our
speedtest results.
listConfigItems creates a goroutine but sometimes callers will
exit without properly asking listAllIAMConfigItems() to stop sending
results, hence a goroutine leak.
Create a new context and cancel it for each listAllIAMConfigItems
call.
This commit adds support for automatically reloading
the MinIO client certificate for authentication to KES.
The client certificate will now be reloaded:
- when the private key / certificate file changes
- when a SIGHUP signal is received
- every 15 minutes
Fixes#14869
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
The path is marked dirty automatically when healObject() is called, which is
wrong. HealObject() is called during self-healing and this will lead to
an increase in the false positive result of the bloom filter.
Also move NSUpdated() from renameData() and call it directly in
CompleteMultipart and PutObject, this is not a functional change but
it will make it less prone to errors in the future.
mc admin config reset <alias> notify_webhook:something was not working
properly.
The reason is that GetSubSys() was not calculating the target
name properly because it is quitting early when the number of config
inputs ('notify_webhook:something' in this case) is equal to 1.
This commit will make the code calculates always calculate the target
name if found.
There is a known rare issue in the current version 1.30.0 described here
https://github.com/Shopify/sarama/issues/2241.
Update the library to 1.35.0
Bonus: update shirou/gopsutil v3.22.5 to v3.22.6 to fix a compilation
error for OpenBSD
smaller setups may have less drives per server choosing
the concurrency based on number of local drives, and let
the MinIO server change the overall concurrency as
necessary.
It is possible for anyone with admin access to relatively
to get any content of any random OS location by simply
providing the file with 'mc admin update alias/ /etc/passwd`.
Workaround is to disable 'admin:ServiceUpdate' action. Everyone
is advised to upgrade to this patch.
Thanks to @alevsk for finding this bug.
this has been observed in multiple environments
where the setups are small `speedtest` naturally
fails with default '10s' and the concurrency
of '32' is big for such clusters.
choose a smaller value i.e equal to number of
drives in such clusters and let 'autotune'
increase the concurrency instead.
fixes#15334
- re-use net/url parsed value for http.Request{}
- remove gosimple, structcheck and unusued due to https://github.com/golangci/golangci-lint/issues/2649
- unwrapErrs upto leafErr to ensure that we store exactly the correct errors
"consoleAdmin" was used as the policy for root derived accounts, but this
lead to unexpected bugs when an administrator modified the consoleAdmin
policy
This change avoids evaluating a policy for root derived accounts as by
default no policy is mapped to the root user. If a session policy is
attached to a root derived account, it will be evaluated as expected.
This PR changes the handling of bucket deletes for site
replicated setups to hold on to deleted bucket state until
it syncs to all the clusters participating in site replication.
Currently, if one server in a distributed setup fails to upgrade
due to any reasons, it is not possible to upgrade again unless
nodes are restarted.
To fix this, split the upgrade process into two steps :
- download the new binary on all servers
- If successful, overwrite the old binary with the new one
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Add cluster info to inspect and profiling archive.
In addition to the existing data generation for both inspect and profiling,
cluster.info file is added. This latter contains some info of the cluster.
The generation of cluster.info is is done as the last step and it can fail
if it exceed 10 seconds.
a/b/c/d/ where `a/b/c/` exists results in additional syscalls
such as an Lstat() call to verify if the `a/b/c/` exists
and its a directory.
We do not need to do this on MinIO since the parent prefixes
if exist, we can simply return success without spending
additional syscalls.
Also this implementation attempts to simply use Access() calls
to avoid os.Stat() calls since the latter does memory allocation
for things we do not need to use.
Access() is simpler since we have a predictable structure on
the backend and we know exactly how our path structures are.
A huge number of goroutines would build up from various monitors
When creating test filesystems provide a context so they can shut down when no longer needed.
Do completely independent multipart uploads.
In distributed mode, a lock was held to merge each multipart
upload as it was added. This lock was highly contested and
retries are expensive (timewise) in distributed mode.
Instead, each part adds its metadata information uniquely.
This eliminates the per object lock required for each to merge.
The metadata is read back and merged by "CompleteMultipartUpload"
without locks when constructing final object.
Co-authored-by: Harshavardhana <harsha@minio.io>
This commit adds a `context.Context` to the
the KMS `{Stat, CreateKey, GenerateKey}` API
calls.
The context will be used to terminate external calls
as soon as the client requests gets canceled.
A follow-up PR will add a `context.Context` to
the remaining `DecryptKey` API call.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Uploading a part object can leave an inconsistent state inside
.minio.sys/multipart where data are uploaded but xl.meta is not
committed yet.
Do not list upload-ids that have this state in the multipart listing.
The default replica value is 16 (right now) which can lead to massive
resource consumption on one node in smaller clusters. The idea for this
addition is to allow users to specify how the pods (replicas) are being
spread across the cluster. It gives more control over this Helm Release
in smaller clusters where most worker nodes have taints.
As this Kubernetes feature exists since Kubernetes 1.19 and is only
useful for a replica count > 1, this was taken into account.
Since this is a MinIO specific extension in the replication config,
default this to Disabled to allow other sdks to be used to configure
replication rules.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Use losetup to create fake disks, start a MinIO cluster, umount
one disk, and fails if the mount point directory will have format.json
recreated. It should fail because the mount point directory will belong
to the root disk after unmount.
Add up to 256 bytes of padding for compressed+encrypted files.
This will obscure the obvious cases of extremely compressible content
and leave a similar output size for a very wide variety of inputs.
This does *not* mean the compression ratio doesn't leak information
about the content, but the outcome space is much smaller,
so often *less* information is leaked.
Make bucket requests sent after decommissioning is started are not
created in a suspended pool. Therefore listing buckets should avoid
suspended pools as well.
Rename Trigger -> Event to be a more appropriate
name for the audit event.
Bonus: fixes a bug in AddMRFWorker() it did not
cancel the waitgroup, leading to waitgroup leaks.
There is no point in compressing very small files.
Typically the effective size on disk will be the same due to disk blocks.
So don't waste resources on extremely small files.
We don't check on multipart. 1) because we don't know and 2) this is very likely a big object anyway.
This commit adds a minimal set of KMS-related metrics:
```
# HELP minio_cluster_kms_online Reports whether the KMS is online (1) or offline (0)
# TYPE minio_cluster_kms_online gauge
minio_cluster_kms_online{server="127.0.0.1:9000"} 1
# HELP minio_cluster_kms_request_error Number of KMS requests that failed with a well-defined error
# TYPE minio_cluster_kms_request_error counter
minio_cluster_kms_request_error{server="127.0.0.1:9000"} 16790
# HELP minio_cluster_kms_request_success Number of KMS requests that succeeded
# TYPE minio_cluster_kms_request_success counter
minio_cluster_kms_request_success{server="127.0.0.1:9000"} 348031
```
Currently, we report whether the KMS is available and how many requests
succeeded/failed. However, KES exposes much more metrics that can be
exposed if necessary. See: https://pkg.go.dev/github.com/minio/kes#Metric
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
If more than 1M folders (objects or prefixes) are found at the top level in a bucket allow it to be compacted.
While very suboptimal structure we should limit memory usage at some point.
GetDiskInfo() uses timedValue to cache the disk info for one second.
timedValue behavior was recently changed to return an old cached value
when calculating a new value returns an error.
When a mount point is empty, GetDiskInfo() will return errUnformattedDisk,
timedValue will return cached disk info with unexpected IsRootDisk value,
e.g. false if the mount point belongs to a root disk. Therefore, the mount
point will be considered a valid disk and will be formatted as well.
This commit will also add more defensive code when marking root disks:
always mark a disk offline for any GetDiskInfo() error except
errUnformattedDisk. The server will try anyway to reconnect to those
disks every 10 seconds.
it is not safe to pass around sync.Map
through pointers, as it may be concurrently
updated by different callers.
this PR simplifies by avoiding sync.Map
altogether, we do not need sync.Map
to keep object->erasureMap association.
This PR fixes a crash when concurrently
using this value when audit logs are
configured.
```
fatal error: concurrent map iteration and map write
goroutine 247651580 [running]:
runtime.throw({0x277a6c1?, 0xc002381400?})
runtime/panic.go:992 +0x71 fp=0xc004d29b20 sp=0xc004d29af0 pc=0x438671
runtime.mapiternext(0xc0d6e87f18?)
runtime/map.go:871 +0x4eb fp=0xc004d29b90 sp=0xc004d29b20 pc=0x41002b
```
The current code uses approximation using a ratio. The approximation
can skew if we have multiple pools with different disk capacities.
Replace the algorithm with a simpler one which counts data
disks and ignore parity disks.
fix: allow certain mutation on objects during decommission
currently by mistake deletion of objects was skipped,
if the object resided on the pool being decommissioned.
delete's are okay to be allowed since decommission is
designed to run on a cluster with active I/O.
Small uploads spend a significant amount of time (~5%) fetching disk info metrics. Also maps are allocated for each call.
Add a 100ms cache to disk metrics.
versioned buckets were not creating the delete markers
present in the versioned stack of an object, this essentially
would stop decommission to succeed.
This PR fixes creating such delete markers properly during
a decommissioning process, adds tests as well.
Current code incorrectly passed the
config asset object name while decommissioning,
make sure that we pass the right object name
to be hashed on the newer set of pools.
This PR fixes situations after a successful
decommission, the users and policies might go
missing due to wrong hashed set.
also use designated names for internal
calls
- storageREST calls are storageR
- lockREST calls are lockR
- peerREST calls are just peer
Named in this fashion to facilitate wildcard matches
by having prefixes of the same name.
Additionally, also enable funcNames for generic handlers
that return errors, currently we disable '<unknown>'
In a replicated setup, when an object is updated in one cluster but
still waiting to be replicated to the other cluster, GET requests with
if-match, and range headers will likely fail. It is better to proxy
requests instead.
Also, this commit avoids printing verbose logs about precondition &
range errors.
reedsolomon/cpuid would take a long time to start up on Xen VMs with
AMD processors due to a bug in the VM CPUID implementation.
Compression upgraded for better speed/compression.
fix: change timedvalue to return previous cached value
caller can interpret the underlying error and decide
accordingly, places where we do not interpret the
errors upon timedValue.Get() - we should simply use
the previously cached value instead of returning "empty".
Bonus: remove some unused code
Add a generic handler that adds a new tracing context to the request if
tracing is enabled. Other handlers are free to modify the tracing
context to update information on the fly, such as, func name, enable
body logging etc..
With this commit, requests like this
```
curl -H "Host: ::1:3000" http://localhost:9000/
```
will be traced as well.
Directories markers are not healed when healing a new fresh disk. A
a proper fix would be moving object names encoding/decoding to erasure
object level but it is too late now since the object to set distribution is
calculated at a higher level.
It is observed in a local 8 drive system the CPU seems to be
bottlenecked at
```
(pprof) top
Showing nodes accounting for 1385.31s, 88.47% of 1565.88s total
Dropped 1304 nodes (cum <= 7.83s)
Showing top 10 nodes out of 159
flat flat% sum% cum cum%
724s 46.24% 46.24% 724s 46.24% crypto/sha256.block
219.04s 13.99% 60.22% 226.63s 14.47% syscall.Syscall
158.04s 10.09% 70.32% 158.04s 10.09% runtime.memmove
127.58s 8.15% 78.46% 127.58s 8.15% crypto/md5.block
58.67s 3.75% 82.21% 58.67s 3.75% github.com/minio/highwayhash.updateAVX2
40.07s 2.56% 84.77% 40.07s 2.56% runtime.epollwait
33.76s 2.16% 86.93% 33.76s 2.16% github.com/klauspost/reedsolomon._galMulAVX512Parallel84
8.88s 0.57% 87.49% 11.56s 0.74% runtime.step
7.84s 0.5% 87.99% 7.84s 0.5% runtime.memclrNoHeapPointers
7.43s 0.47% 88.47% 22.18s 1.42% runtime.pcvalue
```
Bonus changes:
- re-use transport for bucket replication clients, also site replication clients.
- use 32KiB buffer for all read and writes at transport layer seems to help
TLS read connections.
- Do not have 'MaxConnsPerHost' this is problematic to be used with net/http
connection pooling 'MaxIdleConnsPerHost' is enough.
This commit fixes the order of elliptic curves.
As documented by https://pkg.go.dev/crypto/tls#Config
```
// CurvePreferences contains the elliptic curves that will be used in
// an ECDHE handshake, in preference order. If empty, the default will
// be used. The client will use the first preference as the type for
// its key share in TLS 1.3. This may change in the future.
```
In general, we should prefer `X25519` over the NIST curves.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
- Always reformat all disks when a new disk is detected, this will
ensure new uploads to be written in new fresh disks
- Always heal all buckets first when an erasure set started to be healed
- Use a lock to prevent two disks belonging to different nodes but in
the same erasure set to be healed in parallel
- Heal different sets in parallel
Bonus:
- Avoid logging errUnformattedDisk when a new fresh disk is inserted but
not detected by healing mechanism yet (10 seconds lag)
site replication errors were printed at
various random locations, repeatedly - this
PR attempts to remove double logging and
capture all of them at a common place.
This PR also enhances the code to show
partial success and errors as well.
`mc admin heal -r <alias>` in a multi setup pools returns incorrectly
grey objects. The reason is that erasure-server-pools.HealObject() runs
HealObject in all pools and returns the result of the first nil
error. However, in the lower erasureObject level, HealObject() returns
nil if an object does not exist + missing error in each disk of the object
in that pool, therefore confusing mc.
Make erasureObject.HealObject() to return not found error in the lower
level, so at least erasureServerPools will know what pools to ignore.
`config.ResolveConfigParam` returns the value of a configuration for any
subsystem based on checking env, config store, and default value. Also returns info
about which config source returned the value.
This is useful to return info about config params overridden via env in the user
APIs. Currently implemented only for OpenID subsystem, but will be extended for
others subsequently.
If sending a white space during a long S3 handler call fails,
the whitespace goroutine forgets to return a result to the caller.
Therefore, the complete multipart handler will be blocked.
Remember to send the header written result to the caller
or/and close the channel.
- currently subnet health check was freezing and calling
locks at multiple locations, avoid them.
- throw errors if first attempt itself fails with no results
Erasure SD DeleteObjects() is only inheriting bucket versioning status
from the handler layer.
Add the missing versioning prefix evaluation for each object that will
deleted.
PR #15052 caused a regression, add the missing metrics back.
Bonus:
- internode information should be only for distributed setups
- update the dashboard to include 4xx and 5xx error panels.
this allows for customers to use `mc admin service restart`
directly even when performing RPM, DEB upgrades. Upon such 'restart'
after upgrade MinIO will re-read the /etc/default/minio for any
newer environment variables.
As long as `MINIO_CONFIG_ENV_FILE=/etc/default/minio` is set, this
is honored.
Currently minio_s3_requests_errors_total covers 4xx and
5xx S3 responses which can be confusing when s3 applications
sent a lot of HEAD requests with obvious 404 responses or
when the replication is enabled.
Add
- minio_s3_requests_4xx_errors_total
- minio_s3_requests_5xx_errors_total
to help users monitor 4xx and 5xx HTTP status codes separately.
peerOnlineCounter was making NxN calls to many peers, this
can be really long and tedious if there are random servers
that are going down.
Instead we should calculate online peers from the point of
view of "self" and return those online and offline appropriately
by performing a healthcheck.
The 'go mod vendor' command generates a directory called
'vendor' in the main module's root directory, which includes
the required packages to support builds. Therefore, we can
include the 'vendor' directory in .gitignore completely,
regardless of any file extension.
* Add periodic callhome functionality
Periodically (every 24hrs by default), fetch callhome information and
upload it to SUBNET.
New config keys under the `callhome` subsystem:
enable - Set to `on` for enabling callhome. Default `off`
frequency - Interval between callhome cycles. Default `24h`
* Improvements based on review comments
- Update `enableCallhome` safely
- Rename pctx to ctx
- Block during execution of callhome
- Store parsed proxy URL in global subnet config
- Store callhome URL(s) in constants
- Use existing global transport
- Pass auth token to subnetPostReq
- Use `config.EnableOn` instead of `"on"`
* Use atomic package instead of lock
* Use uber atomic package
* Use `Cancel` instead of `cancel`
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
PR #15041 fixed replicating 'null' version however
due to a regression from #14994 caused the target
versions for these 'null' versioned objects to have
different 'versions', this may cause confusion with
bi-directional replication and cause double replication.
This PR fixes this properly by making sure we replicate
the correct versions on the objects.
mergeEntryChannels has the potential to perpetually
wait on the results channel, context might be closed
and we did not honor the caller context canceling.
The S3 service can be frozen indefinitely if a client or mc asks for object
perf API but quits early or has some networking issues. The reason is
that partialWrite() can block indefinitely.
This commit makes partialWrite() listens to context cancellation as well. It
also renames deadlinedCtx to healthCtx since it covers handler context
cancellation and not only not only the speedtest deadline.
In a streaming response, the client knows the size of a streamed
message but never checks the message size. Add the check to error
out if the response message is truncated.
Indexed streams would be decoded by the legacy loader if there
was an error loading it. Return an error when the stream is indexed
and it cannot be loaded.
Fixes "unknown minor metadata version" on corrupted xl.meta files and
returns an actual error.
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
readAllXL would return inlined data for outdated disks
causing "read" to return incorrect content to the client,
this PR fixes this behavior by making sure we skip such
outdated disks appropriately based on the latest ModTime
on the disk.
Main motivation is move towards a common backend format
for all different types of modes in MinIO, allowing for
a simpler code and predictable behavior across all features.
This PR also brings features such as versioning, replication,
transitioning to single drive setups.
Following code can reproduce an unending go-routine buildup,
while keeping connections established due to lack of client
not closing the connections.
https://gist.github.com/harshavardhana/2d00e6f909054d2d2524c71485ad02e1
Without this PR all MinIO deployments can be put into
denial of service attacks, causing entire service to be
unavailable.
We bring in two timeouts at this stage to control such
go-routine build ups, new change
- IdleTimeout (to kill off idle connections)
- ReadHeaderTimeout (to kill off connections that are too slow)
This new change also brings two hidden options to make any
additional relevant changes if desired in some setups.
It would seem like the PR #11623 had chewed more
than it wanted to, non-fips build shouldn't really
be forced to use slower crypto/sha256 even for
presumed "non-performance" codepaths. In MinIO
there are really no "non-performance" codepaths.
This assumption seems to have had an adverse
effect in certain areas of CPU usage.
This PR ensures that we stick to sha256-simd
on all non-FIPS builds, our most common build
to ensure we get the best out of the CPU at
any given point in time.
- Adds an STS API `AssumeRoleWithCustomToken` that can be used to
authenticate via the Id. Mgmt. Plugin.
- Adds a sample identity manager plugin implementation
- Add doc for plugin and STS API
- Add an example program using go SDK for AssumeRoleWithCustomToken
this PR also fixes a situation where incorrect
partsMetadata slice was used where fi.Data was
re-used from a single drive causing duplication
of the shards across all drives.
This happens for situations where shouldHeal()
returns true for all drives > parityBlocks.
To avoid this we should never attempt to heal on all
drives > parityBlocks, unless we are doing metadata
migration from xl.json -> xl.meta
If one or more pools reach 85% usage in a set, we will only
use pools that have more free space.
In case all pools are above 85% we allow all of them to be used
with the regular distribution.
When a server pool with a different number of sets is added they are
not compensated when choosing a destination pool for new objects.
This leads to the unbalanced placement of objects with smaller pools
getting a bigger number of objects since we only compare the destination
sets directly.
This change will compensate for differences in set sizes when choosing
the destination pool.
Different set sizes are already compensated by fewer disks.
updating metadata with CopyObject on a versioned bucket
causes the latest version to be not readable, this PR fixes
this properly by handling the inline data bug fix introduced
in PR #14780.
This bug affects only inlined data.
* Do not use inline data size in xl.meta quorum calculation
Data shards of one object can different inline/not-inline decision
in multiple disks. This happens with outdated disks when inline
decision changes. For example, enabling bucket versioning configuration
will change the small file threshold.
When the parity of an object becomes low, GET object can return 503
because it is not unable to calculate the xl.meta quorum, just because
some xl.meta has inline data and other are not.
So this commit will be disable taking the size of the inline data into
consideration when calculating the xl.meta quorum.
* Add tests for simulatenous inline/notinline object
Co-authored-by: Anis Elleuch <anis@min.io>
current implementation relied on recursively calling one bucket
at a time across all peers, this would be very slow and chatty
when there are 100's of buckets which would mean 100*peerCount
amount of network operations.
This PR attempts to reduce this entire call into `peerCount`
amount of network calls only. This functionality addresses also a
concern where the Prometheus metrics would significantly slow
down when one of the peers is offline.
Fix fallback hot loop
fd was never refreshed, leading to an infinite hot loop if a disk failed and the fallback disk fails as well.
Fix & simplify retry loop.
Fixes#14960
One usee reported having mc admin heal status output ETA increasing
by time. It turned out it is MRF that is not clearing its data due to a
bug in the code.
pendingItems is increased when an object is queued to be healed but
never decreasd when there is a healing error. This commit will decrease
pendingItems and pendingBytes even when there is an error to give
accurate reporting.
If LDAP is enabled, STS security token policy is evaluated using a
different code path and expects ldapUser claim to exist in the security
token. This means other STS temporary accounts generated by any Assume
Role function, such as AssumeRoleWithCertificate, won't be allowed to do any
operation as these accounts do not have LDAP user claim.
Since IsAllowedLDAPSTS() is similar to IsAllowedSTS(), this commit will
merge both.
Non harmful changes:
- IsAllowed for LDAP will start supporting RoleARN claim
- IsAllowed for LDAP will not check for parent claim anymore. This check doesn't
seem to be useful since all STS login compare access/secret/security-token
with the one saved in the disk.
- LDAP will support $username condition in policy documents.
Co-authored-by: Anis Elleuch <anis@min.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
.Reset() documentation states:
For a Timer created with NewTimer, Reset should be invoked only on stopped
or expired timers with drained channels.
This change is just to comply with this requirement as there might be some
runtime dependent situation that might lead to unexpected behavior.
it seems in some places we have been wrongly using the
timer.Reset() function, nicely exposed by an example
shared by @donatello https://go.dev/play/p/qoF71_D1oXD
this PR fixes all the usage comprehensively
anything that is stuck on the disk today can cause latency
spikes for all incoming S3 I/O, we need to have this
de-coupled so that we can make sure that latency in loading
credentials are not reflected back to the S3 API calls.
The approach this PR takes is by checking if the calls were
updated just in case when the IAM load was in progress,
so that we can use merge instead of "replacement" to avoid
missing state.
When configuring a new target, such as an audit target, the server waits
until all audit events are sent to the audit target before doing the
swap from the old to the new audit target. Therefore current S3 operations
can suffer from this since the audit swap lock will be held.
This behavior is unnecessary as the new audit target can enter in a
functional mode immediately and the old audit will just cancel itself
at its own pace.
Object tags can have special characters such as whitespace. However
the current code doesn't properly consider those characters while
evaluating the lifecycle document.
ObjectInfo.UserTags contains an url encoded form of object tags
(e.g. key+1=val)
This commit fixes the issue by using the tags package to parse object tags.
The test expects from DeleteFile to return errDiskNotFound when the disk
is not available. It calls os.RemoveAll() to remove one disk after XL storage
initialization. However, this latter contains some goroutines which can
race with os.RemoveAll() and then the test fails sporadically with
returning random errors.
The commit will tweak the initialization routine of the XL storage to
only run deletion of temporary and metacache data in the background,
so TestXLStorageDeleteFile won't fail anymore.
currently, we allowed buckets to be listed from the
API call if and when the user has ListObject()
permission at the global level, this is okay to be
extended to GetBucketLocation() as well since
GetBucketLocation() is a "read" call and allowing "reads"
on a bucket has an implicit assumption that ListBuckets()
should be allowed.
This makes discoverability of access for read-only users
becomes easier or users with specific restrictions on their
policies.
This PR simplifies few things by splitting
the locks between audit, logger targets to
avoid potential contention between them.
any failures inside audit/logger HTTP
targets must only log to console instead
of other targets to avoid cyclical dependency.
avoids unneeded atomic variables instead
uses RWLock to differentiate a more common
read phase v/s lock phase.
- This change renames the OPA integration as Access Management Plugin - there is
nothing specific to OPA in the integration, it is just a webhook.
- OPA configuration is automatically migrated to Access Management Plugin and
OPA specific configuration is marked as deprecated.
- OPA doc is updated and moved.
In case of multi-pools setup, GetObjectNInfo returns a GetObjectReader
but it unlocks the read lock when quitting GetObjectNInfo. This should
not happen, unlock should only happen when GetObjectReader is closed.
- do not need to restrict prefix exclusions that do not
have `/` as suffix, relax this requirement as spark may
have staging folders with other autogenerated characters
, so we are better off doing full prefix March and skip.
- multiple delete objects was incorrectly creating a
null delete marker on a versioned bucket instead of
creating a proper versioned delete marker.
- do not suspend paths on the excluded prefixes during
delete operations to avoid creating `null` delete markers,
honor suspension of versioning only at bucket level for
delete markers.
PR #14828 introduced prefix-level exclusion of versioning
and replication - however our site replication implementation
since it defaults versioning on all buckets did not allow
changing versioning configuration once the bucket was created.
This PR changes this and ensures that such changes are honored
and also propagated/healed across sites appropriately.
Spark/Hadoop workloads which use Hadoop MR
Committer v1/v2 algorithm upload objects to a
temporary prefix in a bucket. These objects are
'renamed' to a different prefix on Job commit.
Object storage admins are forced to configure
separate ILM policies to expire these objects
and their versions to reclaim space.
Our solution:
This can be avoided by simply marking objects
under these prefixes to be excluded from versioning,
as shown below. Consequently, these objects are
excluded from replication, and don't require ILM
policies to prune unnecessary versions.
- MinIO Extension to Bucket Version Configuration
```xml
<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Status>Enabled</Status>
<ExcludeFolders>true</ExcludeFolders>
<ExcludedPrefixes>
<Prefix>app1-jobs/*/_temporary/</Prefix>
</ExcludedPrefixes>
<ExcludedPrefixes>
<Prefix>app2-jobs/*/__magic/</Prefix>
</ExcludedPrefixes>
<!-- .. up to 10 prefixes in all -->
</VersioningConfiguration>
```
Note: `ExcludeFolders` excludes all folders in a bucket
from versioning. This is required to prevent the parent
folders from accumulating delete markers, especially
those which are shared across spark workloads
spanning projects/teams.
- To enable version exclusion on a list of prefixes
```
mc version enable --excluded-prefixes "app1-jobs/*/_temporary/,app2-jobs/*/_magic," --exclude-prefix-marker myminio/test
```
when the site is being removed is missing replication config. This can
happen when a new deployment is brought in place of a site that
is lost/destroyed and needs to delink old deployment from site
replication.
console logging peer API was broken as it would
timeout after 15minutes, this never really worked
beyond this value and basically failed to provide
the streaming "log" functionality that was expected
from this implementation.
also fix convoluted channel handling by keeping things
simple, this is rewritten.
do not modify opts.UserDefined after object-handler
has set all the necessary values, any mutation needed
should be done on a copy of this value not directly.
As there are other pieces of code that access opts.UserDefined
concurrently this becomes challenging.
fixes#14856
When a decommission task is successfully completed, failed, or canceled,
this commit allows restarting the decommission again. Restarting is not
allowed when there is an ongoing decommission task.
this PR introduces a few changes such as
- sessionPolicyName is not reused in an extracted manner
to apply policies for incoming authenticated calls,
instead uses a different key to designate this
information for the callers.
- this differentiation is needed to ensure that service
account updates do not accidentally store JSON representation
instead of base64 equivalent on the disk.
- relax requirements for Deleting a service account, allow
deleting a service account that might be unreadable, i.e
a situation where the user might have removed session policy
which now carries a JSON representation, making it unparsable.
- introduce some constants to reuse instead of strings.
fixes#14784
If an invalid status code is generated from an error we risk panicking. Even if there
are no potential problems at the moment we should prevent this in the future.
Add safeguards against this.
Sample trace:
```
May 02 06:41:39 minio[52806]: panic: "GET /20180401230655.PDF": invalid WriteHeader code 0
May 02 06:41:39 minio[52806]: goroutine 16040430822 [running]:
May 02 06:41:39 minio[52806]: runtime/debug.Stack(0xc01fff7c20, 0x25c4b00, 0xc0490e4080)
May 02 06:41:39 minio[52806]: runtime/debug/stack.go:24 +0x9f
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.setCriticalErrorHandler.func1.1(0xc022048800, 0x4f38ab0, 0xc0406e0fc0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/generic-handlers.go:469 +0x85
May 02 06:41:39 minio[52806]: panic(0x25c4b00, 0xc0490e4080)
May 02 06:41:39 minio[52806]: runtime/panic.go:965 +0x1b9
May 02 06:41:39 minio[52806]: net/http.checkWriteHeaderCode(...)
May 02 06:41:39 minio[52806]: net/http/server.go:1092
May 02 06:41:39 minio[52806]: net/http.(*response).WriteHeader(0xc0406e0fc0, 0x0)
May 02 06:41:39 minio[52806]: net/http/server.go:1126 +0x718
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc032fa3ea0, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc032fa3f40, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc002ce8000, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.writeResponse(0x4f364a0, 0xc002ce8000, 0x0, 0xc0443b86c0, 0x1cb, 0x224, 0x2a9651e, 0xf)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/api-response.go:736 +0x18d
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.writeErrorResponse(0x4f44218, 0xc069086ae0, 0x4f364a0, 0xc002ce8000, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656afc0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/api-response.go:798 +0x306
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.objectAPIHandlers.getObjectHandler(0x4b73768, 0x4b73730, 0x4f44218, 0xc069086ae0, 0x4f82090, 0xc002d80620, 0xc040e03885, 0xe, 0xc040e03894, 0x61, ...)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/object-handlers.go:456 +0x252c
```
space characters at the beginning or at the end can lead to
confusion under various UI elements in differentiating the
actual name of "policy, user or group" - to avoid this behavior
this PR onwards we shall reject such inputs for newer entries.
existing saved entries will behave as is and are going to be
operable until they are removed/renamed to something more
meaningful.
- When using multiple providers, claim-based providers are not allowed. All
providers must use role policies.
- Update markdown config to allow `details` HTML element
this is allowed as long as order is preserved as is
on an existing setup, the new command line is updated
in `pool.bin` to facilitate future decommission's on
these pools.
introduce x-minio-force-create environment variable
to force create a bucket and its metadata as required,
it is useful in some situations when bucket metadata
needs recovery.
improvements in this PR include
- decommission objects that have __XLDIR__ suffix
- decommission objects that have `null` version on
a versioned bucket.
- make sure to look for any "decom" failures to ensure
that we do not wrong conclude decom as complete without
all files getting copied over.
- break out eagerly upon first error for objects with
multiple versions, leave the object as is for support
debugging and analysis.
heal bucket metadata and IAM entries for
sites participating in site replication from
the site with the most updated entry.
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <aditya@minio.io>
The site replication status call was using a loop iteration variable sent
directly into go-routines instead of being passed as an argument. As the
variable is being updated in the loop, previously launched go routines do not
necessarily use the value at the time they were launched.
This PR fixes two issues
- The first fix is a regression from #14555, the fix itself in #14555
is correct but the interpretation of that information by the
object layer code for "replication" was not correct. This PR
tries to fix this situation by making sure the "Delete" replication
works as expected when "VersionPurgeStatus" is already set.
Without this fix, there is a DELETE marker created incorrectly on
the source where the "DELETE" was triggered.
- The second fix is perhaps an older problem started since we inlined-data
on the disk for small objects, CopyObject() incorrectly inline's
a non-inlined data. This is due to the fact that we have code where
we read the `part.1` under certain conditions where the size of the
`part.1` is less than the specific "threshold".
This eventually causes problems when we are "deleting" the data that
is only inlined, which means dataDir is ignored leaving such
dataDir on the disk, that looks like an inconsistent content on
the namespace.
fixes#14767
It is wasteful to allow parallel upgrades of MinIO server. This also generates
weird error invoked by selfupdate module when it happens such as:
'rename /opt/bin/.minio.old /opt/bin/..minio.old.old'
currently filterPefix was never used and set
that would filter out entries when needed
when `prefix` doesn't end with `/` - this
often leads to objects getting Walked(), Healed()
that were never requested by the caller.
without this wait there is a potential for some objects
that are in actively being decommissioned would cancel,
however the decommission status might wrongly conclude
this as "Complete".
To avoid this make sure to add waitgroups on the parallel
workers, allowing parallel copies to complete fully before
we return.
In previous releases, mc admin user list would return the list of users
that have policies mapped in IAM database. However, this was removed but
this commit will bring it back until we revamp this.
- This change switches to a new parquet library
- SelectObjectContent now takes a single lock at the beginning and holds it
during the operation. Previously the operation took a lock every time the
parquet library performed a Seek on the underlying object stream.
- Add basic support for LogicalType annotations for timestamps.
Execute the object, drive and net speedtests as part of the healthinfo
(if requested by the client), and include their result in the response.
The options for the speedtests have been picked from the default values
used by `mc support perf` command.
This commit improves the listing of encrypted objects:
- Use `etag.Format` and `etag.Decrypt`
- Detect SSE-S3 single-part objects in a single iteration
- Fix batch size to `250`
- Pass request context to `DecryptAll` to not waste resources
when a client cancels the operation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds two new functions to the
internal `etag` package:
- `ETag.Format`
- `Decrypt`
The `Decrypt` function decrypts an encrypted
ETag using a decryption key. It returns not
encrypted / multipart ETags unmodified.
The `Decrypt` function is mainly used when
handling SSE-S3 encrypted single-part objects.
In particular, the ETag of an SSE-S3 encrypted
single-part object needs to be decrypted since
S3 clients expect that this ETag is equal to the
content MD5.
The `ETag.Format` method also covers SSE ETag handling.
MinIO encrypts all ETags of SSE single part objects.
However, only the ETag of SSE-S3 encrypted single part
objects needs to be decrypted.
The ETag of an SSE-C or SSE-KMS single part object
does not correspond to its content MD5 and can be
a random value.
The `ETag.Format` function formats an ETag such that
it is an AWS S3 compliant ETag. In particular, it
returns non-encrypted ETags (single / multipart)
unmodified. However, for encrypted ETags it returns
the trailing 16 bytes as ETag. For encrypted ETags
the last 16 bytes will be a random value.
The main purpose of `Format` is to format ETags
such that clients accept them as well-formed AWS S3
ETags.
It differs from the `String` method since `String`
will return string representations for encrypted
ETags that are not AWS S3 compliant.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
The deployment id was being written to the health report towards the end
of the handler. Because of this, if there was a timeout in any of the
data fetching, the deployment id was not getting written at all. Upload
of such reports fails on SUBNET as deployment id is the unique
identifier for a cluster in subnet.
Fixed by writing the deployment id at the beginning of the processing.
This commit simplifies the ETag decryption and size adjustment
when listing object parts.
When listing object parts, MinIO has to decrypt the ETag of all
parts if and only if the object resp. the parts is encrypted using
SSE-S3.
In case of SSE-KMS and SSE-C, MinIO returns a pseudo-random ETag.
This is inline with AWS S3 behavior.
Further, MinIO has to adjust the size of all encrypted parts due to
the encryption overhead.
The ListObjectParts does specifically not use the KMS bulk decryption
API (4d2fc530d0) since the ETags of all
parts are encrypted using the same object encryption key. Therefore,
MinIO only has to connect to the KMS once, even if there are multiple
parts resp. ETags. It can simply reuse the same object encryption key.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for encrypted KES
client private keys.
Now, it is possible to encrypt the KES client
private key (`MINIO_KMS_KES_KEY_FILE`) with
a password.
For example, KES CLI already supports the
creation of encrypted private keys:
```
kes identity new --encrypt --key client.key --cert client.crt MinIO
```
To decrypt an encrypted private key, the password
needs to be provided:
```
MINIO_KMS_KES_KEY_PASSWORD=<password>
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit optimises the ETag decryption when
listing objects.
When MinIO lists objects, it has to decrypt the
ETags of single-part SSE-S3 objects.
It does not need to decrypt ETags of
- plaintext objects => Their ETag is not encrypted
- SSE-C objects => Their ETag is not the content MD5
- SSE-KMS objects => Their ETag is not the content MD5
- multipart objects => Their ETag is not encrypted
Hence, MinIO only needs to make a call to the KMS
when it needs to decrypt a single-part SSE-S3 object.
It can resolve the ETags off all other object types
locally.
This commit implements the above semantics by
processing an object listing in batches.
If the batch contains no single-part SSE-S3 object,
then no KMS calls will be made.
If the batch contains at least one single-part
SSE-S3 object we have to make at least one KMS call.
No we first filter all single-part SSE-S3 objects
such that we only request the decryption keys for
these objects.
Once we know which objects resp. ETags require a
decryption key, MinIO either uses the KES bulk
decryption API (if supported) or decrypts each
ETag serially.
This commit is a significant improvement compared
to the previous listing code. Before, a single
non-SSE-S3 object caused MinIO to fall-back to
a serial ETag decryption.
For example, if a batch consisted of 249 SSE-S3
objects and one single SSE-KMS object, MinIO would
send 249 requests to the KMS.
Now, MinIO will send a single request for exactly
those 249 objects and skip the one SSE-KMS object
since it can handle its ETag locally.
Further, MinIO would request decryption keys
for SSE-S3 multipart objects in the past - even
though multipart ETags are not encrypted.
So, if a bucket contained only multipart SSE-S3
objects, MinIO would make totally unnecessary
requests to the KMS.
Now, MinIO simply skips these multipart objects
since it can handle the ETags locally.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
In bulk ETag decryption, do not rely on the etag to check if it is
encrypted or not to decide if we should set the actual object size in
ObjectInfo. The reason is that multipart objects ETags are not
encrypted.
Always get the actual object size in that case.
This commit fixes a subtle bug in the ETag
`IsEncrypted` implementation.
An encrypted ETag may contain random bytes,
i.e. some randomness used for encryption.
This random value can contain a '-' byte
simple due to being randomly generated.
Before, the `IsEncrypted` implementation
incorrectly assumed that an encrypted ETag
cannot contain a '-' since it would be a
multipart ETag. Multipart ETags have a
16 byte value followed by a '-' and the part number.
For example:
```
059ba80b807c3c776fb3bcf3f33e11ae-2
```
However, the following encrypted ETag
```
20000f00db2d90a7b40782d4cff2b41a7799fc1e7ead25972db65150118dfbe2ba76a3c002da28f85c840cd2001a28a9
```
also contains a '-' byte but is not a multipart ETag.
This commit fixes the `IsEncrypted` implementation
simply by checking whether the ETag is at least 32
bytes long. A valid multipart ETag is never 32 bytes
long since a part number must be <= 10000.
However, an encrypted ETag must be at least 32 bytes
long. It contains the encrypted ETag bytes (16 bytes)
and the authentication tag added by the AEAD cipher (again
16 bytes).
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for bulk ETag
decryption for SSE-S3 encrypted objects.
If KES supports a bulk decryption API, then
MinIO will check whether its policy grants
access to this API. If so, MinIO will use
a bulk API call instead of sending encrypted
ETags serially to KES.
Note that MinIO will not use the KES bulk API
if its client certificate is an admin identity.
MinIO will process object listings in batches.
A batch has a configurable size that can be set
via `MINIO_KMS_KES_BULK_API_BATCH_SIZE=N`.
It defaults to `500`.
This env. variable is experimental and may be
renamed / removed in the future.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
ListObjects, ListObjectsV2 calls are being heavily taxed when
there are many versions on objects left over from a previous
release or ILM was never setup to clean them up. Instead
of being absolutely correct at resolving the exact latest
version of an object, we simply rely on the top most 1
version and resolve the rest.
Once we have obtained the top most "1" version for
ListObject, ListObjectsV2 call we break out.
In riscv64, the `syscall.Uname` function will return a uint8 slice.
func main() {
var buf syscall.Utsname
fmt.Printf("Buffer Type: %T\n", buf.Release)
}
output:
Buffer Type: [65]uint8
This is tested in the Arch Linux RISC-V 64 QEMU environment.
Signed-off-by: Avimitin <avimitin@gmail.com>
For ListObjects and ListObjectsV2 perform lifecycle checks on
all objects before returning. This will filter out objects that are
pending lifecycle expiration.
Bonus: Cheaper server pool conflict resolution by not converting to FileInfo.
When reloading a dynamic config allow the request pool to scale both ways.
Existing requests hold on to the previous pool, so they will pop the elements from that.
currently an on-going decommission, during a server
restart might block the startup sequence for relatively
longer periods, instead start the decommission in
background lazily.
This commit fixes two bugs in the `PutObjectPartHandler`.
First, `PutObjectPart` should return SSE-KMS headers
when the object is encrypted using SSE-KMS.
Before, this was not the case.
Second, the ETag should always be a 16 byte hex string,
perhaps followed by a `-X` (where `X` is the number of parts).
However, `PutObjectPart` used to return the encrypted ETag
in case of SSE-KMS. This leaks MinIO internal etag details
through the S3 API.
The combination of both bugs causes clients that use SSE-KMS
to fail when trying to validate the ETag. Since `PutObjectPart`
did not send the SSE-KMS response headers, the response looked
like a plaintext `PutObjectPart` response. Hence, the client
tries to verify that the ETag is the content-md5 of the part.
This could never be the case, since MinIO used to return the
encrypted ETag.
Therefore, clients behaving as specified by the S3 protocol
tried to verify the ETag in a situation they should not.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Fix `panic: "POST /minio/peer/v21/signalservice?signal=2": sync: WaitGroup is reused before previous Wait has returned`
Log entries already on the channel would cause `logEntry` to increment the
waitgroup when sending messages, after Cancel has been called.
Instead of tracking every single message, just check the send goroutine. Faster
and safe, since it will not decrement until the channel is closed.
Regression from #14289
When more than 2 disks are unavailable for listing, the same disk will be used for fallback.
This makes quorum calculations incorrect since the same disk will have multiple entries.
This PR keeps track of which fallback disks have been handed out and only every returns a disk once.
avoids creating new transport for each `isServerResolvable`
request, instead re-use the available global transport and do
not try to forcibly close connections to avoid TIME_WAIT
build upon large clusters.
Never use httpClient.CloseIdleConnections() since that can have
a drastic effect on existing connections on the transport pool.
Remove it everywhere.
FROM registry.access.redhat.com/ubi8/ubi-micro:latest
ARG RELEASE
LABELname="MinIO"\
vendor="MinIO Inc <dev@min.io>"\
maintainer="MinIO Inc <dev@min.io>"\
version="${RELEASE}"\
release="${RELEASE}"\
summary="MinIO is a High Performance Object Storage, API compatible with Amazon S3 cloud storage service."\
description="MinIO object storage is fundamentally different. Designed for performance and the S3 API, it is 100% open-source. MinIO is ideal for large, private cloud environments with stringent security requirements and delivers mission-critical availability across a diverse range of workloads."
test-replication:install-racetest-replication-2sitetest-replication-3sitetest-delete-replicationtest-sio-errortest-delete-marker-proxying## verify multi site replication
@echo "Running tests for replicating three sites"
test-site-replication-ldap:install-race## verify automatic site replication
@echo "Running tests for automatic site replication of IAM (with LDAP)"
MinIO creates FIPS builds using a patched version of the Go compiler (that uses BoringCrypto, from BoringSSL, which is [FIPS 140-2 validated](https://csrc.nist.gov/csrc/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp2964.pdf)) published by the Golang Team [here](https://github.com/golang/go/tree/dev.boringcrypto/misc/boring).
MinIO FIPS executables are available at <http://dl.min.io> - they are only published for `linux-amd64` architecture as binary files with the suffix `.fips`. We also publish corresponding container images to our official image repositories.
We are not making any statements or representations about the suitability of this code or build in relation to the FIPS 140-2 standard. Interested users will have to evaluate for themselves whether this is useful for their own purposes.
@@ -14,7 +14,7 @@ Use the following commands to run a standalone MinIO server as a container.
Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication
require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically,
with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Quickstart Guide](https://docs.min.io/docs/minio-erasure-code-quickstart-guide.html)
with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html)
for more complete documentation.
### Stable
@@ -32,7 +32,7 @@ root credentials. You can use the Browser to create buckets, upload objects, and
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See
[Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers,
see <https://docs.min.io/docs/> and click **MinIO SDKs** in the navigation to view MinIO SDKs for supported languages.
see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
> NOTE: To deploy MinIO on with persistent storage, you must map local persistent directories from the host OS to the container using the `podman -v` option. For example, `-v /mnt/data:/data` maps the host OS drive at `/mnt/data` to `/data` on the container.
@@ -40,7 +40,7 @@ see <https://docs.min.io/docs/> and click **MinIO SDKs** in the navigation to vi
Use the following commands to run a standalone MinIO server on macOS.
Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Quickstart Guide](https://docs.min.io/docs/minio-erasure-code-quickstart-guide.html) for more complete documentation.
Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html) for more complete documentation.
### Homebrew (recommended)
@@ -60,7 +60,7 @@ brew install minio/stable/minio
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://docs.min.io/docs/> and click **MinIO SDKs** in the navigation to view MinIO SDKs for supported languages.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html/> to view MinIO SDKs for supported languages.
### Binary Download
@@ -74,7 +74,7 @@ chmod +x minio
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://docs.min.io/docs/> and click **MinIO SDKs** in the navigation to view MinIO SDKs for supported languages.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
## GNU/Linux
@@ -86,8 +86,6 @@ chmod +x minio
./minio server /data
```
Replace ``/data`` with the path to the drive or directory in which you want MinIO to store data.
The following table lists supported architectures. Replace the `wget` URL with the architecture for your Linux host.
| Architecture | URL |
@@ -99,9 +97,9 @@ The following table lists supported architectures. Replace the `wget` URL with t
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://docs.min.io/docs/> and click **MinIO SDKs** in the navigation to view MinIO SDKs for supported languages.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Quickstart Guide](https://docs.min.io/docs/minio-erasure-code-quickstart-guide.html) for more complete documentation.
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html#) for more complete documentation.
## Microsoft Windows
@@ -119,23 +117,23 @@ minio.exe server D:\
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://docs.min.io/docs/> and click **MinIO SDKs** in the navigation to view MinIO SDKs for supported languages.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Quickstart Guide](https://docs.min.io/docs/minio-erasure-code-quickstart-guide.html) for more complete documentation.
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html#) for more complete documentation.
## Install from Source
Use the following commands to compile and run a standalone MinIO server from source. Source installation is only intended for developers and advanced users. If you do not have a working Golang environment, please follow [How to install Golang](https://golang.org/doc/install). Minimum version required is [go1.17](https://golang.org/dl/#stable)
Use the following commands to compile and run a standalone MinIO server from source. Source installation is only intended for developers and advanced users. If you do not have a working Golang environment, please follow [How to install Golang](https://golang.org/doc/install). Minimum version required is [go1.21](https://golang.org/dl/#stable)
```sh
GO111MODULE=on go install github.com/minio/minio@latest
go install github.com/minio/minio@latest
```
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://docs.min.io/docs/> and click **MinIO SDKs** in the navigation to view MinIO SDKs for supported languages.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Quickstart Guide](https://docs.min.io/docs/minio-erasure-code-quickstart-guide.html) for more complete documentation.
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html) for more complete documentation.
MinIO strongly recommends *against* using compiled-from-source MinIO servers for production environments.
When deployed on a single drive, MinIO server lets clients access any pre-existing data in the data directory. For example, if MinIO is started with the command `minio server /mnt/data`, any pre-existing data in the `/mnt/data` directory would be accessible to the clients.
The above statement is also valid for all gateway backends.
## Test MinIO Connectivity
### Test using MinIO Console
MinIO Server comes with an embedded web based object browser. Point your web browser to <http://127.0.0.1:9000> to ensure your server has started successfully.
> NOTE: MinIO runs console on random port by default if you wish choose a specific port use `--console-address` to pick a specific interface and port.
> NOTE: MinIO runs console on random port by default, if you wish to choose a specific port use `--console-address` to pick a specific interface and port.
### Things to consider
@@ -218,17 +210,13 @@ For deployments behind a load balancer, proxy, or ingress rule where the MinIO h
For example, consider a MinIO deployment behind a proxy `https://minio.example.net`, `https://console.minio.example.net` with rules for forwarding traffic on port :9000 and :9001 to MinIO and the MinIO Console respectively on the internal network. Set `MINIO_BROWSER_REDIRECT_URL` to `https://console.minio.example.net` to ensure the browser receives a valid reachable URL.
Similarly, if your TLS certificates do not have the IP SAN for the MinIO server host, the MinIO Console may fail to validate the connection to the server. Use the `MINIO_SERVER_URL` environment variable and specify the proxy-accessible hostname of the MinIO server to allow the Console to use the MinIO server API using the TLS certificate.
For example: `export MINIO_SERVER_URL="https://minio.example.net"`
`mc` provides a modern alternative to UNIX commands like ls, cat, cp, mirror, diff etc. It supports filesystems and Amazon S3 compatible cloud storage services. Follow the MinIO Client [Quickstart Guide](https://docs.min.io/docs/minio-client-quickstart-guide) for further instructions.
`mc` provides a modern alternative to UNIX commands like ls, cat, cp, mirror, diff etc. It supports filesystems and Amazon S3 compatible cloud storage services. Follow the MinIO Client [Quickstart Guide](https://min.io/docs/minio/linux/reference/minio-mc.html#quickstart) for further instructions.
## Upgrading MinIO
@@ -236,32 +224,30 @@ Upgrades require zero downtime in MinIO, all upgrades are non-disruptive, all tr
> NOTE: requires internet access to update directly from <https://dl.min.io>, optionally you can host any mirrors at <https://my-artifactory.example.com/minio/>
- For deployments that installed the MinIO server binary by hand, use [`mc admin update`](https://docs.min.io/minio/baremetal/reference/minio-mc-admin/mc-admin-update.html)
- For deployments that installed the MinIO server binary by hand, use [`mc admin update`](https://min.io/docs/minio/linux/reference/minio-mc-admin/mc-admin-update.html)
```sh
mc admin update <minio alias, e.g., myminio>
```
- For deployments without external internet access (e.g. airgapped environments), download the binary from <https://dl.min.io> and replace the existing MinIO binary let's say for example `/opt/bin/minio`, apply executable permissions `chmod +x /opt/bin/minio` and do `mc admin service restart alias/`.
- For deployments without external internet access (e.g. airgapped environments), download the binary from <https://dl.min.io> and replace the existing MinIO binary let's say for example `/opt/bin/minio`, apply executable permissions `chmod +x /opt/bin/minio` and proceed to perform `mc admin service restart alias/`.
- For RPM/DEB installations, upgrade packages **parallelly** on all servers. Once upgraded, perform `systemctl restart minio` across all nodes in **parallel**. RPM/DEB based installations are usually automated using [`ansible`](https://github.com/minio/ansible-minio).
- For installations using Systemd MinIO service, upgrade via RPM/DEB packages **parallelly** on all servers or replace the binary lets say `/opt/bin/minio` on all nodes, apply executable permissions `chmod +x /opt/bin/minio` and process to perform `mc admin service restart alias/`.
### Upgrade Checklist
- Test all upgrades in a lower environment (DEV, QA, UAT) before applying to production. Performing blind upgrades in production environments carries significant risk.
- Read the release notes for the targeted MinIO release *before* performing any installation, there is no forced requirement to upgrade to latest releases every week. If it has a bug fix you are looking for then yes, else avoid actively upgrading a running production system.
- Make sure MinIO process has write access to `/opt/bin` if you plan to use `mc admin update`. This is needed for MinIO to download the latest binary from <https://dl.min.io> and save it locally for upgrades.
- `mc admin update` is not supported in kubernetes/container environments, container environments provide their own mechanisms for container updates.
- Read the release notes for MinIO *before* performing any upgrade, there is no forced requirement to upgrade to latest release upon every release. Some release may not be relevant to your setup, avoid upgrading production environments unnecessarily.
- If you plan to use `mc admin update`, MinIO process must have write access to the parent directory where the binary is present on the host system.
- `mc admin update` is not supported and should be avoided in kubernetes/container environments, please upgrade containers by upgrading relevant container images.
- **We do not recommend upgrading one MinIO server at a time, the product is designed to support parallel upgrades please follow our recommended guidelines.**
"${MINIO[@]}" server "${WORK_DIR}/fs-disk" >"$WORK_DIR/fs-minio.log" 2>&1&
sleep 10
function start_minio_fs(){
exportMINIO_ROOT_USER=$ACCESS_KEY
exportMINIO_ROOT_PASSWORD=$SECRET_KEY
"${MINIO[@]}" server "${WORK_DIR}/fs-disk" >"$WORK_DIR/fs-minio.log" 2>&1&
"${WORK_DIR}/mc" ready verify
}
function start_minio_erasure()
{
"${MINIO[@]}" server "${WORK_DIR}/erasure-disk1""${WORK_DIR}/erasure-disk2""${WORK_DIR}/erasure-disk3""${WORK_DIR}/erasure-disk4" >"$WORK_DIR/erasure-minio.log" 2>&1&
sleep 15
function start_minio_erasure(){
"${MINIO[@]}" server "${WORK_DIR}/erasure-disk1""${WORK_DIR}/erasure-disk2""${WORK_DIR}/erasure-disk3""${WORK_DIR}/erasure-disk4" >"$WORK_DIR/erasure-minio.log" 2>&1&
File diff suppressed because one or more lines are too long
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.