Commit Graph

187 Commits

Author SHA1 Message Date
Harshavardhana
f965434022 fix: re-use endpoint strings to avoid allocation during audit (#19116) 2024-02-23 16:19:13 -08:00
Harshavardhana
f961ec4aaf fix: revert allow offline disks on fresh start (#19052)
the PR in #16541 was incorrect and hand wrong assumptions
about the overall setup, revert this since this expectation
to have offline servers is wrong and we can end up with a
bigger chicken and egg problem.

This reverts commit 5996c8c4d5.

Bonus:

- preserve disk in globalLocalDrives properly upon connectDisks()
- do not return 'nil' from newXLStorage(), getting it ready for
  the next set of changes for 'format.json' loading.
2024-02-14 10:37:34 -08:00
Harshavardhana
eac4e4b279 honor replaced disk properly by updating globalLocalDrives (#19038)
globalLocalDrives seem to be not updated during the
HealFormat() leads to a requirement where the server
needs to be restarted for the healing to continue.
2024-02-12 13:00:20 -08:00
Harshavardhana
80ca120088 remove checkBucketExist check entirely to avoid fan-out calls (#18917)
Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.

We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
2024-01-30 12:43:25 -08:00
Harshavardhana
486e2e48ea enable xattr capture by default (#18911)
- healing must not set the write xattr
  because that is the job of active healing
  to update. what we need to preserve is
  permanent deletes.

- remove older env for drive monitoring and
  enable it accordingly, as a global value.
2024-01-29 23:03:58 -08:00
Harshavardhana
1d3bd02089 avoid close 'nil' panics if any (#18890)
brings a generic implementation that
prints a stack trace for 'nil' channel
closes(), if not safely closes it.
2024-01-28 10:04:17 -08:00
Harshavardhana
74851834c0 further bootstrap/startup optimization for reading 'format.json' (#18868)
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
  for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
  to make a NoOp DiskInfo call.
2024-01-25 12:45:46 -08:00
Harshavardhana
e377bb949a migrate bootstrap logic directly to websockets (#18855)
improve performance for startup sequences by 2x for 300+ nodes.
2024-01-24 13:36:44 -08:00
Harshavardhana
52229a21cb avoid reload of 'format.json' over the network under normal conditions (#18842) 2024-01-23 14:11:46 -08:00
Harshavardhana
a0e1163fb6 reject reference format from a different deployment (#18800)
reference format is constant for any lifetime of
a minio cluster, we do not have to ever replace
it during HealFormat() as it will never change.

additionally we should simply reject reference
formats that we do not understand early on.
2024-01-16 15:13:14 -08:00
Harshavardhana
e5c8794b8b avoid disk monitoring leaks under various conditions (#18777)
- HealFormat() was leaking healthcheck goroutines for
  disks, we are only interested in enabling healthcheck
  for the newly formatted disk, not for existing disks.

- When disk is a root-disk a random disk monitor was
  leaking while we ignored the drive.

- When loading the disk for each erasure set, we were
  leaking goroutines for the prepare-storage.go disks
  which were replaced via the globalLocalDrives slice

- avoid disk monitoring utilizing health tokens that
  would cause exhaustion in the tokens, prematurely
  which were meant for incoming I/O. This is ensured
  by avoiding writing O_DIRECT aligned buffer instead
  write 2048 worth of content only as O_DSYNC, which is
  sufficient.
2024-01-12 01:48:36 -08:00
Shubhendu
e31081d79d Heal buckets at node level (#18612)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-01-09 20:34:04 -08:00
Harshavardhana
a50ea92c64 feat: introduce list_quorum="auto" to prefer quorum drives (#18084)
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups. 

To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap. 

Once this has been enabled, you need to also set `list_quorum`. 

```
~ mc admin config set alias/ api list_quorum=auto` 
```

A new debugging tool is available to check for any missing counters.
2023-12-29 15:52:41 -08:00
Harshavardhana
5b2ced0119 re-use globalLocalDrives properly (#18721) 2023-12-29 09:30:10 -08:00
Anis Eleuch
8432fd5ac2 prom: Add online and healing drives metrics per erasure set (#18700) 2023-12-21 16:56:43 -08:00
Harshavardhana
7c948adf88 allow pre-allocating buffers to reduce frequent GCs during growth (#18686)
This PR also increases per node bpool memory from 1024 entries
to 2048 entries; along with that, it also moves the byte pool
centrally instead of being per pool.
2023-12-21 08:59:38 -08:00
Harshavardhana
b3314e97a6 re-use the same local drive used by remote-peer (#18645)
historically, we have always kept storage-rest-server
and a local storage API separate without much trouble,
since they both can independently operate due to no
special state() between them.

however, over some time, we have added state()
such as

- drive monitoring threads now there will be "2" of
  them per drive instead of just 1.

- concurrent tokens available per drive are now twice
  instead of just single shared, allowing unexpectedly
  high amount of I/O to go through.

- applying serialization by using walkMutexes can now
  be adequately honored for both remote callers and local
  callers.
2023-12-13 19:27:55 -08:00
Harshavardhana
e30c0e7ca3 Revert "Heal buckets at node level (#18504)"
This reverts commit 708296ae1b.
2023-12-05 22:34:46 -08:00
Shubhendu
708296ae1b Heal buckets at node level (#18504) 2023-12-05 02:17:35 -08:00
Klaus Post
2229509362 fix: leaking offline disks in MarkOffline() thread (#18414)
`monitorAndConnectEndpoints` will continue to attempt to reconnect offline disks.

Since disks were never closed, a `MarkOffline` would continue to try to check these disks forever.

Close previous disks.
2023-11-09 09:33:32 -08:00
Aditya Manthramurthy
1c99fb106c Update to minio/pkg/v2 (#17967) 2023-09-04 12:57:37 -07:00
Harshavardhana
eb55034dfe optimize deletePrefix, use direct set location via object name (#17827)
* optimize deletePrefix, use direct set location via object name

instead of fanning out the calls for an object force delete
we can assume the set location and not do fan-out calls

* Apply suggestions from code review

Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>

---------

Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
2023-08-09 16:30:22 -07:00
Harshavardhana
b0f0e53bba fix: make sure to correctly initialize health checks (#17765)
health checks were missing for drives replaced since

- HealFormat() would replace the drives without a health check
- disconnected drives when they reconnect via connectEndpoint()
  the loop also loses health checks for local disks and merges
  these into a single code.
- other than this separate cleanUp, health check variables to avoid
  overloading them with similar requirements.
- also ensure that we compete via context selector for disk monitoring
  such that the canceled disks don't linger around longer waiting for
  the ticker to trigger.
- allow disabling active monitoring.
2023-08-01 10:54:26 -07:00
Harshavardhana
81be718674 fix: optimize DiskInfo() call avoid metrics when not needed (#17763) 2023-07-31 15:20:48 -07:00
Aditya Manthramurthy
5a1612fe32 Bump up madmin-go and pkg deps (#17469) 2023-06-19 17:53:08 -07:00
Klaus Post
c839b64f6a fix: compressed+encrypted block overhead (#17289) 2023-05-26 10:57:07 -07:00
jiuker
f037c9b286 Protecting the read index is not out of bounds (#17226) 2023-05-17 12:09:41 -07:00
Praveen raj Mani
72802a5972 Use 'minio/pkg/sync/errgroup' and 'minio/pkg/workers' (#17069) 2023-04-25 22:57:40 -07:00
Harshavardhana
84f31ed45d simplify MRF, converge it to regular healing (#17026) 2023-04-19 07:47:42 -07:00
Harshavardhana
6825bd7e75 fix: inlined objects don't need to honor long locks (#17039) 2023-04-17 12:16:37 -07:00
Poorna
d1e775313d support decommissioning of tiered objects (#16751) 2023-03-16 07:48:05 -07:00
Harshavardhana
b4ef5ff294 remove unnecessary code checking for supported features (#16423) 2023-01-17 19:37:47 +05:30
Anis Elleuch
2146ed4033 xl: Quit early when EC config is incorrect (#16390)
Co-authored-by: Anis Elleuch <anis@min.io>
2023-01-09 23:07:45 -08:00
Anis Elleuch
0333412148 fix: heal only once per disk per set among multiple disks (#16358) 2023-01-05 20:41:19 -08:00
Harshavardhana
a15a2556c3 converge listBuckets() as a peer call (#16346) 2023-01-03 23:39:40 -08:00
Harshavardhana
f1bbb7fef5 vectorize cluster-wide calls such as bucket operations (#16313) 2023-01-03 08:16:39 -08:00
Harshavardhana
f93183f66e fix: a deadlock by refactoring listBuckets() under site replication (#16323) 2022-12-29 00:08:31 -08:00
Harshavardhana
b882310e2b avoid locks for internal and invalid buckets in MakeBucket() (#16302) 2022-12-23 07:46:00 -08:00
Aditya Manthramurthy
a30cfdd88f Bump up madmin-go to v2 (#16162) 2022-12-06 13:46:50 -08:00
Harshavardhana
5a8df7efb3 re-implement StorageInfo to be a peer call (#16155) 2022-12-01 14:31:35 -08:00
Klaus Post
cc1d8f0057 Check for abandoned data when healing (#16122) 2022-11-28 10:20:55 -08:00
Klaus Post
98ba622679 Reduce temporary file clean-up waits (#16110) 2022-11-22 07:23:36 -08:00
Klaus Post
ecc932d5dd Clean entire tmp-old on restart (#15979) 2022-10-31 07:27:50 -07:00
Harshavardhana
23b329b9df remove gateway completely (#15929) 2022-10-24 17:44:15 -07:00
Klaus Post
3c605c93fe warn when 0 parity has been set as default parity (#15790) 2022-10-04 22:41:42 -07:00
Klaus Post
a9f1ad7924 Add extended checksum support (#15433) 2022-08-29 16:57:16 -07:00
ebozduman
b57e7321e7 Replaces 'disk'=>'drive' visible to end user (#15464) 2022-08-04 16:10:08 -07:00
Poorna
426c902b87 site replication: fix healing of bucket deletes. (#15377)
This PR changes the handling of bucket deletes for site 
replicated setups to hold on to deleted bucket state until 
it syncs to all the clusters participating in site replication.
2022-07-25 17:51:32 -07:00
Anis Elleuch
ed0cbfb31e fix: rootdisk detection by not using cached value when GetDiskInfo() errors out (#15249)
GetDiskInfo() uses timedValue to cache the disk info for one second.

timedValue behavior was recently changed to return an old cached value
when calculating a new value returns an error.

When a mount point is empty, GetDiskInfo() will return errUnformattedDisk,
timedValue will return cached disk info with unexpected IsRootDisk value,
e.g. false if the mount point belongs to a root disk. Therefore, the mount
point will be considered a valid disk and will be formatted as well.

This commit will also add more defensive code when marking root disks:
always mark a disk offline for any GetDiskInfo() error except
errUnformattedDisk. The server will try anyway to reconnect to those
disks every 10 seconds.
2022-07-07 17:05:23 -07:00
Harshavardhana
32b2f6117e fix: do not pass around sync.Map (#15250)
it is not safe to pass around sync.Map
through pointers, as it may be concurrently
updated by different callers.

this PR simplifies by avoiding sync.Map
altogether, we do not need sync.Map
to keep object->erasureMap association.

This PR fixes a crash when concurrently
using this value when audit logs are
configured.

```
fatal error: concurrent map iteration and map write

goroutine 247651580 [running]:
runtime.throw({0x277a6c1?, 0xc002381400?})
        runtime/panic.go:992 +0x71 fp=0xc004d29b20 sp=0xc004d29af0 pc=0x438671
runtime.mapiternext(0xc0d6e87f18?)
        runtime/map.go:871 +0x4eb fp=0xc004d29b90 sp=0xc004d29b20 pc=0x41002b
```
2022-07-07 17:04:25 -07:00