feat: Improve startup times by setting flow.election.max.candidates #953
Open
sbernauer wants to merge 3 commits into
lfrancke
requested changes
Jul 1, 2026
PROTOCOL_PORT.to_string(),
);

// In case the number of NiFi nodes is hard-coded to a fixed (no auto-scaling), we can tell
Suggested change
- // In case the number of NiFi nodes is hard-coded to a fixed (no auto-scaling), we can tell
+ // In case the number of NiFi nodes is hard-coded to a fixed number (no auto-scaling), we can tell
///
/// This is the case when all `replicas` are set to [`Some<u16>`], in which case they are simply
/// summed.
pub fn maybe_fixed_node_count(&self) -> Option<u32> {
Maybe guard against count being 0 for WHATEVER reason. In that case I'd prefer to emit None instead of Some(0).
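A minimal sketch of what such a guard could look like. This is not the code from the PR: the `RoleGroup` struct and the free-function signature are hypothetical stand-ins, and only the "sum all `replicas`, return `None` if any is unset or the total is 0" behaviour is taken from the discussion above.

```rust
/// Hypothetical stand-in for a role group; only `replicas` matters here.
pub struct RoleGroup {
    pub replicas: Option<u16>,
}

/// Returns the total node count if every role group has a fixed `replicas`
/// value and the total is non-zero; otherwise returns `None`.
pub fn maybe_fixed_node_count(role_groups: &[RoleGroup]) -> Option<u32> {
    let mut total: u32 = 0;
    for role_group in role_groups {
        // `?` returns `None` as soon as one role group leaves `replicas` unset.
        let replicas = u32::from(role_group.replicas?);
        // `checked_add` returns `None` on overflow instead of panicking.
        total = total.checked_add(replicas)?;
    }
    // Guard: report a total of 0 as `None` rather than `Some(0)`.
    (total > 0).then_some(total)
}
```

With this shape, an empty list of role groups and a list whose replicas are all `Some(0)` both yield `None`, so callers never see `Some(0)`.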
Description
TLDR: If the user specifies a fixed number of NiFi nodes (i.e. no auto-scaling), set nifi.cluster.flow.election.max.candidates to that number. This results in much faster NiFi startups, as NiFi no longer needs to wait out the 5 minutes of nifi.cluster.flow.election.max.wait.time.
(Disclaimer: AI helped me with the description; the code is mine.)
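NiFi uses flow election on cold start: nodes vote on which flow definition wins, and the cluster won't finish coming up until the election settles. Two things gate that: nifi.cluster.flow.election.max.wait.time (the election ends once this timeout elapses) and nifi.cluster.flow.election.max.candidates (the election ends early once this many nodes have voted).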
In #936 we correctly raised max.wait.time back to NiFi's upstream default of 5 minutes (the previous 1 min was a leftover "for testing" value that risked electing on incomplete vote sets). That fixed the correctness problem, but it left every cold start paying up to 5 minutes before the cluster is usable — because we were leaving max.candidates empty, so the timeout was the only thing that ended election.
The pain is real and already visible: many of our own kuttl tests, as well as customers, have been reaching for configOverrides to lower max.wait.time again simply because a 5-minute startup is too slow to live with.
This PR resolves the tension instead of trading one problem for the other. When the number of NiFi nodes is fixed (all role-group replicas are set, i.e. no autoscaling), the operator knows the exact node count and sets max.candidates to it. On cold start the cluster now elects the moment all expected nodes have reported in — typically seconds — while max.wait.time stays at the safe 5-minute upstream default as a fallback for the degraded case (a node that never joins).
The result:
This is safe on our StatefulSet setup specifically because pods use podManagementPolicy: Parallel, so all expected nodes start concurrently and the "elect once all candidates are present" fast path can actually trigger. When replicas is left unset anywhere (autoscaling), we fall back to the previous empty value.
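The decision described above can be sketched roughly as follows. This is an illustrative sketch, not the operator's actual code: the `election_properties` helper and the use of a `BTreeMap` are assumptions, and the literal `"5 mins"` assumes NiFi's usual duration syntax for the 5-minute default.

```rust
use std::collections::BTreeMap;

const MAX_CANDIDATES: &str = "nifi.cluster.flow.election.max.candidates";
const MAX_WAIT_TIME: &str = "nifi.cluster.flow.election.max.wait.time";

/// Hypothetical helper that builds only the two election-related properties.
/// `fixed_node_count` is `Some(n)` when all role-group replicas are set,
/// and `None` when any of them is left unset (autoscaling).
pub fn election_properties(fixed_node_count: Option<u32>) -> BTreeMap<String, String> {
    let mut props = BTreeMap::new();

    // The timeout always stays at 5 minutes as the fallback
    // (value format is an assumption, see the lead-in).
    props.insert(MAX_WAIT_TIME.to_string(), "5 mins".to_string());

    // Fixed node count: the election can end once that many nodes have voted.
    // Unknown node count: keep the previous empty value.
    let candidates = fixed_node_count
        .map(|count| count.to_string())
        .unwrap_or_default();
    props.insert(MAX_CANDIDATES.to_string(), candidates);

    props
}
```

Keeping the timeout unconditional means the fast path is purely additive: if a node never joins, the election still ends when the timeout elapses.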
Definition of Done Checklist
Author
Reviewer
Acceptance
type/deprecation label & add to the deprecation schedule
type/experimental label & add to the experimental features tracker
Release notes
For clusters with a fixed number of nodes, the operator now sets nifi.cluster.flow.election.max.candidates to the total node count, so cold-start flow election completes as soon as all expected nodes report in: typically seconds instead of waiting for the up-to-5-minute max.wait.time timeout. The timeout stays at NiFi's upstream default of 5 minutes as a fallback, so there's no correctness change if fewer nodes than expected show up.