There is a particular kind of shame that comes with SSHing into a production server to manually apply a manifest. You know it's wrong. The change is untracked, unreviewable, and exists in exactly one place: your terminal history, which you will clear out of guilt. The cluster has now drifted from whatever you have in Git. You tell yourself you'll document it later. You won't.
This is not a character flaw. It's a tooling problem, and it's solvable.
Flux CD is a set of Kubernetes controllers that continuously synchronize your cluster state with configuration stored in a Git repository. The core promise is simple: if it isn't in Git, it doesn't belong in the cluster. If the cluster drifts from Git, Flux corrects it -- automatically, on an interval, without you having to remember. That combination of automatic correction and pull-based operation is what separates a real GitOps workflow from a CI/CD pipeline that just happens to run kubectl apply in the last step.
Flux was accepted to the CNCF on July 15, 2019, and reached Graduated maturity on November 30, 2022 -- the highest tier in the CNCF project lifecycle. Global corporations including Volvo, SAP, and RingCentral rely on Flux in production. Deutsche Telekom uses it to manage approximately 200 Kubernetes clusters with a team of 10 engineers. Cloud providers and vendors including AWS, Microsoft Azure, Red Hat, D2iQ, VMware, and Weaveworks offer Flux-based GitOps to their enterprise customers.
Why Pull-Based Deployment Changes the Security Equation
Before getting into Flux mechanics, it's worth understanding why the pull model is architecturally better than having your CI pipeline push manifests to the cluster.
In a push-based model -- GitHub Actions running kubectl apply, for example -- your pipeline needs credentials with write access to the cluster. Those credentials live in your CI environment. If that CI environment is compromised, an attacker has a direct path into production. The blast radius of a compromised pipeline token is your entire cluster.
Flux's pull model means engineers don't need direct cluster access for deployments. The cluster reaches out to your Git repository on its own schedule. Nothing external needs push access to the cluster API. The security perimeter is dramatically smaller. The principle of least privilege is honored by the architecture, not by careful credential rotation.
The reconciliation is also continuous: if someone with kubectl access manually changes a resource, Flux will revert it on the next sync cycle. This is either a feature or a gotcha depending on whether you knew about it before applying a one-off hotfix.
The GitOps Toolkit: What Is Actually Running in Your Cluster
Flux v2 is not a monolithic operator. It is a collection of purpose-built controllers the project calls the GitOps Toolkit. Understanding what each controller does is essential to understanding how Flux behaves in production.
Source Controller is the entry point for almost everything. It is a Kubernetes operator specialized in artifact acquisition from external sources such as Git, OCI, Helm repositories, and S3-compatible buckets. Its job is to watch your sources, detect changes, and make resulting artifacts available to the other controllers. It does not apply anything -- it only fetches and stores. This separation matters because you can have a GitRepository defined once and have multiple Kustomization and HelmRelease objects all consuming from it.
Kustomize Controller watches Kustomization objects (Flux's kind, not Kustomize's native kind -- more on this naming collision shortly), fetches the artifact from the Source Controller, runs kustomize build against the specified path, and applies the resulting manifests to the cluster. It also handles pruning: if you delete a manifest from Git, Flux will delete the corresponding resource from the cluster. This is opt-in via spec.prune: true and you should be thoughtful about enabling it in environments where accidental deletions are catastrophic.
Helm Controller watches HelmRelease objects. The HelmRelease API allows for controller-driven reconciliation of Helm releases via Helm actions such as install, upgrade, test, uninstall, and rollback. In addition to this, it detects and corrects cluster state drift from the desired release state. Where the Kustomize Controller works with raw or kustomized manifests, the Helm Controller uses the full Helm machinery -- values, lifecycle hooks, and the Helm release history stored in cluster secrets.
Notification Controller is often overlooked during initial setup and then desperately missed when something goes wrong. It handles events coming from external systems such as GitHub, GitLab, Bitbucket, and Harbor, notifying the GitOps toolkit controllers about source changes. It also dispatches events emitted by the toolkit controllers to external systems -- Slack, Microsoft Teams, Discord -- based on event severity and involved objects. Set up alerting before you go to production, not after.
The Reconciliation Loop in Detail
The term "reconciliation loop" is used loosely across the Kubernetes ecosystem. In Flux's context it means something specific, and it is worth tracing through exactly what happens from a Git commit to a running workload.
- The Source Controller polls your `GitRepository` on whatever interval you've configured. The interval is a design choice: a 1-minute interval gives you faster drift correction but more polling traffic; 5-10 minutes is a common production default. When a new commit is detected, it fetches the repository, stores it as a compressed tarball artifact on the Source Controller's local storage, and updates the `GitRepository` status with the new revision.
- The Kustomize Controller (or Helm Controller, depending on your setup) watches for changes to the `GitRepository` status. When it sees a new revision, it retrieves the artifact from the Source Controller via an internal HTTP endpoint -- not by cloning Git directly -- and processes it.
- The controller applies the processed manifests against the cluster API using server-side apply. The kustomize-controller maintains a history of the last 5 reconciliations in `.status.history`, including the digest of the applied manifests, the source revision, timestamps, duration, the status of the most recent reconciliation (`lastReconciledStatus`), and the total number of times a specific digest was reconciled.
- After applying, the controller runs health checks on deployed resources and reports readiness through standard Kubernetes conditions.
The reconciliation interval is not your deployment latency. New commits trigger a reconciliation immediately (via webhook or on the next poll cycle). The interval handles the case where someone has manually changed something that shouldn't have been changed -- it's your drift-correction window.
This also means you need to be careful with resources that Kubernetes itself legitimately modifies. If you're using a HorizontalPodAutoscaler, you'll need to omit the spec.replicas field from your Deployment YAMLs. Otherwise Flux will continually reset the replica count and fight with your HPA -- both will be wrong, and neither will tell you why.
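A minimal sketch of the HPA-compatible pattern, with hypothetical resource names: the Deployment deliberately omits spec.replicas so the autoscaler, not Flux's reconciliation, owns the replica count.

```yaml
# Hypothetical example: the Deployment omits spec.replicas entirely.
# Kubernetes defaults it to 1 on creation; the HPA takes over from there,
# and Flux has no field value in Git to fight over.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
  namespace: my-api
spec:
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: ghcr.io/your-org/my-api:1.0.0
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
  namespace: my-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```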
Kustomization vs HelmRelease: Picking the Right Tool
This is where many teams get confused, partly because the terminology collides in unfortunate ways. There are two completely different things named "Kustomization" in a Flux setup:
- `kustomization.kustomize.config.k8s.io` -- the native Kustomize file that tells `kustomize build` how to assemble manifests
- `kustomization.kustomize.toolkit.fluxcd.io` -- the Flux custom resource that tells the kustomize-controller where to look and what to do
The Flux Kustomization (API group: kustomize.toolkit.fluxcd.io) is a deployment pipeline -- it tells Flux how and where to reconcile. The Kustomize kustomization.yaml file (API group: kustomize.config.k8s.io) is configuration data that builds manifests. They are related but not the same thing. Conflating them is the single most common source of confusion for engineers new to Flux, and the error messages won't help you tell the difference. When you read Flux docs, always check which API group is being referenced.
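To make the distinction concrete, here is a minimal native Kustomize file -- the kind that kustomize build consumes -- which would live inside the directory a Flux Kustomization points at. File names are illustrative.

```yaml
# kustomization.yaml -- the NATIVE Kustomize kind (kustomize.config.k8s.io),
# NOT the Flux custom resource. It only describes how to assemble manifests;
# it knows nothing about Git sources, intervals, or pruning.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-api
resources:
  - deployment.yaml
  - service.yaml
```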
Use a Flux Kustomization when you have plain Kubernetes manifests or Kustomize overlays that you maintain directly. Your team writes YAML, commits it, and wants Flux to apply it. This is the right choice for your own application manifests, namespace definitions, RBAC, custom resource definitions, and anything where you own the source of truth entirely.
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  timeout: 5m0s
```
The path field points to a directory in your GitRepository. If there is a kustomization.yaml in that directory, Flux uses it. If there isn't one, Flux generates one that includes all YAML files in the directory.
prune: true means Flux will delete resources from the cluster when you delete them from Git. This is one of the most powerful -- and most dangerous -- features in Flux. A misguided commit that removes a directory, a wrong spec.path, or an accidental branch switch can trigger deletion of live workloads. In production, verify that you have a cluster backup strategy before enabling pruning, and test pruning behavior on staging first. Some teams enable prune: true everywhere; others explicitly disable it on namespaces that hold stateful workloads. Both are valid positions. The key is intentionality.
Use a HelmRelease when you are deploying third-party software packaged as a Helm chart: cert-manager, ingress-nginx, external-secrets, Prometheus, whatever. You do not own the chart. You configure it via values, and you want Flux to handle upgrades, rollbacks, and drift correction for the Helm release state.
```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 1h
  url: https://charts.jetstack.io
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 30m
  chart:
    spec:
      chart: cert-manager
      version: ">=1.14.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: cert-manager
        namespace: flux-system
  values:
    installCRDs: true
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
      remediateLastFailure: true
```
The version field accepts semver ranges. For staging you might use ">=1.0.0-alpha" to track pre-releases. For production, use tighter constraints like ">=1.14.0 <2.0.0" to stay on a major version until you explicitly choose to upgrade.
The install.remediation.retries and upgrade.remediation blocks are important and frequently omitted in tutorials. By embedding recovery logic directly into HelmRelease definitions, rollbacks become predictable, machine-led actions rather than someone scrambling to remember the right helm rollback invocation at 3 AM. Set remediateLastFailure: true on upgrades so Flux rolls back a failed upgrade rather than leaving the release in a broken state.
Structuring Your Repository So Future You Doesn't Hate Present You
Repository structure is the part of Flux that documentation covers thoroughly but where practical guidance is thin. The question isn't which structure is valid -- several are. The question is which one you will still be able to reason about when the repository has grown for 18 months and three engineers have left the team.
The most durable structure separates concerns by layer, not by application. The Flux project's own reference example uses this pattern:
```
.
├── apps
│   ├── base
│   ├── production
│   └── staging
├── infrastructure
│   ├── configs
│   └── controllers
└── clusters
    ├── production
    └── staging
```
Each cluster directory contains only pointers -- Flux Kustomization objects that reference paths in infrastructure/ and apps/. The actual configuration lives in those directories. Changes to files outside the apps/ directories will not trigger a reconciliation of the apps Kustomization. A change to an infrastructure controller does not trigger reconciliation of your application layer, and vice versa. The dependency boundary is explicit.
infrastructure/controllers/ holds HelmRelease objects for cluster-wide tooling: cert-manager, ingress controllers, secrets operators, monitoring stacks. These are things that namespaces and applications depend on and need to be reconciled before applications start.
infrastructure/configs/ holds ClusterIssuer objects, default NetworkPolicy objects, StorageClass definitions -- anything that configures the infrastructure layer rather than installs software into it.
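A representative file in that layer might be a cert-manager ClusterIssuer. This is a sketch: the file path, issuer name, and email are placeholders.

```yaml
# infrastructure/configs/cluster-issuer.yaml -- illustrative example.
# Configures the cert-manager installed by infrastructure/controllers/;
# it does not install anything itself.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```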
apps/base/ holds application definitions common across all environments: a HelmRelease for your API service with sensible defaults, a Deployment and Service for your frontend. No environment-specific values.
apps/production/ and apps/staging/ hold Kustomize patches that override base values per environment. Production gets different replica counts, different resource limits, different ingress hostnames. Staging might enable pre-release chart versions for early testing.
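A sketch of what apps/production/ might contain, assuming a Deployment named frontend defined in apps/base/ -- names and values are hypothetical.

```yaml
# apps/production/kustomization.yaml -- native Kustomize overlay that
# pulls in the base and patches production-specific values.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base
patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: frontend
      spec:
        replicas: 4
        template:
          spec:
            containers:
              - name: frontend
                resources:
                  limits:
                    memory: 512Mi
    target:
      kind: Deployment
      name: frontend
```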
If you instead put every application in its own top-level directory with its own cluster-specific subdirectories, you will have a repository that is impossible to grep through and requires intimate familiarity to navigate. The layer-first structure keeps "what clusters exist" answerable from one directory and "what runs in production" answerable from another.
Dependency Management: dependsOn and Why You Need It
Order of operations matters. Your application's HelmRelease cannot succeed if cert-manager's CRDs haven't been installed yet. Flux provides dependsOn to express this explicitly:
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-api
  namespace: my-api
spec:
  dependsOn:
    - name: cert-manager
      namespace: cert-manager
    - name: ingress-nginx
      namespace: ingress-nginx
  interval: 15m
  chart:
    spec:
      chart: my-api
      version: ">=0.1.0"
      sourceRef:
        kind: HelmRepository
        name: my-api-charts
        namespace: flux-system
```
Flux will not attempt to reconcile my-api until both cert-manager and ingress-nginx are in a Ready state. This is a hard dependency, not a best-effort ordering. If cert-manager is failing, my-api will stay in a pending state and report why. That behavior is correct -- a partially reconciled cluster with silent missing dependencies is worse than one that has stopped and told you what it's waiting for.
Cross-resource type dependencies are not directly supported. A Kustomization can only declare dependsOn against another Kustomization, and a HelmRelease can only declare dependsOn against another HelmRelease. You cannot write a Kustomization that directly depends on a HelmRelease. The workaround is to wrap your HelmRelease in a Kustomization with spec.healthChecks targeting the HelmRelease, then have your downstream Kustomization depend on that wrapper. It is indirect but reliable.

Circular dependencies will silently deadlock. If Kustomization A depends on B and B depends on A, neither will ever reconcile. Flux does not detect cycles at admission time -- it will simply leave both resources in a pending state with a DependencyNotReady condition indefinitely.
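The wrapper workaround might look like this, with illustrative names and paths: a Kustomization whose health check gates on the cert-manager HelmRelease, and a downstream Kustomization that depends on the wrapper.

```yaml
# Wrapper: reconciles the path containing the cert-manager HelmRelease
# and reports Ready only once the HelmRelease itself is Ready.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/controllers/cert-manager
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  healthChecks:
    - apiVersion: helm.toolkit.fluxcd.io/v2
      kind: HelmRelease
      name: cert-manager
      namespace: cert-manager
---
# Downstream Kustomization now has an indirect dependency on the HelmRelease.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  dependsOn:
    - name: cert-manager
  interval: 10m
  path: ./apps/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```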
The same dependsOn field is available on Kustomization objects. Use it to ensure your infrastructure/configs/ layer reconciles after your infrastructure/controllers/ layer. The explicit dependency graph is one of the genuinely underrated aspects of Flux's design.
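For instance, a configs layer that waits for the controllers layer -- paths follow the repository structure above; object names are illustrative.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-controllers
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/controllers
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-configs
  namespace: flux-system
spec:
  # Will not begin reconciling until infra-controllers reports Ready
  dependsOn:
    - name: infra-controllers
  interval: 10m
  path: ./infrastructure/configs
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```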
Secrets: The Part Everyone Skips Until It's Too Late
GitOps creates an immediate tension with secret management. The whole point is that Git is the source of truth. But you cannot commit plaintext secrets to Git, not even to a private repository. There are two practical approaches that work well with Flux.
SOPS (Secrets OPerationS) encrypts secret files before they are committed to Git. Flux's kustomize-controller has native SOPS decryption support -- it can decrypt secrets using an AWS KMS key, GCP KMS key, Azure Key Vault key, or a local age key before applying them to the cluster. The workflow looks like this:
```shell
# Encrypt a secret values file before committing
$ sops -e --input-type=yaml --output-type=yaml values.yaml > my-values.enc.yaml

# Commit the encrypted file; discard the plaintext
$ git add my-values.enc.yaml && git commit -m "Add encrypted values"

# Flux decrypts transparently at reconciliation time
# using a key stored as a cluster secret during bootstrap
```
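On the cluster side, decryption is enabled per Flux Kustomization via spec.decryption. A sketch, assuming an age private key stored in a cluster secret named sops-age (the secret name is an assumption):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  # kustomize-controller decrypts SOPS-encrypted files before applying
  decryption:
    provider: sops
    secretRef:
      name: sops-age  # assumed name for the secret holding the age key
```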
External Secrets Operator is the alternative for teams already using a secrets manager -- Vault, AWS Secrets Manager, GCP Secret Manager. You commit an ExternalSecret object that describes where to find the secret and how to map it to a Kubernetes Secret. The actual secret value never touches Git. The tradeoff is that it introduces a runtime dependency: your secrets manager must be reachable for the ExternalSecret to sync.
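An ExternalSecret is safe to commit precisely because it contains only references, never values. A sketch, assuming a ClusterSecretStore named aws-secrets has already been configured; the secret paths and key names are hypothetical:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-api-credentials
  namespace: my-api
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets        # assumed store, configured separately
  target:
    name: my-api-credentials # the Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: prod/my-api/db  # path in the external secrets manager
        property: password
```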
Both approaches are valid. SOPS is simpler for teams starting out. External Secrets Operator is better for organizations that already have centralized secret management and compliance requirements around it. Do not let "figuring out secrets" be a reason to delay adopting GitOps -- the SSH-into-production alternative is worse in every dimension.
Suspending and Forcing Reconciliation
Two operational patterns you will need within the first week of running Flux in production:
Suspending reconciliation is how you pause Flux's management of a resource without deleting it. This is essential during maintenance windows, manual debugging, or when you need to make a cluster-side change before the corresponding Git change is ready.
```shell
# Suspend a Kustomization
$ flux suspend kustomization apps

# Suspend a HelmRelease
$ flux suspend helmrelease my-api -n my-api

# Resume when ready
$ flux resume kustomization apps
```
Forcing reconciliation immediately triggers a sync cycle without waiting for the interval to expire. Useful after a commit to verify behavior, or after a change to a dependency. Note the distinction between flux reconcile source and using the --with-source flag on other commands:
```shell
# Force a Kustomization to reconcile using the last known source revision
$ flux reconcile kustomization apps

# Force a HelmRelease
$ flux reconcile helmrelease my-api -n my-api

# Force the source to re-fetch from Git FIRST, then reconcile downstream
# Use this after a commit to pick up changes immediately
$ flux reconcile source git flux-system

# Equivalent: reconcile a kustomization AND re-fetch its source in one step
$ flux reconcile kustomization apps --with-source
```
flux reconcile kustomization apps tells the kustomize-controller to re-process whatever artifact the source-controller already has cached. It does not re-fetch from Git. If you've just pushed a commit and want it picked up immediately, run flux reconcile source git flux-system first, or use the --with-source flag to do both in one step. If you skip the source fetch and the poll interval hasn't fired yet, you'll be reconciling against the old revision.
Observability: Not Optional
A common mistake in early Flux setups is treating it as a background process and finding out something has been failing for three days only when a deployment isn't where you expected it. Flux exposes everything you need -- the question is whether you wire it up.
For production, you want alerts. The Notification Controller makes this straightforward:
```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: ops-alerts
  secretRef:
    name: slack-webhook-url
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: flux-system-alerts
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: "*"
    - kind: HelmRelease
      name: "*"
```
The wildcard "*" on eventSources gives you alerts for all Kustomizations and HelmReleases that fail reconciliation. That's the minimum viable alerting setup.
For deeper visibility, Flux v2.7 introduced OpenTelemetry tracing for Kustomization and HelmRelease reconciliation. This is configured through the Notification Controller using a Provider of type otel -- it is separate from the Slack/Teams alerting setup above and requires its own Provider resource. Once configured, the notification-controller converts Flux events into OTEL spans with proper trace relationships based on the Flux object hierarchy: source objects (GitRepository, HelmChart, OCIRepository, Bucket) create root spans, while Kustomization and HelmRelease objects create child spans within the same trace. If your organization already has an observability stack that accepts OTEL traces, plugging Flux into it gives you a complete picture of reconciliation timing and dependency ordering -- which is especially useful when you have long dependsOn chains and need to understand where time is being spent across a bootstrap sequence.
```yaml
# Requires Flux v2.7+ -- configure an OTEL provider in addition to your alert providers
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: jaeger
  namespace: flux-system
spec:
  type: otel
  address: http://jaeger-collector.jaeger:4318/v1/traces
---
# Then create an Alert pointing to this provider with the sources you want to trace
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: flux-otel-tracing
  namespace: flux-system
spec:
  providerRef:
    name: jaeger
  eventSeverity: info
  eventSources:
    - kind: GitRepository
      name: "*"
    - kind: Kustomization
      name: "*"
    - kind: HelmRelease
      name: "*"
```
Common Failure Modes and How to Debug Them
Flux is reconciling but nothing is changing. Check whether the resource is suspended: flux get kustomization <name>. Check the source: flux get source git flux-system. Look at the controller logs: flux logs --kind=Kustomization --name=<name>.
A HelmRelease is stuck in a failed state. The helm-controller will retry up to the configured number of retries, then stop. Check flux get helmrelease <name> -n <namespace> for the error message. Often it's a missing dependency, an invalid values override, or a chart version that doesn't exist. Once you've fixed the underlying issue, force a reconciliation.
Flux keeps overwriting a field you need to change in-cluster. This is expected behavior, not a bug. Either add the field to your Git-managed manifests, or manage the field through a tool that Flux controls (an HPA for replicas, for example). If you genuinely need a persistent in-cluster override that shouldn't be in Git, you have a GitOps design problem, not a Flux problem.
A Kustomization fails with kustomization path not found. The spec.path in your Flux Kustomization doesn't match the directory structure in your repository. Paths are relative to the root of the GitRepository, not to the kustomization.yaml file. Double-check the path and verify it exists in the repository at the correct revision.
```shell
# Check status of all Flux objects
$ flux get all -A

# Check logs for a specific Kustomization
$ flux logs --kind=Kustomization --name=apps --level=error

# Describe a HelmRelease for detailed status
$ kubectl describe helmrelease my-api -n my-api

# Re-fetch the source from Git, then reconcile downstream
$ flux reconcile source git flux-system
```
What Breaks First: Operational Failures You Won't Find in the Docs
Documentation covers the happy path. Production punishes you for not knowing the failure modes. These are the situations that trip up teams in their first weeks with Flux in production, none of which are prominently documented anywhere.
Your Git provider has an outage
This one surprises people: when GitHub or GitLab goes down, your running cluster is fine. Flux's source-controller maintains a local cache of all external sources, meaning drift detection and correction continue against the last successfully fetched artifact. Your workloads keep running at the last reconciled state. What stops working is your ability to push new changes into the cluster -- any pending commits will not be picked up until the source becomes reachable again. The GitRepository will show a False Ready condition with a fetch error, but it will not touch your running workloads. This is one of the most significant operational resilience properties of the pull model, and it is worth communicating to leadership before someone interprets a GitHub outage as a deployment emergency.
An engineer does their first kubectl edit
Someone unfamiliar with GitOps edits a ConfigMap directly in the cluster to test a configuration change. It looks like it worked. Two minutes later, Flux silently reverts it on the next reconciliation cycle. They edit it again. Flux reverts it again. They escalate, convinced something is broken. This scenario plays out in nearly every team adopting Flux for the first time. The fix is not technical -- it is cultural and communicative. Document the reconciliation reversion behavior before onboarding any engineer who will touch the cluster, and add a team convention about using flux suspend before making intentional in-cluster changes during debugging. The command is fast; the learning curve without it is not.
A long dependsOn chain on bootstrap
The first time you bootstrap a cluster with a long dependency chain -- infrastructure controllers, then configs, then apps -- the total time can shock you. Each layer waits for the previous layer's health checks to pass before starting. If cert-manager takes 90 seconds to become ready, and then ingress-nginx takes another 60 seconds, and then your app layer starts, you might be 5-10 minutes into a bootstrap before apps begin reconciling. This is correct and expected behavior, but teams unfamiliar with it interpret it as Flux being slow or broken and start manually intervening, which makes things worse. Accept that bootstrap is not instant, monitor it with flux get all -A --watch, and let the dependency graph do its job.
prune deletes more than you expected
You rename a directory in your repository. You push the commit. Flux sees that the resources previously at the old path are now absent, and with prune: true, deletes the corresponding cluster resources. It then creates new ones at the new path. Depending on the resources involved, this can mean brief downtime or worse. The failure mode is not that Flux behaved incorrectly -- it behaved exactly as configured. The failure is that the operator did not anticipate that a rename would look identical to a deletion from Flux's perspective. When renaming paths or restructuring your repository, always check which resources would be affected by pruning before pushing. flux diff kustomization <name> can show you what Flux intends to apply against the current cluster state.
```shell
# Preview what Flux would apply/delete against live cluster state
$ flux diff kustomization apps --path ./apps/production

# Watch the full bootstrap sequence in real time
$ flux get all -A --watch

# See the last 5 reconciliation history entries for a Kustomization
$ kubectl get kustomization apps -n flux-system -o jsonpath='{.status.history}' | jq
```
Multi-Tenancy: Flux on Shared Clusters
If your cluster serves multiple teams or projects, multi-tenancy is not optional -- it is a security requirement. Flux has first-class multi-tenancy support, but it requires explicit configuration. The default installation does not enforce tenant isolation.
Flux's multi-tenancy model is built on standard Kubernetes RBAC. Each tenant is given a namespace and a ServiceAccount with appropriately scoped permissions. The kustomize-controller can impersonate that ServiceAccount when reconciling a tenant's Kustomization, which means the reconciliation will fail if the tenant's manifests try to create resources outside their permitted namespace. This is controlled via the spec.serviceAccountName field on a Kustomization.
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-alpha-apps
  namespace: team-alpha
spec:
  interval: 10m
  path: ./apps
  prune: true
  # flux-reconciler ServiceAccount has RBAC limited to the team-alpha namespace
  serviceAccountName: flux-reconciler
  sourceRef:
    kind: GitRepository
    name: team-alpha-repo
```
For stricter isolation, platform admins can start kustomize-controller and helm-controller each with the --no-cross-namespace-refs=true flag. This flag must be set on each controller independently -- setting it on kustomize-controller alone does not enforce the restriction on HelmRelease objects. Once set on a given controller, it prevents any resource managed by that controller from referencing sources in a different namespace -- tenants cannot point their workloads at sources owned by another team. Without this flag, a tenant could potentially craft a manifest that references a privileged GitRepository in the flux-system namespace.
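With a bootstrapped cluster, controller flags are typically set by patching the Flux manifests in the cluster directory rather than editing deployments by hand. A sketch of a kustomize patch that adds the flag to both controllers, assuming the standard bootstrap file layout:

```yaml
# clusters/production/kustomization.yaml, alongside the bootstrap manifests.
# The JSON patch appends --no-cross-namespace-refs=true to the container
# args of both controllers in one pass.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |-
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --no-cross-namespace-refs=true
    target:
      kind: Deployment
      name: "(kustomize-controller|helm-controller)"
```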
Argo CD handles multi-tenancy through its Projects abstraction, which provides a user-friendly GUI for defining team boundaries and comes with built-in SSO integration. Flux's approach is more surgical: it delegates everything to Kubernetes RBAC, which means less overhead per cluster but more manual configuration per tenant. Teams that already have strong Kubernetes RBAC patterns in place typically find Flux's model cleaner. Teams new to Kubernetes who want a centralized access control layer with a visual interface typically find Argo CD easier to operate at the multi-team level.
Flux vs Argo CD: An Honest Comparison
Any team evaluating Flux will encounter Argo CD. Both are CNCF Graduated projects. Both solve the same core problem. The choice between them is real, and the answer is not "use both" -- running two GitOps operators on the same cluster creates competing reconciliation loops and support complexity that nobody wants.
The most accurate characterization of the difference is architectural philosophy. Argo CD is application-centric and ships with a rich web UI, an internal user and RBAC system independent from Kubernetes, and an Application CRD that packages source, destination, and sync policy into one resource. It is batteries-included. Flux is controller-centric and Kubernetes-native: everything is a CRD reconciled by a specialized controller, access control is Kubernetes RBAC, and there is no built-in UI. It is composable.
This produces concrete tradeoffs:
- If your team values a visual dashboard for real-time deployment state, drift visualization, and the ability to let non-Kubernetes engineers monitor deployments without learning `kubectl`, Argo CD is a significant advantage. Flux has no native UI. Third-party UIs exist, but they are not the default experience.
- If your team operates large-scale multi-cluster infrastructure and wants modular, resource-efficient controllers that can be tuned independently, Flux's architecture scales better. Deutsche Telekom manages approximately 200 Kubernetes clusters with Flux and a team of 10 engineers -- a data point from the CNCF that reflects the tool's production ceiling.
- If you are already invested in the Argo ecosystem -- Argo Workflows, Argo Rollouts, Argo Events -- Argo CD's integration with that stack is natural. Flux's progressive delivery story runs through Flagger, a separate project.
- If you are building on Kubernetes for the first time and want to get started quickly with a tool that is approachable, Argo CD's UI reduces the learning curve considerably for engineers who are not yet fluent with CRDs and controller patterns.
Both tools graduated from the CNCF within a week of each other in late 2022, a deliberate acknowledgment that both are production-ready and that the choice between them is legitimately a matter of team fit rather than maturity. Per AWS prescriptive guidance, the selection should be driven by whether you require lighter-weight solutions and deep CNCF integration (Flux) versus visual management and application-centric workflows (Argo CD).
Bootstrapping: Getting Flux Into the Cluster
The flux bootstrap command installs the controllers, creates a deploy key on your Git provider, commits the Flux manifests to your repository, and sets up the initial GitRepository and Kustomization that watches itself. Once bootstrapped, updates to the Flux controllers themselves go through GitOps -- Flux manages Flux.
```shell
$ export GITHUB_TOKEN=<your-token>
$ flux bootstrap github \
    --owner=your-org \
    --repository=fleet-infra \
    --branch=main \
    --path=clusters/production \
    --personal

# Verify the install
$ flux check
$ flux get all -A
```
The bootstrap command creates a deploy key with read-only access on GitHub so the cluster can pull changes. Read-only access is the secure default for standard reconciliation and is sufficient for everything Flux does during normal operation -- the cluster's outbound connection to GitHub does not need write access to apply manifests. Note that the bootstrap command itself requires a token with write access to commit the Flux manifests to your repository. Some teams may configure a read-write key if they also need Flux to update commit statuses or push back to the repository, but this is not required for basic GitOps workflows.
The image automation controllers -- which update manifests when new container images are published -- are a separate, optional feature that reached GA status in Flux v2.7. They require a separate write-access deploy key or token, configured independently from the bootstrap deploy key. If you later enable image automation and assume the bootstrap key covers it, you will get authentication errors that are not always obvious to trace back to the key scoping issue. Provision the write-access credentials as a separate step when enabling image automation.
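To make the separation concrete, here is a sketch of the three image automation resources, assuming a second GitRepository (`fleet-infra-write`, hypothetical) whose `secretRef` holds the write-capable credentials. Names and the image reference are placeholders; older clusters may still serve these APIs at `v1beta2` rather than the GA `v1`:

```yaml
apiVersion: image.toolkit.fluxcd.io/v1
kind: ImageRepository
metadata:
  name: app
  namespace: flux-system
spec:
  image: ghcr.io/your-org/app    # hypothetical image
  interval: 5m
---
apiVersion: image.toolkit.fluxcd.io/v1
kind: ImagePolicy
metadata:
  name: app
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: app
  policy:
    semver:
      range: 1.x
---
apiVersion: image.toolkit.fluxcd.io/v1
kind: ImageUpdateAutomation
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: fleet-infra-write   # NOT the bootstrap source -- its secret holds the write-access key
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxcdbot
        email: fluxcdbot@users.noreply.github.com
    push:
      branch: main
  update:
    path: ./clusters/production
    strategy: Setters
```

Keeping the write-capable GitRepository separate means the read-only bootstrap source stays untouched, and revoking image automation later is a matter of deleting one secret.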
Flux managing Flux is elegant, but it cuts both ways: a bad commit to the Flux manifests can prevent Flux from reconciling itself, leaving the cluster unable to self-heal. If you push a broken gotk-components.yaml or a malformed gotk-sync.yaml to your bootstrap path, Flux will attempt to apply it, potentially break its own controllers, and then be unable to reconcile the fix you push next. Keep a copy of a known-good bootstrap state somewhere external to the repo, and test Flux upgrades on staging before committing them to production's bootstrap path.
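When that self-healing loop does break, the escape hatch is to apply the controller manifests directly, bypassing Git until the controllers are healthy again. A recovery sketch (the backup path and version are placeholders for whatever you keep externally):

```shell
# Re-apply a known-good copy of the component manifests kept outside the repo
kubectl apply -f backup/gotk-components.yaml

# Alternatively, regenerate the components for a pinned Flux version
flux install --version=v2.7.0 --export > gotk-components.yaml
kubectl apply -f gotk-components.yaml

# Once the controllers are running again, trigger the self-managing
# Kustomization to pull the fix you committed to Git
flux reconcile kustomization flux-system --with-source
```

This only works if the fix is already in Git; the manual apply restores the controllers, and the final reconcile hands control back to GitOps.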
The Actual Shift
The hardest part of adopting Flux is not the tooling. The tooling is well-documented, actively maintained, and has a large enough community that your specific problem almost certainly has a solved example somewhere in the Flux GitHub issues or the #flux channel on the CNCF Slack at slack.cncf.io.
The hard part is the operational discipline. GitOps only works if Git is actually the source of truth. That means no more kubectl apply -f when you're in a hurry. No more editing a ConfigMap directly in the cluster to test something. No more "I'll document the change later." Every change goes through a commit. Every commit gets reviewed. Every deployment is auditable.
Flux is a system designed to detect and correct drift deterministically. What it cannot do is make you use it correctly. That part is on the team.
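The drift-correction behavior itself is a few fields on the Kustomization. A sketch (names and path hypothetical) of the settings that make manual cluster edits short-lived:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m      # every 10 minutes, Git state is re-applied; manual edits to managed fields are reverted
  prune: true        # objects deleted from Git are deleted from the cluster
  wait: true         # reconciliation fails loudly if applied resources never become ready
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  path: ./clusters/production/apps
```

With this in place, the hot-fix you apply by hand survives at most one interval -- which is precisely the point.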
In mature distributed systems, drift is not an anomaly -- it is an expected outcome of continuous change, controller behavior, and human intervention. The real risk is not drift itself, but systems that are not designed to detect and correct it deterministically.
Sources
- Flux CD Official Documentation -- fluxcd.io
- HelmReleases CRD Reference -- fluxcd.io
- Kustomizations CRD Reference -- fluxcd.io
- Flux v2 Kustomize-Helm Example Repository -- GitHub
- Flux v2.7 GA Release Notes -- fluxcd.io
- Flux FAQ -- fluxcd.io
- Flux CNCF Project Page -- cncf.io
- Moving Secure GitOps Forward with Flux -- CNCF Blog
- Flux Graduates from CNCF Incubator -- cncf.io
- FluxCD Multi-cluster Architecture (Git provider outage behavior) -- Stefan Prodan, Medium
- Argo CD and Flux Use Cases -- AWS Prescriptive Guidance
- Flux Installation and Bootstrap -- fluxcd.io
- GitRepository Source API Reference -- fluxcd.io