Disaster Recovery And Drift Management

Problem Statement

Today the IDP writes Crossplane managed resources directly to the management AKS cluster. That means the cluster currently acts as both:

the reconciliation plane,
and the only durable record of desired infrastructure state.

That is not sufficient for disaster recovery or for drift-aware lifecycle management, because losing the cluster also loses the record of what was requested.

If the management cluster is impaired or destroyed:

existing Azure resources usually continue to exist,
Crossplane stops reconciling,
the current portal loses authoritative desired state for dynamically created resources,
and the UI cannot reliably distinguish healthy resources from drifted or orphaned resources.

Target Outcome

The platform should be able to:

persist intended resource configuration outside Kubernetes,
observe live Azure resource configuration independently of Crossplane availability,
classify every managed resource into a small set of actionable states,
offer a safe sync-back flow through Crossplane,
restore the control plane quickly with Velero,
fall back to adoption of existing Azure resources when restored Kubernetes state is incomplete.

Core Architecture

The recommended control model is:

Crossplane remains the reconciler.
The IDP owns durable desired state.
Azure becomes the cloud-observed truth for drift detection.
The UI reads from a projection model, not directly from raw Crossplane objects.

Control Planes

The design separates four concerns.

1. Desired State Store

Persist desired state before the platform writes to the Kubernetes API.

This store should hold:

workflow request identity,
desired resource specification,
ownership metadata,
Crossplane binding metadata,
Azure binding metadata,
remediation and lifecycle history.
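
The fields above could be captured in a single serializable record. The following is a minimal sketch, assuming a JSON layout; the class name, field names, and sample values are hypothetical and are not the platform's actual schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DesiredStateRecord:
    """One durable record per workflow request (hypothetical schema)."""

    request_id: str                                          # workflow request identity
    workflow_type: str
    desired_spec: dict                                       # desired resource specification
    ownership: dict                                          # ownership metadata
    crossplane_binding: dict = field(default_factory=dict)   # apiVersion, kind, object name
    azure_binding: dict = field(default_factory=dict)        # ARM ID, subscription, resource group
    history: list = field(default_factory=list)              # remediation and lifecycle events

    def to_json(self) -> str:
        """Serialize with sorted keys so stored records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
record = DesiredStateRecord(
    request_id="req-0001",
    workflow_type="storage-account",
    desired_spec={"location": "westeurope", "minimum_tls_version": "TLS1_2"},
    ownership={"team": "payments"},
)
payload = record.to_json()
```

The bindings default to empty because the record is written before the Kubernetes apply, when the Crossplane object and the ARM ID may not exist yet.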

2. Crossplane Reconciliation Plane

Crossplane on AKS remains responsible for applying and reconciling Azure resources. It should not remain the only source of truth for what the developer asked for.

3. Azure Observation Plane

Observe Azure independently of cluster health.

For the initial storage-account scope:

use Azure Resource Graph for broad inventory,
use targeted ARM reads for detailed field comparison,
correlate resources using ARM ID and IDP ownership tags.
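
The correlation step can be sketched without any Azure SDK calls. The example below assumes inventory rows are dicts with an `id` and a `tags` mapping, and that desired-state records carry `request_id` and an optional `arm_id`; the tag key `idp-request-id` is a hypothetical name. ARM IDs are compared case-insensitively.

```python
def correlate(desired_records, azure_resources, tag_key="idp-request-id"):
    """Pair Azure resources with desired-state records.

    Match on ARM ID first, then fall back to the ownership tag.
    Returns (matches, unmatched_azure_resources).
    """
    by_arm_id = {r["arm_id"].lower(): r for r in desired_records if r.get("arm_id")}
    by_request_id = {r["request_id"]: r for r in desired_records}

    matches, unmatched = [], []
    for res in azure_resources:
        record = by_arm_id.get(res["id"].lower())
        if record is None:
            # Tags may be missing or null on resources the platform never touched.
            tag_value = (res.get("tags") or {}).get(tag_key)
            record = by_request_id.get(tag_value)
        if record is None:
            unmatched.append(res)
        else:
            matches.append((record, res))
    return matches, unmatched


# Illustrative data only.
PREFIX = "/subscriptions/000/resourceGroups/rg-a/providers/Microsoft.Storage/storageAccounts/"
desired = [
    {"request_id": "req-1", "arm_id": PREFIX + "acct1"},
    {"request_id": "req-2", "arm_id": None},  # ARM ID not recorded yet
]
observed = [
    {"id": (PREFIX + "acct1").lower(), "tags": {}},
    {"id": PREFIX + "acct2", "tags": {"idp-request-id": "req-2"}},
    {"id": PREFIX + "stray", "tags": None},
]
matches, unmatched = correlate(desired, observed)
```

Resources left in `unmatched` are candidates for the Cloud-only state rather than confirmed orphans; a real implementation would likely also check that the ownership tags identify this platform before acting on them.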

4. UI Read Model

The UI should read a projection that joins:

desired state,
Crossplane state,
Azure-observed state,
classification and drift details.

Classification Model

The first-class platform states are:

In sync
Pending
Drifted
Cloud-only
Desired-only
Control plane unavailable

These states should be computed from three inputs:

desired state,
Crossplane state,
Azure-observed state.

Recommended Interpretation

In sync: desired state exists, Azure resource exists, owned fields match.
Pending: the resource is provisioning, deleting, or reconciling.
Drifted: Azure differs from desired state on owned fields.
Cloud-only: Azure resource exists but no desired-state record or Crossplane binding exists.
Desired-only: desired state exists but Azure resource does not.
Control plane unavailable: the management cluster or Crossplane API path is unavailable. This is a platform health state and should also be surfaced independently of per-resource drift.
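
One way to turn these interpretations into a function is sketched below. The precedence between states is an assumption of this sketch, not something the design fixes: drift is still computed from desired and Azure state while the control plane is down, and Control plane unavailable is returned per resource only when Crossplane state would be needed to decide. The parameter names are hypothetical.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    IN_SYNC = "In sync"
    PENDING = "Pending"
    DRIFTED = "Drifted"
    CLOUD_ONLY = "Cloud-only"
    DESIRED_ONLY = "Desired-only"
    CONTROL_PLANE_UNAVAILABLE = "Control plane unavailable"


def classify(desired: Optional[dict], azure: Optional[dict],
             crossplane_ready: Optional[bool], control_plane_up: bool,
             owned_fields: tuple) -> State:
    """Classify one resource.

    crossplane_ready is None when no Crossplane object is bound,
    False while it is provisioning, deleting, or reconciling.
    """
    if desired is None:
        if azure is not None and crossplane_ready is None:
            return State.CLOUD_ONLY
        raise ValueError("combination not covered by the classification model")
    if control_plane_up and crossplane_ready is False:
        return State.PENDING
    if azure is None:
        # Without Crossplane we cannot tell "still provisioning" from "missing".
        return State.DESIRED_ONLY if control_plane_up else State.CONTROL_PLANE_UNAVAILABLE
    drifted = any(desired.get(name) != azure.get(name) for name in owned_fields)
    return State.DRIFTED if drifted else State.IN_SYNC


owned = ("location", "minimum_tls_version")
want = {"location": "westeurope", "minimum_tls_version": "TLS1_2"}
state = classify(want, {"location": "westeurope", "minimum_tls_version": "TLS1_0"},
                 crossplane_ready=True, control_plane_up=True, owned_fields=owned)
```

Keeping the function pure (no API calls inside) makes each row of the interpretation list testable in isolation.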

Owned Fields

Do not compare every Azure field.

For the initial Azure Storage Account scope, compare only the fields the IDP deliberately owns, such as:

location,
account kind,
account tier,
replication type,
access tier,
minimum TLS version,
public network access,
blob public access,
shared access key enablement,
cross-tenant replication enablement,
selected platform-owned tags.

Provider-generated fields, timestamps, endpoints, and other computed values should not be treated as drift.
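
A comparison restricted to owned fields might look like the sketch below. The snake_case field names and the two tag keys are hypothetical stand-ins for whatever normalized shape the platform uses for desired and observed state.

```python
OWNED_FIELDS = (
    "location", "account_kind", "account_tier", "replication_type", "access_tier",
    "minimum_tls_version", "public_network_access", "blob_public_access",
    "shared_access_key_enabled", "cross_tenant_replication_enabled",
)
OWNED_TAG_KEYS = ("idp-request-id", "idp-owner")  # hypothetical platform-owned tags


def owned_field_drift(desired: dict, observed: dict) -> dict:
    """Return {field: (desired_value, observed_value)} for owned fields that differ."""
    delta = {}
    for name in OWNED_FIELDS:
        if desired.get(name) != observed.get(name):
            delta[name] = (desired.get(name), observed.get(name))
    desired_tags = desired.get("tags") or {}
    observed_tags = observed.get("tags") or {}
    for key in OWNED_TAG_KEYS:
        if desired_tags.get(key) != observed_tags.get(key):
            delta[f"tags.{key}"] = (desired_tags.get(key), observed_tags.get(key))
    return delta


# Illustrative data: one owned field differs; computed fields and foreign tags are ignored.
desired = {"location": "westeurope", "minimum_tls_version": "TLS1_2",
           "tags": {"idp-request-id": "req-1"}}
observed = {"location": "westeurope", "minimum_tls_version": "TLS1_0",
            "primary_endpoints": {"blob": "https://example.invalid/"},
            "tags": {"idp-request-id": "req-1", "cost-center": "42"}}
delta = owned_field_drift(desired, observed)
```

Because only listed names are read, a new computed field returned by Azure cannot produce a false drift signal.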

Sync-Back Strategy

The normal remediation path should be:

show the drift delta,
create a remediation request,
update desired state,
patch Crossplane-managed desired state,
let Crossplane reconcile Azure.

Direct Azure mutation should not be the standard sync-back mechanism.

This keeps the platform lifecycle coherent and preserves auditability.
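
The remediation path can be sketched as a small function with the store and the Kubernetes client injected as callables. Everything named here is hypothetical. The sketch also assumes that remediation re-asserts the stored desired value for each drifted field, and that requests are queued rather than rejected when the control plane is down.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RemediationRequest:
    request_id: str
    delta: dict                      # {field: (desired_value, observed_value)}
    status: str = "created"
    log: list = field(default_factory=list)


def sync_back(req: RemediationRequest,
              record_in_store: Callable[[str, dict], None],
              patch_crossplane: Callable[[str, dict], None],
              control_plane_up: bool) -> RemediationRequest:
    """Route a remediation through the desired-state store and Crossplane."""
    # Assumption: the target is the stored desired value of each drifted field.
    target = {name: want for name, (want, _have) in req.delta.items()}

    if not control_plane_up:
        req.status = "queued"
        req.log.append("control plane unavailable; remediation queued")
        return req

    record_in_store(req.request_id, target)    # durable record first
    req.log.append("desired state updated")
    patch_crossplane(req.request_id, target)   # Crossplane then reconciles Azure
    req.log.append("Crossplane object patched")
    req.status = "awaiting-reconcile"
    return req


# Usage with in-memory fakes standing in for the real store and Kubernetes client.
calls = []
delta = {"minimum_tls_version": ("TLS1_2", "TLS1_0")}
done = sync_back(RemediationRequest("req-1", delta),
                 lambda rid, t: calls.append(("store", rid, t)),
                 lambda rid, t: calls.append(("crossplane", rid, t)),
                 control_plane_up=True)
queued = sync_back(RemediationRequest("req-2", delta),
                   lambda rid, t: calls.append(("store", rid, t)),
                   lambda rid, t: calls.append(("crossplane", rid, t)),
                   control_plane_up=False)
```

Nothing in the function talks to Azure directly, which mirrors the rule that direct mutation is not the standard path.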

Disaster Recovery Strategy

Primary Path: Velero Restore

Velero should be treated as the primary restore path for cluster-side desired state.

Back up at minimum:

crossplane-system,
idp-system,
Crossplane CRDs and related cluster-scoped objects,
Provider objects,
ProviderConfig and ClusterProviderConfig,
provider secrets,
dynamically created managed resources,
connection secrets.

Restore order should be:

rebuild AKS,
reinstall Crossplane and providers,
restore provider secrets and configs,
restore Crossplane runtime objects,
restore dynamic managed resources,
restore the IDP application.

Fallback Path: Adopt Existing Azure Resources

If dynamic Crossplane objects are missing after a disaster:

rebuild AKS,
reinstall Crossplane and providers,
restore provider auth,
rediscover Azure resources,
recreate managed resources in observe-only mode,
validate identity and status,
promote to normal management.

This is slower and more operationally expensive than a Velero restore, so it should remain the fallback, not the primary plan.
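
The observe-only and promotion steps of the fallback path could be expressed as manifest builders like the ones below. This is a sketch under assumptions: it presumes the installed Crossplane and provider versions support the `managementPolicies` field, and the default `apiVersion` and `kind` are guesses for an Azure storage account that should be checked against the installed provider. The function names are hypothetical.

```python
import copy


def observe_only_manifest(name: str, external_name: str, resource_group: str,
                          provider_config: str,
                          api_version: str = "storage.azure.upbound.io/v1beta1",  # assumption
                          kind: str = "Account") -> dict:                          # assumption
    """Build a managed resource that observes an existing Azure resource without changing it."""
    return {
        "apiVersion": api_version,
        "kind": kind,
        "metadata": {
            "name": name,
            # Binds the object to the existing Azure resource by its external name.
            "annotations": {"crossplane.io/external-name": external_name},
        },
        "spec": {
            "managementPolicies": ["Observe"],
            "forProvider": {"resourceGroupName": resource_group},
            "providerConfigRef": {"name": provider_config},
        },
    }


def promote(manifest: dict) -> dict:
    """Return a copy switched to full management; the input is left untouched."""
    promoted = copy.deepcopy(manifest)
    promoted["spec"]["managementPolicies"] = ["*"]
    return promoted


# Illustrative names only.
observed = observe_only_manifest("acct1", "acct1", "rg-a", "default")
managed = promote(observed)
```

Before promotion, the `forProvider` block would likely need to be filled in from the desired-state store, so that Crossplane does not reconcile Azure toward an incomplete specification.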

Required Metadata

For each provisioned resource, the IDP should preserve enough metadata to restore or adopt it safely:

request ID,
workflow type,
desired specification,
Crossplane apiVersion, kind, and object name,
external-name annotation where relevant,
provider-config reference,
Azure ARM ID,
subscription ID,
resource group,
location,
ownership metadata,
remediation history.

The platform should also tag Azure resources with IDP ownership metadata so Cloud-only resources can be detected safely.
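
Several of the fields above (subscription ID, resource group, resource name) are embedded in the ARM ID, so they can be cross-checked against it. A minimal parser for storage-account IDs, with a hypothetical function name and a placeholder subscription ID, might look like this:

```python
def parse_storage_account_arm_id(arm_id: str) -> dict:
    """Split a storage-account ARM ID into subscription, resource group, and name.

    Expected shape:
    /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<name>
    """
    parts = arm_id.strip("/").split("/")
    if len(parts) != 8:
        raise ValueError(f"unexpected ARM ID shape: {arm_id!r}")
    # Segment keywords are compared case-insensitively.
    keywords = [parts[0].lower(), parts[2].lower(), parts[4].lower(),
                parts[5].lower(), parts[6].lower()]
    expected = ["subscriptions", "resourcegroups", "providers",
                "microsoft.storage", "storageaccounts"]
    if keywords != expected:
        raise ValueError(f"not a storage-account ARM ID: {arm_id!r}")
    return {"subscription_id": parts[1], "resource_group": parts[3], "name": parts[7]}


# Placeholder subscription ID for illustration.
parsed = parse_storage_account_arm_id(
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-a"
    "/providers/Microsoft.Storage/storageAccounts/acct1"
)
```

Storing the separate fields alongside the ARM ID is still useful for querying; the parser only provides a consistency check during restore or adoption.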

Current Milestone 1 Runtime Configuration

The first implementation slice adds durable submission persistence before the platform writes to Kubernetes.

The current runtime supports two persistence adapters:

Azure Blob Storage, when both of these environment variables are configured:
- IDP_DESIRED_STATE_AZURE_BLOB_ACCOUNT_URL
- IDP_DESIRED_STATE_AZURE_BLOB_CONTAINER

Filesystem fallback, when Blob configuration is not present.

The filesystem fallback is useful for local development and test execution. It should not be treated as the long-term production source of truth.

An optional filesystem path override is also supported:

IDP_DESIRED_STATE_FILESYSTEM_PATH
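
The selection rule can be illustrated as a small function. This is a sketch, not the runtime's actual code: the function name, the returned dict shape, and the default path are hypothetical, and it assumes that a partial Blob configuration (only one of the two variables set) falls back to the filesystem.

```python
import os
from typing import Mapping

DEFAULT_FILESYSTEM_PATH = "./.idp-desired-state"  # hypothetical default


def select_adapter(env: Mapping[str, str] = os.environ) -> dict:
    """Pick the persistence adapter from environment variables."""
    account_url = env.get("IDP_DESIRED_STATE_AZURE_BLOB_ACCOUNT_URL")
    container = env.get("IDP_DESIRED_STATE_AZURE_BLOB_CONTAINER")
    if account_url and container:
        return {"adapter": "azure-blob", "account_url": account_url, "container": container}
    # Assumption: anything short of both Blob variables means filesystem fallback.
    path = env.get("IDP_DESIRED_STATE_FILESYSTEM_PATH") or DEFAULT_FILESYSTEM_PATH
    return {"adapter": "filesystem", "path": path}


# Illustrative environments.
blob = select_adapter({
    "IDP_DESIRED_STATE_AZURE_BLOB_ACCOUNT_URL": "https://example.blob.core.windows.net",
    "IDP_DESIRED_STATE_AZURE_BLOB_CONTAINER": "desired-state",
})
local = select_adapter({"IDP_DESIRED_STATE_FILESYSTEM_PATH": "/tmp/idp-state"})
```

In production it may be safer to fail startup on a partial Blob configuration instead of silently falling back, since the filesystem adapter is not meant to be the long-term source of truth.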

Workstreams

The implementation is split into six workstreams.

Source of truth
Submission pipeline
Azure observation
Classification engine
Sync-back remediation
UI read model and health

Recommended Order

add durable desired-state storage,
persist workflow intent before Crossplane apply,
add Azure observation for storage accounts,
compute classification and drift,
expose status in the UI,
add sync-back,
validate Velero restore and adoption fallback.

Initial Scope

The first implementation slice should cover only Azure Storage Accounts because the repository already has:

a storage workflow,
a storage resource list,
a storage resource details page,
a direct Crossplane API integration for that resource type.

That first slice should deliver:

durable desired-state persistence,
Azure live-state observation,
drift classification,
drift details in the UI,
sync-back through Crossplane.

Milestone Baseline

Milestone 1: Durable Intent

desired state is stored before Kubernetes apply,
resource requests have stable IDs,
Crossplane and Azure binding metadata are preserved.

Milestone 2: Cloud Observation

Azure storage accounts are observable from the platform,
list and detail views can show Azure-backed status,
orphaned cloud resources can be surfaced.

Milestone 3: Drift Classification

the platform can classify storage accounts correctly,
the UI can show field-level drift.

Milestone 4: Safe Remediation

users can request sync-back,
sync-back routes through Crossplane,
remediation is blocked or queued when the control plane is unavailable.

Milestone 5: DR Readiness

Velero covers the required Crossplane and IDP state,
restore order is documented and validated,
adoption fallback is documented for partial recovery.

Current Recommendation

For this repository, the practical next steps are:

implement durable desired-state persistence,
implement Azure observation for storage accounts,
document and validate Velero restore scope,
then build classification and sync-back on top of those foundations.
