Service Degradation Policy
Incidents rarely recover because of one perfect switch. A team may reduce concurrency, watch the queue, pause non-critical work, try a fallback provider, then tighten or relax the posture as the system responds. I want those moves to be reviewed and reversible, but I do not want to redeploy the service for every adjustment during recovery.
Rototo fits the policy layer in that loop. The service still owns metrics, queues, retries, provider health, and enforcement. Rototo selects the reviewed operating policy from the runtime facts the service supplies.
We will model that as a workspace named degradation-config, with one variable named
service-degradation-policy.
Start With The Recovery Boundary
The runtime question is not "is the service healthy?" Rototo should not decide that. The service and observability system already know queue pressure, provider health, error rates, and retry behavior.
The runtime question I want is:
Given the service state we already measured, which reviewed operating policy
should this request use?
The first version of that policy can be small: run normally while pressure is normal, and reduce load when queue pressure is high.
Create The Workspace
Create the workspace with a variable and a resource template:
rototo init degradation-config --variable service-degradation-policy
rototo init degradation-config --resource service-degradation-policy
Replace degradation-config/variables/service-degradation-policy.toml:
schema_version = 1
description = "Operating policy selected while the service is under pressure"
type = "resource:service-degradation-policy"
[resolve]
default = "normal"
Replace degradation-config/resources/service-degradation-policy.toml:
schema_version = 1
description = "Service degradation policy objects"
schema = "../schemas/service-degradation-policy.schema.json"
The variable chooses a policy key. The resource validates the policy object behind that key. During an incident, the app should not have to trust a half-shaped object while operators are making fast changes.
Define The Policy Shape
Before adding policies, define the knobs the service is willing to apply.
Replace degradation-config/schemas/service-degradation-policy.schema.json:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": [
    "mode",
    "max_concurrency",
    "background_jobs_enabled",
    "non_critical_fanout",
    "fallback_provider"
  ],
  "properties": {
    "mode": { "type": "string", "enum": ["normal", "degraded", "severe"] },
    "max_concurrency": { "type": "integer", "minimum": 1, "maximum": 200 },
    "background_jobs_enabled": { "type": "boolean" },
    "non_critical_fanout": { "type": "string", "enum": ["send", "defer", "pause"] },
    "fallback_provider": { "type": "string", "enum": ["primary", "secondary"] }
  },
  "additionalProperties": false
}
The schema is deliberately operational. It says which fields the service will
honor, which modes are allowed, and how far concurrency can be pushed. If
someone tries to set max_concurrency = 0 during an incident, lint catches
that before the workspace is released.
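A service can also mirror these bounds as a defensive check before it applies a policy. The sketch below is an assumption about how a service might do that, not part of rototo: `check_policy` and `PolicyError` are hypothetical names, and the limits are copied from the schema above.

```rust
/// Hypothetical in-service mirror of the schema's constraints.
#[derive(Debug, PartialEq)]
enum PolicyError {
    UnknownMode(String),
    ConcurrencyOutOfRange(i64),
    UnknownFanout(String),
    UnknownProvider(String),
}

/// Checks the string and integer fields against the schema's enums and bounds.
/// `background_jobs_enabled` is a plain bool, so the type system covers it.
fn check_policy(
    mode: &str,
    max_concurrency: i64,
    non_critical_fanout: &str,
    fallback_provider: &str,
) -> Result<(), PolicyError> {
    if !matches!(mode, "normal" | "degraded" | "severe") {
        return Err(PolicyError::UnknownMode(mode.to_string()));
    }
    // Same bounds as "minimum": 1, "maximum": 200 in the schema.
    if !(1..=200).contains(&max_concurrency) {
        return Err(PolicyError::ConcurrencyOutOfRange(max_concurrency));
    }
    if !matches!(non_critical_fanout, "send" | "defer" | "pause") {
        return Err(PolicyError::UnknownFanout(non_critical_fanout.to_string()));
    }
    if !matches!(fallback_provider, "primary" | "secondary") {
        return Err(PolicyError::UnknownProvider(fallback_provider.to_string()));
    }
    Ok(())
}
```

With this check, `max_concurrency = 0` is rejected in the service for the same reason lint rejects it in the workspace.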
Add The First Policies
Rename the generated object file from
degradation-config/resources/service-degradation-policy-objects/default.toml
to degradation-config/resources/service-degradation-policy-objects/normal.toml,
then replace its contents:
mode = "normal"
max_concurrency = 100
background_jobs_enabled = true
non_critical_fanout = "send"
fallback_provider = "primary"
Create
degradation-config/resources/service-degradation-policy-objects/degraded.toml:
mode = "degraded"
max_concurrency = 30
background_jobs_enabled = false
non_critical_fanout = "defer"
fallback_provider = "primary"
Create
degradation-config/resources/service-degradation-policy-objects/severe.toml:
mode = "severe"
max_concurrency = 10
background_jobs_enabled = false
non_critical_fanout = "pause"
fallback_provider = "secondary"
The severe policy is not selected yet. I still like defining it early because
it gives reviewers a concrete recovery posture to inspect before the team needs
it under pressure.
Select Degraded During High Pressure
Now add the condition that moves the service from normal to degraded mode.
Create degradation-config/qualifiers/high-queue-pressure.toml:
schema_version = 1
description = "Service reports high queue pressure"
[[predicate]]
attribute = "service.queue_pressure"
op = "eq"
value = "high"
Update degradation-config/variables/service-degradation-policy.toml:
schema_version = 1
description = "Operating policy selected while the service is under pressure"
type = "resource:service-degradation-policy"
[resolve]
default = "normal"
[[resolve.rule]]
qualifier = "high-queue-pressure"
value = "degraded"
The app still decides when queue pressure is high. Rototo only turns that runtime fact into the reviewed policy object.
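One way the service could produce that fact is to classify its own queue depth before building the context. This is a sketch under assumptions: `classify_queue_pressure` and the `high_watermark` parameter are hypothetical, and a real service would likely combine more signals than depth alone.

```rust
/// Returns the label the service would pass as `service.queue_pressure`.
/// The watermark is a hypothetical, service-owned threshold.
fn classify_queue_pressure(queue_depth: usize, high_watermark: usize) -> &'static str {
    if queue_depth >= high_watermark {
        "high"
    } else {
        "normal"
    }
}
```

The threshold stays in the service. Only the resulting label crosses the boundary into the resolve context.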
Generate The First Context Contract
The qualifier introduced service.queue_pressure. Generate the context schema
after that path exists:
rototo init degradation-config --context
On this workspace, Rototo writes
degradation-config/schemas/context.schema.json:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "additionalProperties": true,
  "properties": {
    "service": {
      "additionalProperties": true,
      "properties": {
        "queue_pressure": { "type": "string" }
      },
      "type": "object"
    }
  },
  "type": "object"
}
Lint the workspace:
rototo lint degradation-config
Then resolve both paths.
Normal pressure selects normal:
rototo resolve degradation-config \
--variable service-degradation-policy \
--context service.queue_pressure=normal
value key: normal
value: {"background_jobs_enabled":true,"fallback_provider":"primary","max_concurrency":100,"mode":"normal","non_critical_fanout":"send"}
High pressure selects degraded:
rototo resolve degradation-config \
--variable service-degradation-policy \
--context service.queue_pressure=high
value key: degraded
value: {"background_jobs_enabled":false,"fallback_provider":"primary","max_concurrency":30,"mode":"degraded","non_critical_fanout":"defer"}
This is the first recovery move: reduce work everywhere that reports high pressure.
Try A Stronger Policy On A Bucket
Sometimes the first move is not enough. Queue depth keeps climbing, the
primary provider stays slow, or deferred work is still taking too much capacity.
The next move might be severe, but applying it to every account at once can
be more disruption than the team needs.
A bucket gives us a stable trial path. The same account stays in or out of the trial while the salt and range stay the same, so logs and support cases remain explainable.
Create degradation-config/qualifiers/degradation-trial-bucket.toml:
schema_version = 1
description = "Stable account bucket for trying a stronger recovery policy"
[[predicate]]
attribute = "account.id"
op = "bucket"
salt = "service-degradation-recovery-2026-06"
range = [0, 1000]
The bucket range is on a 0 to 10000 scale, so [0, 1000] is ten percent.
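To make the arithmetic concrete, here is a sketch of stable bucketing on a 0 to 10000 scale. The FNV-1a hash is a stand-in chosen for illustration; rototo's actual hash is not described here, so this sketch will not reproduce rototo's bucket numbers. `bucket_for`, `in_range`, and `range_percent` are hypothetical names, and treating the range as half-open (low inclusive, high exclusive) is an assumption.

```rust
const BUCKET_SCALE: u64 = 10_000;

/// 64-bit FNV-1a, used here only as an illustrative stable hash.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Maps (salt, id) to a bucket in 0..10000. Same inputs, same bucket.
fn bucket_for(salt: &str, id: &str) -> u64 {
    let key = format!("{}:{}", salt, id);
    fnv1a_64(key.as_bytes()) % BUCKET_SCALE
}

/// Assumed half-open check: low <= bucket < high.
fn in_range(bucket: u64, range: [u64; 2]) -> bool {
    range[0] <= bucket && bucket < range[1]
}

/// Share of the bucket space a range covers, in percent.
fn range_percent(range: [u64; 2]) -> f64 {
    (range[1] - range[0]) as f64 * 100.0 / BUCKET_SCALE as f64
}
```

Because the bucket depends only on the salt and the ID in this sketch, an account's assignment changes only when one of those inputs changes.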
Now compose the bucket with high pressure.
Create degradation-config/qualifiers/severe-recovery-trial.toml:
schema_version = 1
description = "High-pressure requests in the severe recovery trial bucket"
[[predicate]]
attribute = "qualifier.high-queue-pressure"
op = "eq"
value = true
[[predicate]]
attribute = "qualifier.degradation-trial-bucket"
op = "eq"
value = true
Update degradation-config/variables/service-degradation-policy.toml so the severe trial rule comes before the broader degraded rule:
schema_version = 1
description = "Operating policy selected while the service is under pressure"
type = "resource:service-degradation-policy"
[resolve]
default = "normal"
[[resolve.rule]]
qualifier = "severe-recovery-trial"
value = "severe"
[[resolve.rule]]
qualifier = "high-queue-pressure"
value = "degraded"
Rule order carries the recovery intent. High-pressure requests in the trial
bucket get severe; the rest of the high-pressure traffic stays on
degraded.
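The ordering can be sketched as a first-match scan over the rules, falling back to the default. This illustrates the behavior described above rather than rototo's implementation; `Rule`, `select_value_key`, and `policy_key` are hypothetical names, and the qualifiers are reduced to plain booleans.

```rust
/// One rule: a qualifier result and the value key it selects.
struct Rule {
    qualifier_matches: bool,
    value_key: &'static str,
}

/// Returns the key of the first matching rule, or the default if none match.
fn select_value_key(rules: &[Rule], default_key: &'static str) -> &'static str {
    rules
        .iter()
        .find(|rule| rule.qualifier_matches)
        .map(|rule| rule.value_key)
        .unwrap_or(default_key)
}

/// Mirrors the variable above: the severe trial rule first, then degraded.
fn policy_key(high_pressure: bool, in_trial_bucket: bool) -> &'static str {
    let rules = [
        Rule { qualifier_matches: high_pressure && in_trial_bucket, value_key: "severe" },
        Rule { qualifier_matches: high_pressure, value_key: "degraded" },
    ];
    select_value_key(&rules, "normal")
}
```

If the two rules were swapped, the broader high-pressure rule would match first and no request would ever reach `severe`.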
Regenerate The Context Contract
The bucket introduced account.id. Regenerate the context schema and review
the diff:
rototo init degradation-config --context --force
On this workspace, the regenerated
degradation-config/schemas/context.schema.json includes both runtime facts:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "additionalProperties": true,
  "properties": {
    "account": {
      "additionalProperties": true,
      "properties": {
        "id": { "type": ["boolean", "number", "string"] }
      },
      "type": "object"
    },
    "service": {
      "additionalProperties": true,
      "properties": {
        "queue_pressure": { "type": "string" }
      },
      "type": "object"
    }
  },
  "type": "object"
}
Lint again:
rototo lint degradation-config
Resolve The Trial Paths
acct-0001 is outside the severe trial bucket, so high pressure still selects
degraded:
rototo resolve degradation-config \
--variable service-degradation-policy \
--context service.queue_pressure=high \
--context account.id=acct-0001
test: bucket salt=service-degradation-recovery-2026-06 range=[0,1000] bucket=5274
value key: degraded
acct-001 is inside the bucket, so the same high-pressure state selects
severe:
rototo resolve degradation-config \
--variable service-degradation-policy \
--context service.queue_pressure=high \
--context account.id=acct-001
test: bucket salt=service-degradation-recovery-2026-06 range=[0,1000] bucket=540
value key: severe
value: {"background_jobs_enabled":false,"fallback_provider":"secondary","max_concurrency":10,"mode":"severe","non_critical_fanout":"pause"}
This is the second recovery move: try the stronger policy on a stable slice while the rest of the pressured traffic stays on the first degraded policy.
Iterate Through Review
Recovery may need a few variations. Because the policy lives in the workspace, each variation can be a small reviewed diff.
To widen the severe policy without reshuffling account assignment, keep the salt and expand the range in degradation-config/qualifiers/degradation-trial-bucket.toml:
[[predicate]]
attribute = "account.id"
op = "bucket"
salt = "service-degradation-recovery-2026-06"
range = [0, 3000]
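Growing the range from the same lower bound only adds accounts, assuming an account's bucket number depends on the salt and the ID but not on the range. The sketch below checks that over every bucket on the 0 to 10000 scale. `in_range` and `dropped_buckets` are hypothetical names, and the half-open range is an assumption.

```rust
const BUCKET_SCALE: u64 = 10_000;

/// Assumed half-open check: low <= bucket < high.
fn in_range(bucket: u64, range: [u64; 2]) -> bool {
    range[0] <= bucket && bucket < range[1]
}

/// Counts buckets that were inside `old` but fall outside `new`.
fn dropped_buckets(old: [u64; 2], new: [u64; 2]) -> usize {
    (0..BUCKET_SCALE)
        .filter(|&b| in_range(b, old) && !in_range(b, new))
        .count()
}
```

Widening from `[0, 1000]` to `[0, 3000]` drops no buckets, so every account already in the trial stays in it. Changing the salt instead would reassign buckets and reshuffle the trial.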
To make the severe policy stronger without widening it, change the policy object in degradation-config/resources/service-degradation-policy-objects/severe.toml:
mode = "severe"
max_concurrency = 5
background_jobs_enabled = false
non_critical_fanout = "pause"
fallback_provider = "secondary"
To roll back the trial, remove the severe-recovery-trial rule or narrow the
bucket range again. The service can refresh the workspace and apply the new
policy to future resolutions. If a refresh fails, the last successfully loaded
workspace stays active.
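The fallback behavior described above can be sketched as a last-known-good holder. This is not rototo's refresh API, which is not shown here; `LastKnownGood`, `apply_refresh`, and `current` are hypothetical names, and the workspace is reduced to a generic value.

```rust
/// Holds the most recent successfully loaded value.
struct LastKnownGood<T> {
    active: T,
    failed_refreshes: u32,
}

impl<T> LastKnownGood<T> {
    fn new(initial: T) -> Self {
        Self { active: initial, failed_refreshes: 0 }
    }

    /// Swaps in a freshly loaded value, or keeps the current one on failure.
    fn apply_refresh<E>(&mut self, loaded: Result<T, E>) -> Result<(), E> {
        match loaded {
            Ok(fresh) => {
                self.active = fresh;
                self.failed_refreshes = 0;
                Ok(())
            }
            Err(err) => {
                // Keep serving the previous value; surface the error to the caller.
                self.failed_refreshes += 1;
                Err(err)
            }
        }
    }

    /// The value future resolutions should use.
    fn current(&self) -> &T {
        &self.active
    }
}
```

The failure counter gives the service something to alert on, since a stale policy during an incident is worth knowing about even when it is safe.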
Rototo does not decide whether the variation worked. The service metrics, alerts, dashboards, and incident process still answer that. Rototo makes the policy change reviewed, typed, reproducible, and reversible.
Use The Policy In The App
The app should resolve the policy where it is about to apply concurrency, fanout, background work, or provider routing. It should pass facts it already knows: current service pressure and the account ID used for stable assignment.
use serde::Deserialize;
use rototo::{ResolveContext, Workspace};

// Mirrors the fields required by service-degradation-policy.schema.json.
#[derive(Debug, Deserialize)]
struct DegradationPolicy {
    mode: String,
    max_concurrency: u64,
    background_jobs_enabled: bool,
    non_critical_fanout: String,
    fallback_provider: String,
}

// Resolves the reviewed policy from facts the service has already measured.
async fn degradation_policy(
    workspace: &Workspace,
    queue_pressure: &str,
    account_id: &str,
) -> Result<DegradationPolicy, Box<dyn std::error::Error>> {
    // Runtime facts only: the service decides what counts as high pressure.
    let context = ResolveContext::from_json(serde_json::json!({
        "service": {
            "queue_pressure": queue_pressure
        },
        "account": {
            "id": account_id
        }
    }))?;

    let resolution = workspace
        .resolve_variable("service-degradation-policy", &context)
        .await?;

    // Copy the key before the value is moved into the typed policy.
    let value_key = resolution.value_key.clone();
    let policy: DegradationPolicy = serde_json::from_value(resolution.value)?;

    // Record which policy was selected and which workspace source it came from.
    println!(
        "selected service-degradation-policy `{}` from {:?}",
        value_key,
        workspace.source_fingerprint()
    );

    Ok(policy)
}
The selected policy is not the incident state. It is one input to the service's own backpressure and routing behavior.
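As a sketch of that idea, the service might treat the policy's concurrency as a ceiling while still honoring its own limit, and gate background work on both the policy and local capacity. `effective_concurrency`, `should_run_background_job`, and their parameters are hypothetical; how the service computes its own limits is outside this tutorial.

```rust
/// The policy sets a ceiling; the service's own limit can still be lower.
fn effective_concurrency(policy_max: u64, local_limit: u64) -> u64 {
    policy_max.min(local_limit)
}

/// Background work runs only if the policy allows it and the service has room.
fn should_run_background_job(policy_enabled: bool, local_capacity_ok: bool) -> bool {
    policy_enabled && local_capacity_ok
}
```

Under the `degraded` policy above, a service with a local limit of 100 would run at 30, and a service whose own backpressure has already cut it to 12 would stay at 12.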
Keep The Control Loop Clear
At this boundary, Rototo should own:
- reviewed degradation modes;
- concurrency and fanout policy;
- fallback-provider selection;
- stable buckets for trying a stronger recovery posture;
- reversible changes during recovery.
Keep these in the service, observability system, or incident process:
- queue depth measurement;
- provider health detection;
- retry scheduling;
- per-request execution;
- metrics that show whether recovery is working;
- incident ownership and customer communication.
That is the split that keeps recovery sane. The service keeps running the live control loop. Rototo gives the team reviewed policy it can change, observe, widen, tighten, and roll back without changing the application binary.