
MLX chat model lifecycle (Apple Silicon)

  • On-device MLX runtime


    The MLX chat backend runs locally on Apple Silicon. Weights and LoRA adapters are loaded on demand and kept hot while in use.

  • Idle unload window


    After a period of inactivity, ragweld can release MLX weights to free memory. A new request reloads them automatically.

  • Single pending timer


    Re-scheduling the idle unload cancels the prior timer so only one unload task is ever pending. This avoids “timer pileups” and premature unloads.

  • Operator-safe defaults


    If you’re not sure, keep the default idle window. Tune it only if you see memory pressure or frequent reloads in logs.


Pydantic is the law

Lifecycle behavior is driven by configuration and model adapters defined in server/models/tribrid_config_model.py and the MLX adapter in server/chat/ragweld_mlx.py. If a knob isn’t in Pydantic, it isn’t officially supported.
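
Since the exact fields of the Pydantic model aren't shown here, the following is a minimal stand-in (using a stdlib dataclass so it runs anywhere) that illustrates the shape of the lifecycle knob; the field name `idle_unload_seconds` and its default are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class MLXLifecycleConfig:
    """Illustrative stand-in for the real Pydantic model in
    server/models/tribrid_config_model.py; the field name and
    default used here are hypothetical."""

    # Seconds of inactivity before MLX weights are released.
    # Non-positive values disable idle unload entirely.
    idle_unload_seconds: float = 300.0

    @property
    def idle_unload_enabled(self) -> bool:
        return self.idle_unload_seconds > 0


if __name__ == "__main__":
    print(MLXLifecycleConfig().idle_unload_enabled)            # True
    print(MLXLifecycleConfig(idle_unload_seconds=0).idle_unload_enabled)  # False
```

The real configuration is Pydantic-validated; consult the model file for the supported field names.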

Why MLX lifecycle management exists

On Apple Silicon, the MLX backend provides a fast, on-device chat model experience. However, large weights consume substantial unified memory. ragweld manages the model’s lifecycle to balance:

  • responsiveness (keep weights hot during active use)
  • stability (release memory when idle)
  • correctness (never unload while generations are running)

What “idle unload” does


Idle unload window
The number of seconds of inactivity after which ragweld unloads MLX weights. New activity resets the countdown.

In-use guard
The model will not unload while generations are running. Active use cancels or defers unloading.

Single pending timer
Only one unload task is allowed to be scheduled at a time. If a new unload schedule is requested, the previously scheduled one is cancelled.

Generation safety
The unload routine verifies that the conditions that triggered the timer still hold (still idle) before it frees memory.
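
The four behaviors above can be sketched as a small asyncio class. This is an illustrative re-implementation, not the adapter's actual code (which lives in server/chat/ragweld_mlx.py); all names here are hypothetical:

```python
import asyncio


class MLXLifecycle:
    """Sketch of the idle-unload scheduler described above."""

    def __init__(self, idle_seconds: float):
        self.idle_seconds = idle_seconds
        self._in_use = 0          # in-use guard: count of active generations
        self._unload_task = None  # single pending timer
        self._loaded = True       # stand-in for "weights resident in memory"

    def begin_generation(self) -> None:
        self._in_use += 1
        self._cancel_pending()    # active use cancels a pending unload

    def end_generation(self) -> None:
        self._in_use -= 1
        self.schedule_unload()

    def schedule_unload(self) -> None:
        if self.idle_seconds <= 0:   # non-positive disables idle unload
            return
        self._cancel_pending()       # only one timer is ever pending
        loop = asyncio.get_running_loop()
        self._unload_task = loop.create_task(self._unload_later())

    def _cancel_pending(self) -> None:
        if self._unload_task is not None:
            self._unload_task.cancel()
            self._unload_task = None

    async def _unload_later(self) -> None:
        await asyncio.sleep(self.idle_seconds)
        # Generation safety: re-check that we are still idle before freeing.
        if self._in_use == 0:
            self._loaded = False  # stand-in for releasing MLX weights


async def demo() -> "MLXLifecycle":
    lc = MLXLifecycle(idle_seconds=0.01)
    lc.schedule_unload()
    lc.schedule_unload()       # re-schedule: the first timer is cancelled
    lc.begin_generation()      # in-use guard cancels the pending unload
    assert lc._unload_task is None
    lc.end_generation()        # idle again: a fresh timer is scheduled
    await asyncio.sleep(0.05)  # let the timer fire
    return lc


if __name__ == "__main__":
    lc = asyncio.run(demo())
    print("unloaded:", not lc._loaded)
```

Note the re-check inside `_unload_later`: cancellation alone is not sufficient, because a request could begin between the timer firing and the weights being freed.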

If you’re not sure, do this

  • Leave the idle unload window at its default.
  • Revisit only if you observe frequent model reloads during normal usage or memory pressure during idle periods.

Operator expectations (signals and failure modes)

  • Logs: You should see clear log messages when MLX weights are loaded and when they are unloaded due to idleness.
  • Memory telemetry: Expect memory to remain elevated while the model is in use, then drop after the idle window elapses. On Apple Silicon, “returning” memory can lag in system monitors due to allocator behavior.
  • Cold-starts: After an unload, the next request will re-load weights. This may add a short cold-start delay.

Don’t fight the scheduler with external timers

Avoid implementing your own unload timers in sidecars or wrappers. The built-in scheduler already guarantees a single pending timer and guards against unloading during active use.

Tuning guidance

Use these rules of thumb when adjusting the idle window:

  • Interactive chats (human-in-the-loop)

  • Favor a slightly longer idle window to avoid reloading between short pauses.
  • Tradeoff: higher memory residency while a user is present.

  • Batch or scheduled usage

  • A shorter idle window can free memory sooner between jobs.
  • Tradeoff: jobs that arrive sporadically may pay more frequent cold-starts.

Disabling idle unload (advanced)

The MLX adapter’s internal scheduler treats non-positive values as “no idle unload”. This is useful for high-throughput runs where reloading would be a consistent tax. Only disable if you have the memory headroom.

What just changed (and why it matters)

ragweld now explicitly tracks the pending idle-unload task for the MLX chat model and cancels any previous one when a new schedule is requested. Practically:

  • multiple schedule calls result in only one active timer
  • the most recent schedule defines the next unload time
  • fewer stray asyncio tasks and less chance of unloading earlier than intended

This is validated by a unit test that asserts only a single pending timer exists after multiple re-schedules. The behavior lives in server/chat/ragweld_mlx.py and is covered by tests/unit/test_ragweld_mlx_lifecycle.py.
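
The assertion at the core of that test can be re-created in isolation. The sketch below uses a bare coroutine in place of the adapter's unload routine (the real test in tests/unit/test_ragweld_mlx_lifecycle.py exercises the adapter itself):

```python
import asyncio


async def fake_unload() -> None:
    # Stands in for the adapter's idle-unload coroutine.
    await asyncio.sleep(3600)


async def main() -> int:
    created = []
    pending = None
    for _ in range(5):          # five back-to-back re-schedules
        if pending is not None:
            pending.cancel()    # the scheduler cancels the prior timer
        pending = asyncio.get_running_loop().create_task(fake_unload())
        created.append(pending)
    await asyncio.sleep(0)      # let cancellations settle
    alive = [t for t in created if not t.done()]
    for t in alive:
        t.cancel()              # clean up before the loop exits
    return len(alive)           # expected: exactly one pending timer


if __name__ == "__main__":
    print("pending timers:", asyncio.run(main()))
```

Without the `pending.cancel()` call, all five timers would remain alive, which is exactly the "timer pileup" failure mode the change prevents.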

Operational impact

  • More predictable memory release timing under bursty usage
  • Fewer background tasks to manage in long-running processes
  • Lower risk of premature unloads during active sessions

Troubleshooting checklist

  • Weights never seem to unload

  • Ensure the model isn’t constantly in use (look for active generations).
  • Confirm the idle window isn’t disabled (advanced configurations may set it to non-positive).

  • Model reloads too often

  • Increase the idle window.
  • Confirm that request bursts aren’t spaced just beyond the current window.

  • Memory doesn’t drop immediately after unload

  • Apple Silicon’s unified memory and allocator behavior may delay visible release in some tools. Correlate with logs.

Deep dive: what happens under the hood

  • During generation, the adapter tracks an “in use” count and last-used timestamp.
  • When generation completes, the adapter schedules an unload after the configured idle window.
  • If another request arrives, the adapter cancels the pending unload task and reschedules it after the new activity completes.
  • When the timer fires, the adapter re-checks whether it’s still safe to unload before actually freeing the weights.

  • Operations overview: practical monitoring and health surfaces
    See Operations & metrics.

  • Observability: tracing, metrics, and dashboards
    See Observability.

  • API-first integration: where to call from your app
    See API.

File paths for engineers