Tuning local models

goose is ferrus's local-model-friendly backend: it's MCP-native, needs no project config file, and works well against a local LM Studio or Ollama endpoint set with goose configure (see Supported agents). opencode can also drive a local model, but only for the supervisor/ reviewer role today.

Dogfooding ferrus against local models — most extensively Qwen3-30B-A3B as the executor, with a Gemma MoE variant tested alongside it — turned up a handful of tuning knobs that matter far more than model choice for whether the executor loop stays stable. Qwen3-30B-A3B held up well end to end; the Gemma variant was workable but drifted on specific edge-case tasks. None of this is ferrus-specific — it's how these models are meant to be run — but it's easy to leave on defaults that actively hurt tool-calling reliability in an agent loop.

Why default settings hurt agent loops specifically

A chat session tolerates a model that occasionally rambles or repeats itself — you just re-prompt it. An unattended executor can't: a malformed tool call, a repetition loop, or a dropped <think> block turns into a wasted turn, a --max-tool-repetitions trip, or a respawned session (see max_executor_dispatches). The knobs below are ranked roughly by how much they reduce that failure mode, not by how much they improve raw benchmark scores.

1. Sampling — the most underrated lever

Qwen3 in thinking mode has official recommended sampling parameters. If you're running below them, this is usually the highest-leverage fix:

Parameter	Recommended
`temperature`	0.6
`top_p`	0.95
`top_k`	20
`min_p`	0
`presence_penalty`	1.0–1.5

A temperature noticeably below 0.6 (e.g. 0.4) is below what Qwen3 was tuned for, and the model's own guidance warns that low temperature in thinking mode provokes looping and repeated output. If you're seeing --max-tool-repetitions trip or unexplained executor respawns, check sampling before anything else. presence_penalty in the 1.0–1.5 range pushes back directly on loop behavior; above ~1.5 risks language mixing in the output.

2. Context length — probably hurting you for free

Qwen3-30B-A3B's native context is 32K, extendable via YaRN to 131K. Setting context far beyond that (e.g. 256K) forces aggressive RoPE scaling that measurably degrades quality on short tasks — and most agent tasks are short relative to the window. Unless a task genuinely needs a huge context, 32–64K is both more correct and frees memory for a better quantization (below). This is close to a free win: better output and lower memory pressure from one change.

3. Quantization: 4-bit → 6-bit (not 8-bit)

Qwen3-30B-A3B is a mixture-of-experts model with only ~3B active parameters per forward pass. MoE models with a small active path are hit harder by quantization than a dense model of the same total size — few weights do the work on any given token, so each one "weighs more."

4 → 6-bit (Q6_K / 6-bit MLX) is a real jump, recovering most of the gap to fp16. It mainly cuts "dumb" mistakes and malformed tool-call formatting — it sharpens reliability, it doesn't add capability.
6 → 8-bit is marginal and rarely worth the extra memory/speed cost.

Rough memory budget (weights only, plus KV cache on top): 4-bit ≈ 17GB, 6-bit ≈ 25GB, 8-bit ≈ 32GB. Freeing headroom by trimming context (above) is what usually makes 6-bit affordable.

4. Structured output (constrained decoding)

Constraining tool-call output to a schema reduces malformed/invalid calls, which directly means fewer wasted turns and fewer executor respawns — this affects format reliability, not reasoning quality. Worth enabling for goose, which drives native tool-calling. The tradeoff: an overly rigid grammar can occasionally constrain reasoning quality, so A/B it on a representative task rather than assuming it's a free win.

5. Thinking mode vs. preserving thinking — two different toggles

These are easy to conflate and have opposite recommendations:

Thinking mode itself (enable_thinking / /think) should be on for anything beyond trivial edge cases — it gives a real correctness boost on subtle contract-level bugs. Double-check your runtime (goose, LM Studio, …) isn't silently disabling it.
Preserving thinking — carrying previous turns' <think> blocks forward in multi-turn history — should stay off. Qwen3's own guidance says to strip reasoning from history in multi-turn use: the model isn't trained to consume its own past thinking, and it just bloats context for no benefit.

6. Smaller wins

KV-cache quantization (Q8) — close to lossless, frees meaningful memory for the model weights or context.
Flash attention — enable it; no real downside on supported hardware.
Spend a fixed memory budget in this order: 6-bit weights → 32–64K context → Q8 KV cache, rather than maxing out any one of them first.

Recommended starting point

Parameter	Common default	Try instead
Quantization	4-bit	6-bit
Context	256K	32–64K
`temperature`	0.4	0.6
`top_p` / `top_k` / `min_p`	unset	0.95 / 20 / 0
`presence_penalty`	0	1.0–1.5
Structured output	off	on (A/B on a real task)
Preserve thinking	off	leave off
Thinking mode	unverified	confirm it's on

How to A/B this

Change one variable at a time and re-run the same representative executor task before changing the next — otherwise you can't tell which change actually helped. Rough order of expected impact:

Sampling per the table above (especially presence_penalty) + trimming context to 32–64K
4-bit → 6-bit quantization
Structured output

Leave "preserve thinking" off throughout.

These knobs raise the reliability ceiling — fewer malformed calls, fewer loops, fewer respawns. They don't raise the reasoning ceiling: on subtle, contract-level bugs, the biggest remaining lever is still sharpening the task description itself, the same way it would be for a hosted model.

Gemma and other local models

The same shape of tuning applies to other local backends, but check the specific model's own recommended sampling defaults rather than reusing Qwen3's numbers verbatim — they differ by model family. If a model performs well generally but "drifts" on specific edge cases, that's usually a sign to revisit sampling and context first before concluding the model itself is the limiting factor.

Why default settings hurt agent loops specifically​

1. Sampling — the most underrated lever​

2. Context length — probably hurting you for free​

3. Quantization: 4-bit → 6-bit (not 8-bit)​

4. Structured output (constrained decoding)​

5. Thinking mode vs. preserving thinking — two different toggles​

6. Smaller wins​

Recommended starting point​

How to A/B this​

Gemma and other local models​