Ollama v0.17.5 Patches Qwen 3.5 Crashes and MLX Memory Issues

ollama, qwen, local-inference, mlx, bug-fix

Ollama shipped v0.17.5 on March 12, 2026, a patch release targeting stability problems introduced with Qwen 3.5 support. The release fixes two distinct Qwen 3.5 issues — a crash when model layers are split between GPU and CPU, and a repetition loop caused by the absence of a presence penalty parameter. Users who previously downloaded Qwen 3.5 weights may need to re-pull them (e.g., ollama pull qwen3.5:35b) to pick up the corrected model configuration.
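Assuming a standard macOS/Linux install, the update-and-re-pull sequence looks like the sketch below; the model tag follows the example in the release notes, and the install-script route is one of several ways to update (the desktop app and package managers update differently).

```shell
# Update the Ollama binary via the official install script
# (one option among several; desktop app users update in-app)
curl -fsSL https://ollama.com/install.sh | sh

# Re-pull the Qwen 3.5 weights to pick up the corrected model metadata.
# Pulls are layer-based, so unchanged layers are not re-downloaded.
ollama pull qwen3.5:35b
```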

The MLX engine also received attention: memory-related crashes are resolved, and ollama run --verbose now reports peak memory consumption during MLX inference. A separate fix addresses failures when importing Qwen 3.5 models from standalone GGUF files.

Qwen 3.5 Model Support

This release coincides with the availability of the Qwen 3.5 “small” series in Ollama’s library, spanning four parameter counts: 0.8B, 2B, 4B, and 9B. The GPU/CPU split crash would have affected anyone running these models on hardware where VRAM is insufficient to hold all layers — a common scenario for the 9B variant on consumer GPUs with 8 GB of VRAM or less.
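To check whether a loaded model is actually split between CPU and GPU (and was therefore exposed to the crash), `ollama ps` reports the processor split for each loaded model. A minimal sketch, with an illustrative model tag and illustrative output values:

```shell
# Load the model with a short generation, then inspect how it is placed.
# The PROCESSOR column shows the CPU/GPU layer split.
ollama run qwen3.5:9b "warm up" >/dev/null
ollama ps
# Illustrative output -- actual sizes and percentages depend on hardware:
# NAME          ID         SIZE      PROCESSOR          UNTIL
# qwen3.5:9b    <digest>   7.5 GB    45%/55% CPU/GPU    4 minutes from now
```

A model showing anything other than 100% GPU or 100% CPU is running in the mixed configuration this release fixes.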

The repetition bug is worth noting: Qwen 3.5 model definitions shipped without a presence penalty value, causing the model to loop on generated tokens. This is a metadata-level fix, which is why re-pulling the model files is necessary rather than just updating the Ollama binary.
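On builds that predate the fixed metadata, the looping behavior can also be worked around client-side by setting `presence_penalty` explicitly in the request's `options` map, which Ollama's `/api/generate` endpoint accepts. A minimal sketch of building such a request; the model tag, prompt, and penalty value are illustrative:

```python
import json

# Build a /api/generate request body that sets presence_penalty
# explicitly, overriding whatever the model metadata provides.
payload = {
    "model": "qwen3.5:9b",          # illustrative tag
    "prompt": "Summarize the v0.17.5 release notes.",
    "stream": False,
    "options": {
        # Penalizes tokens already present in the output so far;
        # 1.0 is an illustrative value, not a recommended default.
        "presence_penalty": 1.0,
    },
}

body = json.dumps(payload)
# POST `body` to http://localhost:11434/api/generate with any HTTP client.
```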

MLX Engine Fixes

Fix                    | Detail
---------------------- | ----------------------------------------------------------------------
Memory crashes         | Resolved instability in the MLX runner during inference
Peak memory reporting  | --verbose flag now surfaces peak memory usage for MLX runs
GGUF import            | Qwen 3.5 GGUF files imported from external sources now load correctly
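The GGUF import path this release repairs is the standard Modelfile workflow: point FROM at a local GGUF file and register it with ollama create. A sketch, with an illustrative filename and model name:

```shell
# Register a standalone Qwen 3.5 GGUF file with Ollama.
# The filename and model name are illustrative.
cat > Modelfile <<'EOF'
FROM ./qwen3.5-9b-q4_k_m.gguf
EOF

ollama create qwen3.5-local -f Modelfile
ollama run qwen3.5-local "hello"
```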

MLX is Ollama’s inference backend for Apple Silicon. The memory reporting addition is a practical improvement for users profiling model performance on M-series hardware.
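The new peak-memory figure appears alongside the existing timing statistics that --verbose prints after a generation. A sketch with an illustrative model tag:

```shell
# On Apple Silicon, --verbose now includes peak memory consumption
# for the MLX run in the post-generation statistics.
ollama run --verbose qwen3.5:4b "Explain KV caching in one sentence."
```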

Implications

For anyone running local inference with Qwen 3.5 models — particularly in mixed GPU/CPU configurations — this is a mandatory update. The re-pull requirement means automation pipelines that pin model versions should account for the updated weights. MLX users on macOS gain better observability into memory consumption, which matters for capacity planning on memory-constrained laptops.

The full diff is available in the changelog between v0.17.4 and v0.17.5 [1].

References

  1. Ollama v0.17.5 Release Notes
  2. Qwen 3.5 on Ollama Library

---

Configuration details reflect a production environment at time of writing. Implementation specifics vary based on tooling versions, platform updates, and organizational requirements. Validate approaches against current documentation before deployment.