Ollama v0.17.1: Nemotron Support, MLX Memory Fixes, Web Search Detection

ollama, local-inference, mlx, nemotron, open-source

Ollama v0.17.1 landed on February 24, 2026. The release extends model architecture coverage with native Nemotron support in the inference engine, tightens memory handling for Apple Silicon users running the MLX backend, and wires up automatic web search capability detection for models that expose tool-use interfaces [1].

Beyond the headline features, the release also addresses quantization behavior: ollama create no longer defaults to affine quantization for unquantized models on the MLX engine. LFM2 and LFM2.5 model compatibility received additional fixes. A new configuration flag lets users disable automatic update downloads—useful for air-gapped or version-pinned deployments [1].

What Shipped

| Change | Area | Impact |
| --- | --- | --- |
| Nemotron architecture support | Engine | Run NVIDIA Nemotron models natively without external conversion |
| MLX memory improvements | Apple Silicon | Lower RAM footprint during inference on M-series chips |
| Web search capability detection | Tools | App auto-surfaces web search for tool-capable models |
| LFM2 / LFM2.5 fixes | Engine | Better compatibility with these model families |
| Affine quantization default removed | MLX | ollama create no longer force-quantizes unquantized models on MLX |
| Disable auto-update config | Operations | Opt out of automatic update downloads |

Nemotron Architecture

NVIDIA’s Nemotron family includes instruction-tuned models built on a modified transformer architecture. Native support in Ollama’s engine means these models can now run locally through standard ollama run commands without GGUF conversion workarounds or third-party adapters. This is relevant for teams evaluating Nemotron variants for on-premise deployment.
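In practice, running a Nemotron model now looks the same as running any other supported architecture. A minimal sketch, assuming a Nemotron tag is available in the Ollama library (verify the exact tag before pulling):

```shell
# Pull and run a Nemotron model directly -- no GGUF conversion step.
# The tag below is illustrative; check the Ollama model library for
# the Nemotron variants actually published there.
ollama pull nemotron-mini
ollama run nemotron-mini "Summarize the key changes in our deploy script."
```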

MLX and Apple Silicon

The MLX backend—Ollama’s Apple Silicon inference path—received two related changes. Memory usage during inference is reduced, and the ollama create command no longer applies affine quantization by default to unquantized models. The second change matters for users who want to run full-precision models on high-memory M-series machines without unexpected quality degradation from automatic quantization.
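The quantization change means precision is now preserved unless you ask otherwise. A sketch of both paths, with illustrative paths and tags (the `--quantize` flag is Ollama's existing opt-in mechanism for ollama create):

```shell
# Create a model from local unquantized weights on the MLX backend.
# As of v0.17.1, this keeps the source precision by default.
cat > Modelfile <<'EOF'
FROM ./my-model-unquantized
EOF

# Stays at source precision (no automatic affine quantization):
ollama create my-model-fp16 -f Modelfile

# Quantization is now an explicit opt-in:
ollama create my-model-q4 -f Modelfile --quantize q4_K_M
```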

Web Search Detection

Models that advertise tool-use capabilities now trigger automatic web search availability in the Ollama application. This is a UI/API-level change: the runtime detects whether a loaded model supports function calling and exposes web search as an available tool. No manual configuration required—load a tool-capable model, and the option appears.
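The detection keys off the same tool-use interface that API clients exercise directly. The sketch below builds a chat request in the shape of Ollama's /api/chat tools schema; the web_search tool definition itself is illustrative, an assumption standing in for whatever the app injects when it detects tool support:

```python
import json

# Sketch of an /api/chat request against a tool-capable model.
# The "tools" envelope follows Ollama's chat API; the web_search
# function definition is illustrative, not the app's actual schema.
payload = {
    "model": "llama3.1",  # any tool-capable model tag
    "messages": [
        {"role": "user", "content": "What changed in Ollama v0.17.1?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web for current information",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }
    ],
}

# POST this to http://localhost:11434/api/chat on a running Ollama
# instance; models without tool support simply never emit tool calls.
print(json.dumps(payload, indent=2))
```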

Operational Notes

The new auto-update disable flag is worth noting for production environments. Air-gapped setups, CI pipelines using pinned Ollama versions, and teams with change-control policies can now suppress background update downloads at the configuration level rather than relying on network-level blocks.
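The release notes summarized here do not name the flag, so the sketch below uses a placeholder; substitute the setting documented for your platform before relying on it:

```shell
# Hypothetical sketch -- v0.17.1 adds a way to disable automatic
# update downloads, but the exact knob is not named in this summary.
# The variable below is a placeholder, not the real setting; check
# `ollama serve --help` or the desktop app settings for the actual name.
export OLLAMA_DISABLE_AUTO_UPDATE=1   # placeholder, not verbatim
ollama serve
```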

Full diff: v0.17.0 to v0.17.1 [1].

References

  1. Ollama v0.17.1 Release Notes — GitHub

---

Configuration details reflect a production environment at time of writing. Implementation specifics vary based on tooling versions, platform updates, and organizational requirements. Validate approaches against current documentation before deployment.