Ollama v0.17.1: Nemotron Support, MLX Memory Fixes, Web Search Detection
Ollama v0.17.1 landed on February 24, 2026. The release extends model architecture coverage with native Nemotron support in the inference engine, tightens memory handling for Apple Silicon users running the MLX backend, and wires up automatic web search capability detection for models that expose tool-use interfaces [1].
Beyond the headline features, the release also addresses quantization behavior: ollama create no longer defaults to affine quantization for unquantized models on the MLX engine. LFM2 and LFM2.5 model compatibility received additional fixes. A new configuration flag lets users disable automatic update downloads—useful for air-gapped or version-pinned deployments [1].
What Shipped
| Change | Area | Impact |
|---|---|---|
| Nemotron architecture support | Engine | Run NVIDIA Nemotron models natively without external conversion |
| MLX memory improvements | Apple Silicon | Lower RAM footprint during inference on M-series chips |
| Web search capability detection | Tools | App auto-surfaces web search for tool-capable models |
| LFM2 / LFM2.5 fixes | Engine | Better compatibility with these model families |
| Affine quantization default removed | MLX | ollama create no longer force-quantizes unquantized models on MLX |
| Disable auto-update config | Operations | Opt out of automatic update downloads |
Nemotron Architecture
NVIDIA’s Nemotron family includes instruction-tuned models built on a modified transformer architecture. Native support in Ollama’s engine means these models can now run locally through standard ollama run commands without GGUF conversion workarounds or third-party adapters. This is relevant for teams evaluating Nemotron variants for on-premise deployment.
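Assuming a Nemotron build is published to the Ollama registry, running one locally follows the usual pull-and-run flow. The tag below is a placeholder, not a confirmed model name; check the registry for what actually ships:

```shell
# Pull and run a Nemotron model directly -- no GGUF conversion step needed.
# "nemotron-mini" is an illustrative tag; substitute a real registry name.
ollama pull nemotron-mini
ollama run nemotron-mini "Summarize the trade-offs of on-premise LLM deployment."
```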
MLX and Apple Silicon
The MLX backend—Ollama’s Apple Silicon inference path—received two related changes. Memory usage during inference is reduced, and the ollama create command no longer applies affine quantization by default to unquantized models. The second change matters for users who want to run full-precision models on high-memory M-series machines without unexpected quality degradation from automatic quantization.
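With the new default, importing a full-precision model on MLX keeps it unquantized unless you ask otherwise. A sketch of both paths, with an illustrative model name and Modelfile path (the `--quantize` flag on `ollama create` is the documented opt-in):

```shell
# Import at full precision: on the MLX engine, ollama create no longer
# applies affine quantization to unquantized weights by default.
ollama create my-model -f ./Modelfile

# Opt in explicitly when you do want a smaller memory footprint.
ollama create my-model-q4 -f ./Modelfile --quantize q4_K_M
```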
Web Search Detection
Models that advertise tool-use capabilities now trigger automatic web search availability in the Ollama application. This is a UI/API-level change: the runtime detects whether a loaded model supports function calling and exposes web search as an available tool. No manual configuration is required; load a tool-capable model and the option appears.
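The detection can be pictured as a simple capability check against model metadata, in the spirit of the `capabilities` list that Ollama's `/api/show` endpoint returns. This is an illustrative sketch, not Ollama's actual implementation; the function name and dict shape are assumptions:

```python
def supports_web_search(model_info: dict) -> bool:
    """Return True if a model advertises tool use (function calling).

    Mirrors the idea behind the new detection: a tool-capable model gets
    web search surfaced automatically; other models do not.
    """
    capabilities = model_info.get("capabilities", [])
    return "tools" in capabilities


# A tool-capable model surfaces web search; a plain completion model does not.
tool_model = {"name": "example-tool-model", "capabilities": ["completion", "tools"]}
plain_model = {"name": "example-base-model", "capabilities": ["completion"]}

print(supports_web_search(tool_model))   # True
print(supports_web_search(plain_model))  # False
```

The key design point is that the check is driven by metadata the model itself reports, so no per-model allowlist is needed in the application.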
Operational Notes
The new auto-update disable flag is worth noting for production environments. Air-gapped setups, CI pipelines using pinned Ollama versions, and teams with change-control policies can now suppress background update downloads at the configuration level rather than relying on network-level blocks.
Full diff: v0.17.0 to v0.17.1 [1].
References
---
Configuration details reflect a production environment at time of writing. Implementation specifics vary based on tooling versions, platform updates, and organizational requirements. Validate approaches against current documentation before deployment.