Gemma 4 26B A4B is an instruction-tuned Mixture-of-Experts model from Google DeepMind. About 25B parameters total with roughly 3.8B active per token, so you get strong quality with lower inference cost than a dense model of similar capability. It supports multimodal input (text, images, and short video), function calling, and a 23K context window.
Gemma 4 26B A4B - NVIDIA RTX 4090
Why teams pick Gemma 4 26B A4B over Claude Sonnet 4.5
Run it on your own GPU with predictable flat pricing — no per-token API meter running in the background.
Apache 2.0 weights you can fine-tune, audit, and keep inside your network instead of routing prompts through a hosted API.
MoE architecture activates only 3.8B parameters per token, so you get strong reasoning quality without paying for a full dense 27B+ API bill.
Gemma 4 26B A4B IT is an instruction-tuned Mixture-of-Experts (MoE) model from Google DeepMind. Despite 25.2B total parameters, only 3.8B activate per token during inference — delivering near-31B quality at a fraction of the compute cost. Supports multimodal input including text, images, and video (up to 60s at 1fps). Features a 256K token context window, native function calling, configurable thinking/reasoning mode, and structured output support. Released under Apache 2.0.
Model Cost Across Durations
Live pack pricing vs typical API estimates from 11 hours through 1 month.
API estimates for GPT-4.1 and Claude Sonnet 4.5 vs Gemma 4 26B A4B on RTX 4090.
Time pack
24 hours cost
$22
Lowest
24 hours cost
$33.87
Save $11.87 vs our model
24 hours cost
$58.06
Save $36.06 vs our model
Models in chart
- Gemma 4 26B A4B on RTX 4090
- GPT-4.1
- Claude Sonnet 4.5
At a glance
Benchmarks
Performance metrics for Gemma 4 26B A4B (Reasoning). Source: Artificial Analysis.
Performance indexes
Benchmark scores
Apps & integrations
Choose an app below. Each guide shows how to point the app at your OpenAI-compatible endpoint.
FAQ
Frequently asked questions
Common questions about Gemma 4 26B A4B, deployment, and using it on OpenLLM Buddy.
6 questions
Click Deploy on this page or open the console, pick a time pack, and launch an instance. When the pod is ready, you receive an OpenAI-compatible chat-completions URL and an API key scoped to your deployment.
Gemma 4 26B A4B is released under . You can run it on OpenLLM Buddy for production workloads; check the upstream model card for any additional attribution or usage notes from Google.
Use the OpenAI Chat Completions format against your instance URL. Set the model field to gemma4:26b in the JSON body—the same identifier shown at the top of this page. Official OpenAI SDKs work with a custom base URL and your API key.
You pay a flat rate per deployment pack (for example hourly or daily), not per token. See the Pricing tab for current pack prices. Token usage is metered for fairness and limits, but billing is tied to the pack you choose at deploy time.
Yes. The model supports configurable thinking/reasoning modes and native function calling where enabled in your request. Check the API tab for sample payloads and the Specs tab for capability details.
Ready to try it? Deploy Gemma 4 26B A4B · Browse models